Tuesday, October 15, 2013

Reading empty MS Excel cells using OpenXML sdk

When we normally read the excel cells using open XML it will omit the empty cells. This is because the open xml is not keeping the information about the empty cells in the xml. We can verify this by doing a simple exercise mentioned below

  1. Open MS Excel 2007 or higher version and create a new excel document
  2. Add some text in the cell B2
  3. Save the file. Make sure you are saving in .xlsx format
  4. Once its saved change the .xlsx extension to .zip
  5. Extract the .zip file contents
  6. You can see the xml file where its recording only the value for cell B2. A1,A2,B1 are not at all mentioned in the xml.
Lets come back to the topic which is reading the excel file using Open XML sdk. The normal procedure is to iterate through each excel Row and do second iteration using Row.Elements<Cell>() .In the above mentioned file if we do the iteration the cell count will be 1 when we iterate second row. But if our requirement is to read and convert to DataSet or to create object model from excel, this behaviour may cause issues as the blank cells of excel are not considered by Open XML. I faced the same situation and below is the one which I felt more suitable to address the problem.

        public DataTable ReadFirstCell()
        {
            DataTable dataTable = new DataTable();
            string value = string.Empty;
            try
            {
                using (SpreadsheetDocument myWorkbook = SpreadsheetDocument.Open(_filePath, true))
                {
                    var sheet = myWorkbook.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().First();
                    var worksheetPart = (WorksheetPart)myWorkbook.WorkbookPart.GetPartById(sheet.Id.Value);
                    IEnumerable<Row> rows = worksheetPart.Worksheet.GetFirstChild<SheetData>().Elements<Row>();
                    int RowIndex = 0;
 
                    foreach (var row in rows)
                    {
                        //Create the data table header row ie columns using first excel row.
                        if (RowIndex == 0)
                        {
                            CreateColumnsFromHeaderRow(row, dataTable);
                            RowIndex++;
                        }
                        else
                        {
                            //From second row of excel onwards, add data rows to data table.
                            IEnumerable<Cell> cells = GetCellsFromRowIncludingEmptyCells(row);
                            DataRow newDataRow = dataTable.NewRow();
                            int columnCount = 0;
                            foreach (Cell currentCell in cells)
                            {
                                value = GetCellValue(currentCell);
                                //There are empty headers which are not added to data table columns. So avoid those.
                                if (columnCount < dataTable.Columns.Count)
                                {
                                    newDataRow[columnCount++] = value;
                                }
                            }
                            dataTable.Rows.Add(newDataRow);
                        }
                    }
                }
            }

Below are the helper methods used
        private static IEnumerable<Cell> GetCellsFromRowIncludingEmptyCells(Row row)
        {
            int currentCount = 0;
            // row is a class level variable representing the current
            foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
                row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
            {
                string columnName = GetColumnName(cell.CellReference);
                int currentColumnIndex = ConvertColumnNameToNumber(columnName);
                //Return null for empty cells
                for (; currentCount < currentColumnIndex; currentCount++)
                {
                    yield return null;
                }
                yield return cell;
                currentCount++;
            }
        }

        public static string GetColumnName(string cellReference)
        {
            // Match the column name portion of the cell name.
            Regex regex = new Regex("[A-Za-z]+");
            Match match = regex.Match(cellReference);
 
            return match.Value;
        }

        public static int ConvertColumnNameToNumber(string columnName)
        {
            Regex alpha = new Regex("^[A-Z]+$");
            if (!alpha.IsMatch(columnName)) throw new ArgumentException();
 
            char[] colLetters = columnName.ToCharArray();
            Array.Reverse(colLetters);
 
            int convertedValue = 0;
            for (int i = 0; i < colLetters.Length; i++)
            {
                char letter = colLetters[i];
                int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
                convertedValue += current * (int)Math.Pow(26, i);
            }
 
            return convertedValue;
        }

The only method remaining is GetCellValue(). It can be found in one of my previous post about OpenXML sdk.

7 comments:

vishal gilbile said...
This comment has been removed by the author.
vishal gilbile said...

How do we read the content from Merged Cell using open xml. I tried using the below code

MergeCells mergeList = objWorkSheetPart.Worksheet.Elements().FirstOrDefault();

foreach(MergeCell objCell in mergeList)
{
//but here through objCell all what i can get is the Reference of the Cell. How do I access the value of Merged Cell.
}

Anonymous said...

Life saver!

Unknown said...

Very usefull! thanks

Unknown said...

CreateColumnsFromHeaderRow function is missing

Joymon said...

Hi Vikram,
Sorry that I missed the function. I dont have the same code handy. This can be the pseudo code. Hope can upload the full sample later.
private void CreateColumnsFromHeaderRow(Row row, DataTable dataTable)
{
IEnumerable cells = GetCellsFromRowIncludingEmptyCells(row);
foreach (Cell currentCell in cells)
{
string value = GetCellValue(currentCell);
dataTable.Columns.Add(value);
}
}

The GetCellValue may not work out as is. So there might be some changes required.

Anonymous said...

thank you. it's works.