Tuesday, February 7, 2023

Check for comments in a large Word document using the SAX model of the Open XML SDK

The Open XML SDK lets us work with Microsoft Office files from managed code. It works the same whether our code runs in a desktop application or a server-side application. The first thing to know about this SDK is its reading models.

Why do we care about reading models?

This matters because Open XML-based .docx and .xlsx files are just zip archives with a different extension, so their content is larger once expanded.

  • When the file is loaded into memory, the memory required may be as much as 8x the actual file size.
  • When our code runs in memory-restricted containers, it can fail with an OutOfMemoryException.
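
To see the zip structure for ourselves, here is a minimal sketch that lists the entries of a package along with their compressed and expanded sizes. It uses only System.IO.Compression from the base class library; the class name PackageInspector and the method DescribeEntries are made up for this illustration and are not part of the Open XML SDK.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration only; not part of the Open XML SDK.
public static class PackageInspector
{
    // Returns one line per zip entry: name, compressed size, expanded size.
    public static List<string> DescribeEntries(Stream packageStream)
    {
        var lines = new List<string>();

        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var archive = new ZipArchive(packageStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                lines.Add($"{entry.FullName}: {entry.CompressedLength} -> {entry.Length} bytes");
            }
        }

        return lines;
    }
}
```

Calling DescribeEntries on a stream opened with File.OpenRead over any .docx file should list parts such as word/document.xml, and the gap between the two sizes gives a rough idea of how much the file grows before any object model is built on top of it.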

What are the 2 models?

We will look at the 2 models, DOM and SAX, through one example: checking whether a Word document contains any comments.

DOM model

If we are sure the file is small and its expanded size will fit in the available memory, we can use the DOM model. The code to check for any comments in the Word document is as follows.

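        // Requires: using System.IO; using System.Linq; using DocumentFormat.OpenXml.Packaging;
        private bool HasComments(Stream stream)
        {
            // Open read-only; the comments part is loaded into memory as a DOM tree.
            using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, false))
            {
                var commentsPart = wordDoc.MainDocumentPart?.WordprocessingCommentsPart;
                // No comments part, or an empty one, means the document has no comments.
                return commentsPart?.Comments?.Any() ?? false;
            }
        }

The parameter here is a stream to keep the method general. If the input is a file, it can easily be opened as a stream.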
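
As a minimal sketch of the file case, the overload below opens a file as a read-only stream with File.OpenRead and passes it to the DOM check. The class name DomCommentCheck and the path-based overload are assumptions added for this illustration; the stream-based method is repeated only so the snippet compiles on its own.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical wrapper class for illustration only.
public static class DomCommentCheck
{
    // DOM check on a stream, as in the example above.
    public static bool HasComments(Stream stream)
    {
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, false))
        {
            var commentsPart = wordDoc.MainDocumentPart?.WordprocessingCommentsPart;
            return commentsPart?.Comments?.Any() ?? false;
        }
    }

    // Assumed convenience overload: open the file read-only and reuse the stream version.
    public static bool HasComments(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return HasComments(stream);
        }
    }
}
```

The same stream-based method can then serve an uploaded file, a blob download, or a MemoryStream without any change.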

SAX model

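Here we read the document element by element instead of loading it all at once. The code below uses the SAX model to check whether there are any comments in a Word document. It looks for a comment range marker in the document body and stops at the first one it finds.

        // Requires: using System.IO; using DocumentFormat.OpenXml;
        // using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
        public bool HasComments(Stream stream)
        {
            using (var wordDoc = WordprocessingDocument.Open(stream, false))
            {
                var mainPart = wordDoc.MainDocumentPart;
                if (mainPart == null)
                {
                    return false;
                }

                // Forward-only reader: only the current element is held in memory.
                using (var reader = OpenXmlReader.Create(mainPart))
                {
                    while (reader.Read())
                    {
                        // A CommentRangeStart marks where a comment is anchored in the body.
                        if (reader.ElementType == typeof(CommentRangeStart))
                        {
                            return true;
                        }
                    }
                }
            }
            return false;
        }

We can relate the two models to DataSet and DataReader in ADO.NET: the DOM model, like a DataSet, holds everything in memory, while the SAX model, like a DataReader, moves forward through the data one item at a time.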
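
The same forward-only reading style also works on other parts of the package. As a rough sketch, and not something the original example does, the method below streams the comments part and counts Comment elements. The names SaxCommentCount and CountComments are made up for this illustration. The IsStartElement check is there because the reader reports both the start and the end of each element.

```csharp
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration only.
public static class SaxCommentCount
{
    public static int CountComments(Stream stream)
    {
        using (var wordDoc = WordprocessingDocument.Open(stream, false))
        {
            var commentsPart = wordDoc.MainDocumentPart?.WordprocessingCommentsPart;
            if (commentsPart == null)
            {
                return 0; // no comments part in this document
            }

            int count = 0;
            using (var reader = OpenXmlReader.Create(commentsPart))
            {
                while (reader.Read())
                {
                    // Count each comment once, on its start element only.
                    if (reader.ElementType == typeof(Comment) && reader.IsStartElement)
                    {
                        count++;
                    }
                }
            }
            return count;
        }
    }
}
```

A result greater than zero answers the same question as HasComments, at the cost of reading the whole comments part instead of stopping at the first match.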

References

https://learn.microsoft.com/en-us/office/open-xml/how-to-parse-and-read-a-large-spreadsheet
