Tuesday, February 28, 2023

SharePoint Online / Office 365 is modifying uploaded Office files

When we use remote cloud storage services, one of the basic expectations is file integrity. In simple words, we should get back exactly the file we uploaded, without a single bit modified. For a small text file, we can eyeball the copy kept before upload against the downloaded file. But for a JPEG image or an MKV video, comparing by eye is not practical.

Can we rely on the file size to ensure it is not modified?

The file size can be a basic check, but it is never 100% reliable. Two files of the same size can still differ at the bit level.

In computer science, we use a checksum to verify that a file is intact. Even if a single bit changes, the checksum comparison will fail. Now think from the developer's perspective. If users complain that a file was modified, we can easily confirm or rule it out by comparing the checksums of the two copies.
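To make the point about size versus checksum concrete, here is a minimal Python sketch. It assumes SHA-256 as the checksum algorithm (the post does not name one), and the payload bytes are made up purely for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical payload; the second copy has exactly one bit flipped.
original = b"quarterly report v1"
tampered = bytes([original[0] ^ 0x01]) + original[1:]

same_size = len(original) == len(tampered)
same_checksum = sha256_hex(original) == sha256_hex(tampered)

print("same size:", same_size)          # same size: True
print("same checksum:", same_checksum)  # same checksum: False
```

The size check passes while the checksum check fails, which is why the size alone cannot be trusted.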

Coming to the cloud-based SharePoint Online, there is a document library where we can upload files. Let us see whether this checksum validation works with SharePoint Online.

SharePoint Online checksum comparison

If the file is not an Office file, the checksum comparison works fine. Below is the easiest method to find the checksum using PowerShell.

Hope the code is simple. 
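For anyone who wants to run the same comparison outside PowerShell (where the built-in Get-FileHash cmdlet computes a file hash), a rough Python equivalent might look like this. The helper names `file_sha256` and `is_unmodified`, the choice of SHA-256, and the temporary demo files are all assumptions made for illustration; they do not come from SharePoint or from any Microsoft tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(uploaded: Path, downloaded: Path) -> bool:
    """Cheap size check first, then the checksum comparison."""
    if uploaded.stat().st_size != downloaded.stat().st_size:
        return False
    return file_sha256(uploaded) == file_sha256(downloaded)


# Demo with stand-in files; real use would point at the local copy
# and the copy downloaded from the document library.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.bin"
    intact = Path(tmp) / "intact.bin"
    altered = Path(tmp) / "altered.bin"
    original.write_bytes(b"example payload")
    intact.write_bytes(b"example payload")
    altered.write_bytes(b"example payload plus extra data")
    intact_ok = is_unmodified(original, intact)
    altered_ok = is_unmodified(original, altered)

print("intact copy matches:", intact_ok)
print("altered copy matches:", altered_ok)
```

Checking the size first is only a shortcut: a size mismatch already proves the file changed, and the hash settles the cases where the sizes agree.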

Now coming to Office files. Forget about checksum comparison; even the file sizes don't match. Have a look at the screenshot below.
The downloaded file has an extra 6 KB of data. Obviously, the checksum will be different.

What data is getting added by SharePoint Online?

As we all know, a .docx or any other modern Office file (.xlsx, .pptx) is nothing but a bunch of files zipped together and given a different extension. We can simply rename the extension to .zip and extract it.
As we can see in the above screenshot, there is a new folder called 'customXml' in the downloaded file. The folder size is around 10.2 KB.
If we open the 'customXml' folder, we can see some files prefixed with item*.
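The same inspection can be scripted instead of renaming and extracting by hand. The following Python sketch lists the archive entries that exist only in the downloaded copy. The helper names `build_package` and `added_entries` are hypothetical, and the two packages are tiny stand-ins built in memory; the member names under 'customXml' are placeholders, not actual SharePoint output.

```python
import io
import zipfile


def build_package(members: dict[str, str]) -> bytes:
    """Build a small zip archive in memory from name -> text pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def added_entries(uploaded: bytes, downloaded: bytes) -> list[str]:
    """Return entry names present in the downloaded archive only."""
    with zipfile.ZipFile(io.BytesIO(uploaded)) as before, \
            zipfile.ZipFile(io.BytesIO(downloaded)) as after:
        return sorted(set(after.namelist()) - set(before.namelist()))


# Stand-in packages with made-up contents.
uploaded = build_package({"word/document.xml": "<doc/>"})
downloaded = build_package({
    "word/document.xml": "<doc/>",
    "customXml/item1.xml": "<placeholder/>",
    "customXml/itemProps1.xml": "<placeholder/>",
})

extra = added_entries(uploaded, downloaded)
print(extra)  # ['customXml/item1.xml', 'customXml/itemProps1.xml']
```

With real files, `zipfile.ZipFile` can be given the path of the .docx directly, since it reads the zip structure regardless of the file extension.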

Is it documented?

Unfortunately, it is not documented anywhere. As per Microsoft, it is their internal design and cannot be disclosed. Below are some links that discuss the same issue on the internet.
