Tuesday, October 31, 2023

SharePoint Online - Graph API to get all files in a drive recursively even if it has more than 5000 items

Problem

As part of the validation, we must dump all the file details inside a SharePoint drive. Some more points below

  • The input is a drive graph id.
  • It has to be recursive
  • There can be 10,000+ files in the drive and the export needs to be done in a reasonable amount of time.
  • The expected output is CSV with the below schema.
  • FolderName, FileGraphId, FileName
  • FileGraphId is the unique id of the file that Graph API understands. It is not Guid or integer instead it uses a 34-character wide format like '01XXXXA4QGNCJMT7XXXVDKCNXXXV6LCC4E'

    Approaches

    There are 3 APIs to interact with SharePoint Online. The first Graph has no direct method to list all the files recursively. We have to go by individual Graph API calls to the list folders and make subsequent  Graph calls to get the folders and files inside recursively. It works but is very slow and high chance of getting throttled. There is a search API but that has no guarantee of returning all the files as the indexing will take some time. 

    If we use the PnP.Core or PnP.Frameworks, there is no way to get the Graph IDs.

    Solution

    So lastly we have to go via Graph API but get the List items (not drive) then get the driveItem of list item to retrieve the Graph Id. The code is as follows.

    var graphClient = new GraphServiceClient(tokenProvider);

    //Url for below looks like /v1.0/sites/{siteGraphid}}/lists/{listGuid}/items?$select=contentType&$expand=driveItem HTTP/1.1
    var listItems = await graphClient
                            .Sites[siteGraphId]
                            .Lists[listGuid] // Need to get the ListId from DriveId
                            .Items
                            .GetAsync(req=> {
                                                req.QueryParameters.Expand = new []{"driveItem"};
                                                req.QueryParameters.Select = new []{"contentType"};
                                            });
    listItems
    .Value
    .Where(item=>item.ContentType.Name == "Document")
    .ToList()
    .ForEach(item=>Console.WriteLine($"FolderName:{item.DriveItem.ParentReference.Name},File Graph Id:{item.DriveItem.Id}, File Name: {item.DriveItem.Name}"));

    The drive Id needs to be translated to ListId to get this working. The QueryParameters.Select is used to limit the data to mainly 2 elements: driveItem and contentType.

    Full code

    The above code gives the idea how the required data can be pulled. Below goes full code that can be run inside Polyglot notebook.
    #r "nuget:Microsoft.Graph"
    #r "nuget:Azure.Identity"
    #r "nuget:CsvHelper"
    using Microsoft.DotNet.Interactive;
    using Microsoft.Graph;
    using Microsoft.Graph.Models;
    using Azure.Core;
    using Azure.Identity;
    using System.Collections.Generic;
    using CsvHelper;
    using CsvHelper.Configuration;
    using System.Globalization;
    public class FooMap : ClassMap<(string  FolderName,string FileGraphId,string FileName)>
    {
        public FooMap()
        {
            Map(m => m.FolderName).Name("FolderName");
            Map(m => m.FileGraphId).Name("FileGraphId");
            Map(m => m.FileName).Name("FileName");
        }
    }
    TokenCredential tokenprovider = new UsernamePasswordCredential(
                                "{userName@tenant.onmicrosoft.com}",
                                "{password}",
                                "{AAD Tenant Id}",
                                "{AAD App registration id}");
    var siteGraphId="{3 part graph Id of site}";
    var listGuid = "{Guid of List}";// Get this using Drive.
    var graphClient = new GraphServiceClient(tokenprovider);
    //Url for below looks like /v1.0/sites/{siteGraphid}}/lists/{listGuid}/items?$select=contentType&$expand=driveItem HTTP/1.1
    ListItemCollectionResponse listItems = await graphClient
                            .Sites[siteGraphId]
                            .Lists[listGuid]
                            .Items
                            .GetAsync(req=> {
                                                req.QueryParameters.Expand = new []{"driveItem"};
                                                req.QueryParameters.Select = new []{"contentType"};
                                            });
    var finalItems = new List<(string FolderName,string FileGraphId,string FileName)>();
    var pageIterator = PageIterator<ListItem,ListItemCollectionResponse>
    .CreatePageIterator(graphClient,listItems,(item)=>{
        if(item.ContentType.Name == "Document")
        {
            finalItems.Add((FolderName:item.DriveItem.ParentReference.Name,FileGraphId:item.DriveItem.Id,FileName:item.DriveItem.Name));
        }
        return true;
    });
    await pageIterator.IterateAsync();
    using(var outputFileWriter= new StreamWriter(@"c:\temp\export.csv"))
    {
        using (var csv=new CsvWriter(outputFileWriter,CultureInfo.InvariantCulture))
        {
            csv.Context.RegisterClassMap<FooMap>();
            csv.WriteRecords(finalItems);
        }
    }
    finalItems.ForEach((item)=>Console.WriteLine($"FolderName:{item.FolderName},File Graph Id:{item.FileGraphId}, File Name: {item.FileName}"))
    Each(item=>Console.WriteLine($"FolderName:{item.DriveItem.ParentReference.Name},File Graph Id:{item.DriveItem.Id}, File Name: {item.DriveItem.Name}"));

    The authentication used here is the ROPC username password. It needs to be changed based on best practices. It uses a nuget package for writing a list as a CSV file.

    References

    No comments: