Tuesday, October 31, 2017

Exposing Parquet file to SQL 2016 as well as Hadoop (Java/Scala)

This is just an architecture post explaining how a Parquet file can be exposed to a SQL Server 2016 database via PolyBase while other applications access it directly. Those other applications can be anything, such as data analytics code running in a Hadoop cluster.

This kind of integration is mainly needed when we already have a transactional database such as SQL Server and need to analyze its data. We can either schedule data movement using ETL technologies, or use PolyBase to move data from an internal table to an external PolyBase table backed by a Parquet file. If the solution is in Azure, the Parquet file can sit somewhere in Azure Storage. Once the data is there in Parquet format, the analytics algorithms can hit the same files directly. Parquet is used here because of its familiarity in the analytics community.
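As a rough sketch of the PolyBase side, the setup in SQL Server 2016 looks roughly like the below. All the object, container, and column names (AzureBlobStore, dbo.SalesExternal, etc.) are made up for illustration; use whatever fits the actual schema.

```sql
-- A database master key is required before creating a scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';

-- Credential holding the Azure Storage account access key (placeholder)
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'polybaseuser', SECRET = '<storage-account-access-key>';

-- External data source pointing at a blob container
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://analytics@<account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- File format telling PolyBase the files are Parquet
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);

-- External table backed by Parquet files in the container
CREATE EXTERNAL TABLE dbo.SalesExternal
(
    SaleId   INT,
    Amount   DECIMAL(18, 2),
    SaleDate DATETIME2
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = ParquetFileFormat
);

-- PolyBase export must be enabled once at the instance level:
-- EXEC sp_configure 'allow polybase export', 1; RECONFIGURE;
INSERT INTO dbo.SalesExternal
SELECT SaleId, Amount, SaleDate
FROM dbo.Sales;  -- the internal transactional table
```

The INSERT at the end is what moves data from the internal table into the Parquet-backed external table, which is the PolyBase alternative to a scheduled ETL job.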

Below goes such an architecture. Since the architecture may change over time, the LucidChart diagram is embedded rather than pasted as an image. Please comment if it is not rendering. Thanks to LucidChart for their freemium model.
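On the Hadoop side, a Spark job in Scala can read the same files directly. This is only a sketch with the same made-up path and column names as above, and it assumes the cluster already has the hadoop-azure connector and storage credentials configured.

```scala
import org.apache.spark.sql.SparkSession

object ReadSalesParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadSalesParquet")
      .getOrCreate()

    // Read the same Parquet files that back the PolyBase external table
    // (path is illustrative; cluster must be configured for wasbs access)
    val sales = spark.read.parquet(
      "wasbs://analytics@<account>.blob.core.windows.net/sales/")

    // Any analytics can run on this DataFrame; a trivial aggregate as example
    sales.groupBy("SaleDate").sum("Amount").show()

    spark.stop()
  }
}
```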

Implementation details beyond the sketches above are better shared in a separate post.
