Tuesday, September 19, 2017

SQL 2016 - Create external polybase table on parquet file stored in Azure blob storage

This is a walk-through on creating an external PolyBase table in SQL Server 2016 that stores its data in Azure blob storage using the Parquet file format.

Prerequisite

The prerequisites are basic knowledge of SQL Server and Microsoft Azure.

Use cases

One of the best use cases is moving data from SQL Server transactional systems to Azure blob storage. In the data analytics world we rarely work directly against transactional relational databases, so data has to be moved between the two. For example, if we have an internal (normal) transactional table, we can create an external PolyBase table with the same schema and do an INSERT...SELECT to move the data from SQL Server to unstructured stores such as Azure blob storage or Hadoop.
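For reference, here is a minimal sketch of what such an internal source table could look like. It is a hypothetical table whose schema matches the external table created later in this post, and it stands in for the dbo.SampleTable_Internal referenced in the final INSERT...SELECT example.

-- Hypothetical internal (regular) SQL Server table used as the source
-- for the INSERT..SELECT into the external table at the end of this post.
CREATE TABLE [dbo].[SampleTable_Internal] (
    [customerName] [nvarchar](100) NULL,
    [customerNumber] [varchar](70) NULL
);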

Azure storage account and blob setup

This is as straightforward as creating an Azure storage account and blob container for any other purpose. Note down the storage account key and the container name; they are needed later to set up the connection from SQL Server.

Installing SQL 2016 with Polybase support

PolyBase support arrived with SQL Server 2016. This feature requires a Java runtime. Yes: PolyBase is essentially a wrapper over Java libraries, and those need a JVM to run. The minimum required version for running PolyBase at the time of writing this post is JRE 7 Update 51.

Once the JRE 1.7 version is installed, we can install SQL Server 2016 with PolyBase support. PolyBase has to be selected manually during the installation as it is not selected by default.
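A quick way to confirm that PolyBase ended up installed on the instance is to query the corresponding server property (1 means installed):

-- Returns 1 if PolyBase is installed on this instance, 0 otherwise.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;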

SQL Server level Settings

There are a few settings that have to be configured at the SQL Server instance level.

Enable Polybase

The query below enables PolyBase export, which allows INSERT into external tables.

-- Enable INSERT into external table
sp_configure 'allow polybase export', 1;
RECONFIGURE;
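Depending on the target, the 'hadoop connectivity' configuration option may also need to be set. At the time of writing, option 7 covered Azure blob storage (WASB[S]) for SQL Server 2016; verify the right value against the documentation before applying it.

-- Set the connectivity level; option 7 includes Azure blob storage on SQL 2016.
-- A restart of the SQL Server instance is needed for this to take effect.
sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;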

Create password

This creates the database master key, which is used to encrypt the credential secret in the next step.

-- Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Password>';
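Since the key is only needed when one does not already exist, an easy way to check for an existing database master key is to look in sys.symmetric_keys:

-- Returns a row if this database already has a master key.
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';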

Database level settings

This is the step where the credential for the Azure blob storage account is set up in the SQL Server database.

-- Create a database scoped credential
-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.
go
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<storage account key>'
; 
Note that this doesn't specify where the data resides; it just gives SQL Server access to the entire storage account.

Create data source

Now we can use the stored credential to map to the Azure blob container path. This tells PolyBase where in the storage account the data goes. The wasbs:// protocol is used to point to the blob storage path.

-- Create an external data source
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.

-- Example with a sample storage account; container name: polytest, storage account name: parquetstore
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://polytest@parquetstore.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
); 

Create file format

One more piece of setup is required before creating the table: the file format. It tells PolyBase which format to use when storing the table data.

-- Create an external file format
-- FORMAT_TYPE: Type of file format in Azure storage (supported: DELIMITEDTEXT, RCFILE, ORC, PARQUET).
-- FORMAT_OPTIONS: Specify field terminator, string delimiter, date format etc. for delimited text files.
-- Specify DATA_COMPRESSION method if data is compressed.

CREATE EXTERNAL FILE FORMAT [ParquetFileFormatSnappy]  
WITH (  
    FORMAT_TYPE = PARQUET,  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
);

The data compression codec used here is Snappy. Other codecs are available as well.
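For example, a Gzip-compressed Parquet format can be declared the same way (a sketch; the name ParquetFileFormatGzip is just an illustrative choice):

-- Parquet file format using the Gzip codec instead of Snappy.
CREATE EXTERNAL FILE FORMAT [ParquetFileFormatGzip]
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);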

Create external table

This is the last step. It creates the external table using the file format, storage location and data source.

-- CREATE AN EXTERNAL TABLE WITH PARQUET FILE.   
--A: Create the external table
-- Specify column names and data types. This needs to match the data in the sample file.
-- LOCATION: Specify path to file or directory that contains the data (relative to the blob container).
-- To point to all files under the blob container, use LOCATION='.'
CREATE EXTERNAL TABLE [dbo].[SampleTable_Parquet_External] (  
    [customerName] [nvarchar](100) NULL,
    [customerNumber] [varchar](70) NULL
)  
WITH (  
        LOCATION='/customerdata.parquet',  
        DATA_SOURCE = AzureStorage,  
        FILE_FORMAT = [ParquetFileFormatSnappy],  
        REJECT_TYPE = VALUE,  
        REJECT_VALUE = 0  
);
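Once created, the external table can be queried like any other table; for example, a quick sanity check after data has been written to the location:

-- Read a few rows back through the external table.
SELECT TOP (10) [customerName], [customerNumber]
FROM [dbo].[SampleTable_Parquet_External];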

CRUD operations on external table

As documented for PolyBase, external tables don't support DML operations such as UPDATE and DELETE statements. We can only INSERT (and SELECT).

--Insert Data to the external table
INSERT INTO [dbo].[SampleTable_Parquet_External]
SELECT * FROM [dbo].[SampleTable_Internal];
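By contrast, UPDATE and DELETE statements against the external table are rejected by the engine; for instance, a statement like the following is expected to fail:

-- Expected to fail: external tables only support INSERT and SELECT.
UPDATE [dbo].[SampleTable_Parquet_External]
SET [customerName] = N'Updated name'
WHERE [customerNumber] = '1';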

Things to remember about polybase

The API is nicely granular: the file format, data source and credential are created separately and can be reused when creating tables.

As stated earlier, it is mainly useful in big data analytics. But when we tried moving more than 300,000 records from SQL Server to blob storage, it simply broke. We are still troubleshooting with Microsoft; hopefully it is something we did wrong.

Another issue we faced was around data types. The PolyBase engine doesn't know how to handle and move all data types. There are settings (the REJECT_TYPE and REJECT_VALUE options used above) to specify how many mismatched rows it should tolerate before the query fails.
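For illustration, rejection can also be expressed as a percentage of rows rather than an absolute count (a sketch reusing the table definition from above; whether percentage-based rejection is honoured for Parquet sources is worth verifying against the documentation):

-- Tolerate up to 5% rejected rows, re-evaluated every 1000 rows read.
CREATE EXTERNAL TABLE [dbo].[SampleTable_Parquet_External_Pct] (
    [customerName] [nvarchar](100) NULL,
    [customerNumber] [varchar](70) NULL
)
WITH (
    LOCATION = '/customerdata.parquet',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = [ParquetFileFormatSnappy],
    REJECT_TYPE = PERCENTAGE,
    REJECT_VALUE = 5,
    REJECT_SAMPLE_VALUE = 1000
);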

More limitations are listed in the Microsoft PolyBase documentation.

More reading
