Case
I have a file in an Azure Blob Storage container which I want to use in my Azure SQL Data Warehouse. How can I push the content of that file to Azure SQL DW?
One of the options: PolyBase
Solution
You could of course use an ETL product or Azure Data Factory, but you can also use the PolyBase technology in Azure SQL DW to expose that file as an external table. The data stays in the Azure Blob Storage file, but you can query it like a regular table.
Starting position
The starting position is a file in an Azure Blob Storage container. This file was created with U-SQL in another post to quickly process large numbers of files in Azure.
The content of the CSV file
Before we start, make sure your Azure SQL Data Warehouse is started and use SQL Server Management Studio (SSMS) to connect to it. Notice that the icon of a SQL DW is different from that of a SQL DB.
1) Master key
In the next step we will use a credential that points to the Azure Blob Storage. To encrypt that credential, we first need to create a master key in our Azure SQL Data Warehouse, but only if we do not already have one. You can check this in the table sys.symmetric_keys: if a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##'), then you already have a master key. Otherwise we need to create one. For Azure SQL Data Warehouse a password for the master key is optional, and in this example we will not use one.
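To do that check manually, you could run something like this first:

--Check for an existing master key
SELECT *
FROM   sys.symmetric_keys
WHERE  symmetric_key_id = 101
   OR  name = '##MS_DatabaseMasterKey##';

If it returns no rows, the script below creates the master key for you.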
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key';
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists';
END
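Should you prefer to protect the master key with a password after all, the syntax is as follows; the password shown is only a placeholder:

--Master key with an (optional) password
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>'; -- placeholder, choose your own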
2) Credentials
The next step is to create a credential that will be used to access the Azure Blob Storage. Go to the Azure portal and find the Storage Account that contains your blob file. Then go to the Access keys page and copy key1 (or key2).
Access keys
Then execute the following code, where IDENTITY contains a random string and SECRET contains the key you copied from your Azure Storage account.
--Credential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = 'JGadV/tAt1npuNwkiH9HnI/wosi8YS********==';
Tip: give the credential a descriptive name so that you know what it is used for. You can find all credentials in the table sys.database_credentials:
--Find all credentials
SELECT * FROM sys.database_credentials;
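If you later regenerate the storage account key in the portal, the credential has to be updated as well. A minimal sketch, reusing the credential name from above:

--Update the credential after rotating the storage key
ALTER DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<new key1 or key2>'; -- paste the regenerated key here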
3) External data source
With the credential from the previous step we will create an External data source that points to the Azure Blob Storage container where your file is located. Execute the code below where:
- TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
- LOCATION = the connection string to the container (replace [ContainerName] with the name of the container and [StorageAccountName] with the name of your storage account).
- CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://[ContainerName]@[StorageAccountName].blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
Tip: give the external data source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources;
Notice that the filename is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same container as External Tables.
Filename not in External Data Source
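Should you need to recreate the external data source, for example to point it to a different container, you can drop it first. Note that this only succeeds after dropping the external tables that reference it:

--Remove the external data source (drop dependent external tables first)
DROP EXTERNAL DATA SOURCE AzureStorage;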
4) External File format
Now we need to describe the format used in the source file. In our case we have a comma-delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
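To illustrate those extra options, the sketch below describes a Gzip-compressed file with quoted strings and an explicit date format. The name TextFileGz and the option values are only examples:

--Create External File Format with extra options (illustrative)
CREATE EXTERNAL FILE FORMAT TextFileGz
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);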
Tip: give the format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats;
5) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the data source from step 3 and the file format from step 4.
--Create External Table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION = '/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE = AzureStorage, -- from step 3
    FILE_FORMAT = TextFile      -- from step 4
);

Note: PolyBase does not like column name headers. It will treat the header like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE, but it only works when the datatype of the header is different from the datatypes of the actual rows. Otherwise you have to filter out the header row in a subsequent step.
--Create External Table with header
CREATE EXTERNAL TABLE dbo.sensordata5 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION = '/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);

You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables;

However, you can also find the External Table (as well as the External Data Source and the External File Format) in the Object Explorer of SSMS.
SSMS Object Explorer
6) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file, the data in this external table changes instantly, because the file itself is used to get the data.
--Testing
SELECT COUNT(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
Querying an external table
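Joins, filters and aggregations work as usual. For example, a quick average per location (just an illustrative query against the sample columns):

--Aggregate over the external table
SELECT   [location]
,        AVG([temp]) AS AvgTemp
FROM     dbo.sensordata
GROUP BY [location];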
7) What is next?
Most likely you will use a CTAS query (Create Table As Select) to copy and transform the data to another table, since this is the fastest and preferred way in SQL DW. In a subsequent post we will explain more about CTAS, but here is what a CTAS query looks like.
--CTAS
CREATE TABLE [dbo].[Buildings]
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
FROM    [dbo].[sensordata];
In some cases you could also use a SELECT INTO query as an alternative to CTAS.
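A minimal sketch of that alternative, with a hypothetical target table name:

--SELECT INTO as an alternative for CTAS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
INTO    [dbo].[BuildingsCopy] -- hypothetical target table
FROM    [dbo].[sensordata];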
Summary
In this post you saw how easy it is to read a file from Azure Blob Storage and use it as a table in Azure SQL Data Warehouse. The big advantage of PolyBase is that you only have one copy of the data, because the data stays in the file. In a next post we will see how to read the same file from Azure Data Lake Store, which does not use Access keys.
In another post we will explain the basic usage of the CTAS query, which is the preferred way to handle large sets of data in Azure SQL DW and in its on-premises precursor APS (a.k.a. PDW).
Comments

Excellent, each step is clear and perfect.

How can I access multiple paths with the same external connection?

You can only provide one path, but it does read subfolders. Documentation: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15
Or try something with a union of multiple external tables in a view?
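A minimal sketch of that last suggestion, assuming a second external table dbo.sensordata2 with the same columns:

--View that unions two external tables (illustrative)
CREATE VIEW dbo.sensordata_all
AS
SELECT [Date], [temp], [hmdt], [location] FROM dbo.sensordata
UNION ALL
SELECT [Date], [temp], [hmdt], [location] FROM dbo.sensordata2;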