
Sunday, 5 December 2021

Delta Lake support in Azure Synapse Analytics

Case
Delta Lake is already widely used in Azure Databricks, but now it is also available in Synapse Analytics. How can I use it there to store history and do 'time travel' in my historical data?
Synapse now supports Delta Lake











Solution
Delta Lake is now Generally Available (GA) in Synapse Analytics, but at the time of writing Microsoft is still implementing new Delta Lake features in Synapse.

This example uses a Synapse Pipeline with a Copy Data Activity to ingest data from the source and then calls a Python Delta Lake script (other languages are possible) via the Notebook activity.












Prerequisites
You need to make sure that you (for debugging) and the Managed Service Identity (MSI) of your Synapse Analytics workspace have access to the Azure Data Lake with the Role Storage Blob Data Contributor.
  • In the Azure Portal go to the Storage Account used by the Synapse Analytics workspace
  • In the left menu click on Access Control (IAM)
  • Click on + Add and choose Add role assignment
  • Search for Storage Blob Data Contributor, select the role and click on Next
  • Click on + Select members, find your Synapse workspace and yourself, and click Select
  • Optionally add a description explaining why. Then click on Review + assign (twice)
At the time of writing our Apache Spark Pool uses version 3.1 with Delta Lake 1.0. If you are using an older version of Spark (2.4) then you get Delta Lake version 0.6, which is slightly different. If newer versions appear, just try the newest Spark Pool.
Apache Spark pool Additional Settings




















1) Code cell 1: parameters
The first code cell is for the parameters that can be overridden by parameters from the Notebook activity in the pipeline. For more details see our post about notebook parameters. For debugging within the notebook we used real values.
# path of the data lake container
data_lake_container = 'abfss://yourbronzecontainer@yourdatalake.dfs.core.windows.net'

# The ingestion folder where your parquet files are located
ingest_folder = 'parquetstage'

# The bronze folder where your Delta Tables will be stored
bronze_folder = 'bronze'

# The name of the table
table_name = 'residences'

# The wildcard filter used within the ingestion folder to find files
source_wildcard = 'residences*.parquet'

# A comma separated string of one or more key columns (for the merge)
key_columns_str = 'Id'
Parameters









2) Code cell 2: import modules and functions
The second code cell is for importing all required/useful modules. For this basic example we need three imports:
  • DeltaTable from delta.tables for handling Delta tables
  • mssparkutils from notebookutils for file system utilities (removing the Delta table folder)
  • os for building the folder paths used in the next cell
# Import modules
import os
from delta.tables import DeltaTable
from notebookutils import mssparkutils
Imports







3) Code cell 3: filling delta lake
Now the actual code for filling the Delta Lake table with the parquet files from the data lake. Note that the code is very basic: it checks whether the Delta Lake table already exists. If not, it creates the Delta Lake table; if it already exists, it merges the new data into the existing table. If you have transactional data then you could also do an append instead of a merge.

# Convert comma separated string with keys to array
key_columns = key_columns_str.split(',')  
 
# Convert array with keys to where-clause for merge statement
conditions_list = [f"existing.{key}=updates.{key}" for key in key_columns]
 
# Determine path of source files from ingest layer
source_path = os.path.join(data_lake_container, ingest_folder, source_wildcard)

# Determine path of Delta Lake Table
delta_table_path = os.path.join(data_lake_container, bronze_folder, table_name)

# Read file(s) in spark data frame
sdf = spark.read.format('parquet').option("recursiveFileLookup", "true").load(source_path)
 
# Check if the Delta Table exists
if (DeltaTable.isDeltaTable(spark, delta_table_path)):
    print('Existing delta table')
    # Read the existing Delta Table
    delta_table = DeltaTable.forPath(spark, delta_table_path)
 
    # Merge new data into existing table
    delta_table.alias("existing").merge(
        source = sdf.alias("updates"),
        condition = " AND ".join(conditions_list)
         
    ).whenMatchedUpdateAll(
    ).whenNotMatchedInsertAll(
    ).execute()
 
    # For transactions you could do an append instead of a merge
    # sdf.write.format('delta').mode('append').save(delta_table_path)
 
else:
    print('New delta table')
    # Create new delta table with new data
    sdf.write.format('delta').save(delta_table_path)
Adding file to Delta Lake
















4) Viewing the Delta Table in notebook
If you run the notebook with the code of the first three steps a couple of times with changed/extra/fewer records, then history will be built up in the Delta table. For debugging purposes you can add an extra code cell to view the data and the various versions of the data.

To check the current version of the data you can use the following code:
display(spark.read.format('delta').load(delta_table_path))
Get current version of data













And with this code you can investigate the historical versions of the data. In this case there are two versions:
# Get all versions
delta_table = DeltaTable.forPath(spark, delta_table_path)
display(delta_table.history())
Get versions of data








To retrieve one specific version you could use something like this (where the 0 is the version from the above picture):
# Get one specific version
display(spark.read.format("delta").option("versionAsOf", "0").load(delta_table_path))
Get specific version of data












You can also use a datetime to retrieve data from the Delta Lake by using timestampAsOf instead of versionAsOf:
# Get one specific version with timestamp filter
display(spark.read.format("delta").option("timestampAsOf", "2021-12-05 19:07:00.000").load(delta_table_path))
Get specific version of data with datetime filter













To remove the entire Delta Lake table (and all history) you could use something like:
# Delete Delta Table (folder)
mssparkutils.fs.rm(delta_table_path, recurse=True)
Delete Delta Table







5) Viewing the Delta Table in a Serverless SQL Pool
At the moment of writing you can query the Delta Lake in a Serverless SQL Pool, but you cannot yet use the 'time travel' feature. Please upvote this feature here.

The first option is to use an OPENROWSET query within a SQL Script in your Synapse Workspace:
-- Query the Delta Lake
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/places/',
    FORMAT = 'delta') as rows
ORDER BY Id;
Query the Delta Lake via an OPENROWSET query
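By default the column names and data types are inferred from the Delta Lake files. If you prefer to specify them explicitly, you can add a WITH clause to the OPENROWSET query. A minimal sketch, reusing the Id and Residence columns from the External Table example further below:
-- Query the Delta Lake with an explicit schema
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/places/',
    FORMAT = 'delta')
    WITH (Id int, Residence VARCHAR(50)) as rows
ORDER BY Id;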



















A second option is using PolyBase by creating an External Table on the Delta Lake. This does require you to create a database within the Serverless SQL Pool, because you can't create external tables in the master database.
-- Query the Delta Lake

-- Create database because it wont work on the master database
CREATE DATABASE MyDwh;

-- Create External Data Source
CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
WITH ( location = 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/' );

-- Create External File Format
CREATE EXTERNAL FILE FORMAT DeltaLakeFormat
WITH ( FORMAT_TYPE = DELTA );

-- Create External Table
CREATE EXTERNAL TABLE Residence (
     Id int,
     Residence VARCHAR(50)
) WITH (
        LOCATION = 'places', --> the root folder containing the Delta Lake files
        data_source = DeltaLakeStorage,
        FILE_FORMAT = DeltaLakeFormat
);

-- Get Data from your Delta Lake Table
SELECT          TOP 10 * 
FROM            Residence
ORDER BY        Id

























Conclusion
In this post you learned how to create and query a Delta Lake within your Synapse Analytics Workspace. The main advantage is of course that you don't need Azure Databricks if you are already using Synapse, making your data platform architecture slightly clearer and easier.

A disadvantage, at the moment of writing, is the lack of time travel within the Serverless SQL Pool environment. This means you're forced to use notebooks to build your Data Warehouse when the latest version of your data is just not enough. So please upvote this feature here. There are some more limitations and known issues in the current version, but we expect at least some of them to be solved in future updates.

Thanks to colleague Jeroen Meidam for helping!


Sunday, 3 May 2020

Use Managed Service Identity for Synapse PolyBase

Case
In a previous PolyBase example we stored a secret in Synapse to read data from the storage account. Is it possible to use a Managed Identity instead of storing secrets in Synapse? However, I cannot find the Managed Identity of my SQL Server.
Azure Synapse Analytics with PolyBase reading Azure Storage Account














Solution
You can use a Managed Identity, but there are two requirements. First, this only works with 'StorageV2 (general purpose v2)'; 'BlobStorage' or 'Storage (general purpose v1)' will not work! Second, you need to register the SQL Server that hosts Synapse in your Azure Active Directory. This allows you to select your SQL Server within Access control (IAM).

1) Create Storage Account
Create an Azure Storage Account and make sure the type is StorageV2 (general purpose v2). The storage account of this example is called 'bitoolsstorage' and it has a container called 'mycontainer'. You can choose your own names, but these names will be used in the example code.















2) Create Synapse
Create a Synapse Data Warehouse including a SQL Server to host it. Our SQL Server is called 'bitoolssynapseserver' and our Synapse SQL Pool (data warehouse) is called 'synapsedwh'. Again choose your own names and change those in the example code below.
Synapse SQL Pool (data warehouse)












3) Register SQL Server in AD
The next step is to register the SQL Server that hosts your Synapse DWH in the Azure Active Directory. This allows you to find your SQL Server as a Managed Identity in the next step. At the moment of writing this needs to be done via PowerShell and cannot be done via the portal.

We will be using Cloud Shell (PowerShell in the portal), but you can also use PowerShell (ISE) on your Windows device but then you have to execute two extra commands (login and select subscription).

  • Click on the Cloud Shell icon in the upper right corner (next to the search box). This will start PowerShell in the portal. If this is the first time using it, you first need to connect it to an Azure Storage Account.
  • Then execute the Set-AzSqlServer command. The first parameter is the resource group where the SQL Server is located, the second parameter is the name of the SQL Server (without .database.windows.net) and the last parameter assigns the Managed Identity.
# PowerShell
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity
Register SQL Server as Managed Identity
















If you are using PowerShell on your Windows device instead of Cloud Shell then use this code:
# PowerShell
# Login to Azure (popup will appear)
Connect-AzAccount

# Select your subscription
Select-AzSubscription -SubscriptionId "2c67b23a-4ba2-4273-bc82-274a743b43af"

# Assign Managed Identity
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity

4) Storage Blob Data Contributor
Now it's time to give your SQL Server access to the Azure Storage Account. The role we need for this according to the documentation is 'Storage Blob Data Contributor', but I also tested it with 'Storage Blob Data Reader' and that works fine as well (since we are only reading data). Note: you need to be the owner of the resource (group) to delegate access to others.
  • Go to your Storage Account from step 1
  • Click on Access control (IAM) in the left menu
  • Click on the + Add icon and choose Add role assignment
  • In the Role drop down select 'Storage Blob Data Contributor'
  • Leave the Assign access to drop down unchanged
  • In the Select box start typing the name of your SQL Server
  • Select your SQL Server and click on the Save button
Delegate Access to the Managed Identity of the SQL Server
















5) Master Key
We are finished in the Azure portal and now it's time to start with the actual PolyBase code. Start SQL Server Management Studio (SSMS), but make sure your Synapse is not paused.

The first step is to create a master key to encrypt any secrets, but only if you do not already have one (although we will not use any secrets here). You can check that in the table sys.symmetric_keys: if a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Otherwise we need to create one. For Synapse a master key password is optional. For this example we will not use the password.
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END 


6) Credentials
The next step is to create a credential which will be used to access the Storage Account. For a Managed Identity you don't use secrets:
--Credential
CREATE DATABASE SCOPED CREDENTIAL bitools_msi
WITH
    IDENTITY = 'Managed Service Identity'
;

Tip:
Give the credential a descriptive name so that you know what it is used for. You can find all credentials in the table sys.database_credentials:
--Find all credentials
SELECT * FROM sys.database_credentials


7) External data source
With the credential from the previous step we will create an External data source that points to the Storage Account and container where your file is located. Execute the code below where:
  • TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
  • LOCATION = the connection string to the container in your Storage Account starting with abfss.
  • CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE bitoolsstorage_abfss
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://mycontainer@bitoolsstorage.dfs.core.windows.net',
    CREDENTIAL = bitools_msi
);

Tip:
Give the external source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.


8) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip:
Give the format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats

9) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the datasource from step 7 and the file format from step 8.
--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] nvarchar(50) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss, -- from step 7
    FILE_FORMAT=TextFile              -- from step 8
);
Note:
PolyBase does not like column name headers. It will handle the header like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However, this only works when the datatype of the header is different from the datatypes of the actual rows. Otherwise you have to filter the header row in a subsequent step (see the sketch after the next code block).
--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata2 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);
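If the datatypes of the header and the data rows do not differ (for example when you declare every column as nvarchar), the header simply ends up as an ordinary row and you can filter it out in a subsequent query. A minimal sketch, assuming a hypothetical all-nvarchar variant of the external table called dbo.sensordata_raw:
--Filter out the header row in a subsequent query
SELECT  [Date], [temp], [hmdt], [location]
FROM    dbo.sensordata_raw        -- hypothetical all-nvarchar external table
WHERE   [Date] <> 'Date';         -- skip the row that contains the column headers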
You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables
You can also find the External Table (as well as the External Data Source and the External File Format) in the Object Explorer of SSMS.
SSMS Object Explorer

























10) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file, the data in this external table changes instantly because the file is read to get the data.
--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
Querying an external table

























Conclusion
In this post you learned how to give the Managed Identity of your SQL Server access to your Storage Account, which saves you the maintenance of secrets. You also learned how to use PolyBase to read files from that Storage Account using the Managed Identity.




Monday, 23 October 2017

Use PolyBase to read Data Lake in Azure SQL DW

Case
I have a file in an Azure Data Lake Store (ADLS) folder which I want to use in my Azure SQL Data Warehouse. In a previous blog post you used PolyBase to get the data from an Azure Blob Storage container via its access keys. How can I use PolyBase to get the data from ADLS and push the content of that file to Azure SQL DW?
Azure SQL Data Warehouse - PolyBase on ADLS















Solution
In the previous blog post we showed how to read that file from an Azure Blob Storage container via its access keys using PolyBase. However, ADLS does not work with those keys; it uses Azure Active Directory to provide access. To get authorization via Azure Active Directory we need to register a 'Web app / API' application in Azure Active Directory that does the authorization for us. That sounds very difficult and the documentation on MSDN is not very helpful, but in the end it was quite easy.

a) Starting point
The starting point of this blog post is a file in an ADLS folder called 'mySubFolder'. The file was created in a previous blog post about U-SQL, which can quickly process large numbers of data files in Azure Data Lake Analytics (ADLA). The name of our ADLS is 'bitools'.
Starting point: CSV file in ADLS
The content of the CSV file
























a1) App registrations
Go to the Azure portal and search for Azure Active Directory in the search box located in the header. This will bring you to the Azure Active Directory of your subscription. Then click on App registrations in the menu. It will show a list of all existing registrations. The next step is to click on New application registration to create a new registration for our data lake.
New application registration














a2) New application registration
Enter a new descriptive name like 'Data Lake bitools' so you will know what it is used for. Choose 'Web app / API' as Application type and then you need to enter a URL. Since we are not using the Sign-on URL property (we use the Azure sign-on), you can just enter any URL like 'https://microsoft-bitools.blogspot.com'. When complete, click on the Create button.
New application registration

























a3) Edit application registration - Application ID
Now search for your newly created Application registration to get its Application ID. Copy that to a notepad (we need it later on). You can also edit additional properties, like giving it a custom logo to make it more recognizable if you have an extensive list of app registrations.
Copy Application ID

















a4) Edit application registration - Keys
Continue editing and click on Keys in the menu to create a new access key. Give it a suitable name and expiration period. After clicking Save, make sure to copy the generated key to the same notepad as before, since you can only retrieve it once! If you lose it you have to delete and recreate it.
Create new key










a5) Active Directory ID
Now go back to your Azure Active Directory to copy the Directory ID. You can find it when you click on Properties in the menu. Copy this ID to the same notepad that now should contain three values (ApplicationID, generated key and DirectoryID).
Get Directory ID






















a6) Setting access root folder
Go to your ADLS and click on Data Explorer. You are now in the root of your ADLS. Click on Access and then on Add to assign new permissions. Search for your Registered Application called 'bitools'. Then select it and click on the Select button. In the root folder we only need Execute permissions. 'Add to' should stay on 'This folder' and 'Add as' should stay on 'An access permission entry'. Click on the Ok button to confirm.
Setting permissions on root folder










If you forget to give Execute permissions to the root folder you will get an error when adding an external table later on:
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
[dbd91a77-1b0a-4f11-9710-d7c1b6b05268][2017-10-21T12:26:47.7748687-07:00]: Error [GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
[dbd91a77-1b0a-4f11-9710-d7c1b6b05268][2017-10-21T12:26:47.7748687-07:00]] occurred while accessing external file.'


a7) Setting access sub folder
Now we have to repeat this for our subfolder called 'mySubFolder'. Click on the folder and you should see the source file. Click on Access and then on Add to assign new permissions. Search for your Registered Application called 'bitools'. Then select it and click on the Select button. In this sub folder we need Read and Execute permissions. 'Add to' should be changed to 'This folder and all children' and 'Add as' should stay on 'An access permission entry'. Click on the Ok button to confirm.
Setting permissions on sub folder










An alternative could be to give the bitools app read and execute rights on the root including all children. That saves you one step, but is less secure if you use your Data Lake for multiple purposes.


b) SSMS
Now it's time to start with the actual PolyBase code, but before we start, make sure your Azure SQL Data Warehouse is started and use SQL Server Management Studio (SSMS) to connect to your Data Warehouse. Notice that the icon of a SQL DW is different from that of a SQL DB.
Icon SQL DW vs SQL DB
















b1) Master key
In the next step we will use a credential that points to the registered application. To encrypt that credential, we first need to create a master key in our Azure SQL Data Warehouse, but only if you do not already have one. You can check that in the table sys.symmetric_keys: if a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Otherwise we need to create one. For Azure SQL Data Warehouse a master key password is optional. For this example we will not use the password.
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END 

b2) Credentials
The next step is to create a credential which will be used to access the subfolder in ADLS. For this you need the IDs and key from the notepad. The IDENTITY has the following format:
[ApplicationID]@https://login.windows.net/[DirectoryID]/oauth2/token
Replace ApplicationID (including the square brackets) with the ID from step a3 and DirectoryID (including the square brackets) with the ID from step a5. The SECRET should be filled with the key from step a4. After setting the correct IDs and key, execute the following code:
--Credential
CREATE DATABASE SCOPED CREDENTIAL bitools_user
WITH
    IDENTITY = 'aaf0ab52-560e-40b1-b4df-caac1f0e5376@https://login.windows.net/3xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxb/oauth2/token',
    SECRET = '6LUnE4shZ4p1jUhj7/fkLH03yfbSxi2WRWre9c0yVTs='
;

Tip:
Give the credential a descriptive name so that you know what it is used for. You can find all credentials in the table sys.database_credentials:
--Find all credentials
SELECT * FROM sys.database_credentials

b3) External data source
With the credential from the previous step we will create an External data source that points to the ADLS folder where your file is located. Execute the code below where:
  • TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
  • LOCATION = the connection string to ADLS (replace 'bitools' with the name of your own ADLS name).
  • CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
TYPE = HADOOP,
LOCATION = 'adl://bitools.azuredatalakestore.net',
CREDENTIAL = bitools_user
);

Tip:
Give the external source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.


b4) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip:
Give the format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats

b5) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the datasource from step b3 and the file format from step b4.
--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/mySubFolder/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureDataLakeStore, -- from step b3
    FILE_FORMAT=TextFile            -- from step b4
);
Note:
PolyBase does not like column name headers. It will handle the header like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However, this only works when the datatype of the header is different from the datatypes of the actual rows. Otherwise you have to filter the header row in a subsequent step.
--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata5 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/mySubFolder/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureDataLakeStore,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);
You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables
You can also find the External Table (as well as the External Data Source and the External File Format) in the Object Explorer of SSMS.
SSMS Object Explorer
























b6) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file, the data in this external table changes instantly because the file is read to get the data.
--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
Querying an external table
























b7) What is next?
Most likely you will be using a CTAS query (Create Table As Select) to copy and transform the data to another table, since this is the fastest/preferred way in SQL DW. In a subsequent post we will explain more about CTAS, but here is what a CTAS query looks like.
--CTAS
CREATE TABLE [dbo].[Buildings]
WITH
(
    DISTRIBUTION = ROUND_ROBIN
,   CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
FROM    [dbo].[sensordata]
;

In some cases you could also use a SELECT INTO query as an alternative for CTAS.
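A minimal sketch of such a SELECT INTO statement, assuming your SQL DW version supports it (unlike CTAS you cannot specify the distribution or index options, so the defaults are used; [dbo].[BuildingsCopy] is just a hypothetical target table):
--SELECT INTO as an alternative for CTAS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
INTO    [dbo].[BuildingsCopy]
FROM    [dbo].[sensordata]
;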

Summary
In this post you saw how easy it is to read a file from the Azure Data Lake Store and use it as a table in Azure SQL Data Warehouse, although it did require some extra steps compared to PolyBase on Azure Blob Storage. Jhon Masschelein (B|L|T) has a very helpful post about this matter.

In another post we will explain the basic usage of the CTAS query, which is the preferred way to handle large sets of data in Azure SQL DW and in its on-premises precursor APS (a.k.a. PDW).

Sunday, 20 August 2017

Use PolyBase to read Blob Storage in Azure SQL DW

Case
I have a file in an Azure Blob Storage container which I want to use in my Azure SQL Data Warehouse. How can I push the content of that file to Azure SQL DW?
One of the options: PolyBase














Solution
You could of course use an ETL product or Azure Data Factory, but you can also use PolyBase technology in Azure SQL DW and use that file as an external table. The data stays in the Azure Blob Storage file, but you can query the data like a regular table.

Starting position
The starting position is a file in an Azure Blob Storage container. This file was created with U-SQL in another post to quickly process large numbers of files in Azure.
The content of the CSV file














Before we start, make sure your Azure SQL Data Warehouse is started and use SQL Server Management Studio (SSMS) to connect to your Data Warehouse. Notice that the icon of a SQL DW is different from that of a SQL DB.
Icon SQL DW vs SQL DB















1) Master key
In the next step we will use a credential that points to the Azure Blob Storage. To encrypt that credential, we first need to create a master key in our Azure SQL Data Warehouse, but only if you do not already have one. You can check that in the table sys.symmetric_keys: if a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Otherwise we need to create one. For Azure SQL Data Warehouse a password for that master key is optional. For this example we will not use the password.
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END 

2) Credentials
The next step is to create a credential which will be used to access the Azure Blob Storage. Go to the Azure portal and find the Storage Account that contains your blob file. Then go to the Access keys page and copy key1 (or key2).
Access keys















Then execute the following code where IDENTITY contains a random string and SECRET contains the copied key from your Azure Storage account.
--Credential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = 'JGadV/tAt1npuNwkiH9HnI/wosi8YS********=='
;

Tip: give the credential a descriptive name so that you know what it is used for. You can find all credentials in the table sys.database_credentials:
--Find all credentials
SELECT * FROM sys.database_credentials

3) External data source
With the credential from the previous step we will create an External data source that points to the Azure Blob Storage container where your file is located. Execute the code below where:
  • TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
  • LOCATION = the connection string to the container (replace [ContainerName] with the name of the container and [StorageAccountName] with the name of your storage account).
  • CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://[ContainerName]@[StorageAccountName].blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

Tip: give the external source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same container as External Tables.
Filename not in External Data Source
















4) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip: give the format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats

5) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the datasource from step 3 and the file format from step 4.
--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureStorage, -- from step 3
    FILE_FORMAT=TextFile      -- from step 4
);
Note: PolyBase does not like column name headers. It will handle the header like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However, this only works when the datatype of the header is different from the datatypes of the actual rows. Otherwise you have to filter the header row in a subsequent step.
--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata5 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureStorage,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);
You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables
You can also find the External Table (as well as the External Data Source and the External File Format) in the Object Explorer of SSMS.
SSMS Object Explorer























6) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file, the data in this external table changes instantly because the file is read to get the data.
--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
Querying an external table
























7) What is next?
Most likely you will be using a CTAS query (Create Table As Select) to copy and transform the data to another table, since this is the fastest/preferred way in SQL DW. In a subsequent post we will explain more about CTAS, but here is what a CTAS query looks like.
--CTAS
CREATE TABLE [dbo].[Buildings]
WITH
(
    DISTRIBUTION = ROUND_ROBIN
,   CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
FROM    [dbo].[sensordata]
;

In some cases you could also use a SELECT INTO query as an alternative for CTAS.

Summary
In this post you saw how easy it is to read a file from Azure Blob Storage and use it as a table in Azure SQL Data Warehouse. The big advantage of PolyBase is that you only have one copy of the data, because the data stays in the file. In the next post we will see how to read the same file from the Azure Data Lake Store, which does not use access keys.
In another post we will explain the basic usage of the CTAS query, which is the preferred way to handle large sets of data in Azure SQL DW and in its on-premises precursor APS (a.k.a. PDW).