
Sunday 5 December 2021

Delta Lake support in Azure Synapse Analytics

Case
Delta Lake is already widely used in Azure Databricks, but now it is also available in Synapse Analytics. How can I use it there to store history and do 'time travel' through my historical data?
Synapse now supports Delta Lake











Solution
Delta Lake is now Generally Available (GA) in Synapse Analytics, but at the time of writing Microsoft is still implementing new Delta Lake features in Synapse. 

This example uses a Synapse Pipeline with a Copy Data Activity to ingest data from the source and then calls a Python Delta Lake script (other languages are possible) via the Notebook activity.












Prerequisites
You need to make sure that you (for debugging) and the Managed Service Identity (MSI) of your Synapse Analytics workspace have access to the Azure Data Lake with the Storage Blob Data Contributor role.
  • In the Azure Portal go to the Storage Account used by the Synapse Analytics workspace
  • In the left menu click on Access Control (IAM)
  • Click on + Add and choose Add role assignment
  • Search for Storage Blob Data Contributor, select the role and click on Next
  • Click on + Select members, find your Synapse workspace and yourself, and click Select
  • Optionally add a description explaining why. Then click on Review + assign (twice)
At the time of writing our Apache Spark pool uses version 3.1 with Delta Lake 1.0. If you are using an older version of Spark (2.4) then you get Delta Lake version 0.6, which is slightly different. If newer versions appear then just try the newest Spark pool.
Apache Spark pool Additional Settings
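If you are not sure which versions your Spark pool runs, a quick check from a notebook cell could look like the sketch below. Note that the delta-spark package name is an assumption and may differ per Synapse runtime.
# Print the Spark version of the attached Apache Spark pool
print(spark.version)

# The bundled Delta Lake version is listed in the Synapse runtime release notes.
# If the OSS delta-spark package is exposed, this may also work (assumption):
# import importlib.metadata
# print(importlib.metadata.version('delta-spark'))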

1) Code cell 1: parameters
The first code cell is for the parameters that can be overridden by parameters from the Notebook activity in the pipeline. For more details see our post about notebook parameters. For debugging within the notebook we used real values.
# path of the data lake container
data_lake_container = 'abfss://yourbronzecontainer@yourdatalake.dfs.core.windows.net'

# The ingestion folder where your parquet files are located
ingest_folder = 'parquetstage'

# The bronze folder where your Delta Tables will be stored
bronze_folder = 'bronze'

# The name of the table
table_name = 'residences'

# The wildcard filter used within the ingest folder to find files
source_wildcard = 'residences*.parquet'

# A comma separated string of one or more key columns (for the merge)
key_columns_str = 'Id'
Parameters









2) Code cell 2: import modules and functions
The second code cell is for importing all required/useful modules. For this basic example we need three imports:
  • DeltaTable from delta.tables for handling Delta tables
  • notebookutils for file system utilities (removing the Delta table folder)
  • os for building the file paths in the next code cell
# Import modules
import os
from delta.tables import DeltaTable
from notebookutils import mssparkutils
Imports







3) Code cell 3: filling delta lake
Now the actual code for filling the Delta Lake tables with parquet files from the data lake. Note that the code is very basic: it checks whether the Delta Lake table already exists. If not, it creates the Delta Lake table; if it already exists, it merges the new data into the existing table. If you have transactional data then you could also do an append instead of a merge.

# Convert comma separated string with keys to array
key_columns = key_columns_str.split(',')  
 
# Convert array with keys to where-clause for merge statement
conditions_list = [f"existing.{key}=updates.{key}" for key in key_columns]
 
# Determine path of source files from ingest layer
source_path = os.path.join(data_lake_container, ingest_folder, source_wildcard)

# Determine path of Delta Lake Table
delta_table_path = os.path.join(data_lake_container, bronze_folder, table_name)

# Read file(s) in spark data frame
sdf = spark.read.format('parquet').option("recursiveFileLookup", "true").load(source_path)
 
# Check if the Delta Table exists
if (DeltaTable.isDeltaTable(spark, delta_table_path)):
    print('Existing delta table')
    # Read the existing Delta Table
    delta_table = DeltaTable.forPath(spark, delta_table_path)
 
    # Merge new data into existing table
    delta_table.alias("existing").merge(
        source = sdf.alias("updates"),
        condition = " AND ".join(conditions_list)
         
    ).whenMatchedUpdateAll(
    ).whenNotMatchedInsertAll(
    ).execute()
 
    # For transactions you could do an append instead of a merge
    # sdf.write.format('delta').mode('append').save(delta_table_path)
 
else:
    print('New delta table')
    # Create new delta table with new data
    sdf.write.format('delta').save(delta_table_path)
Adding file to Delta Lake

4) Viewing the Delta Table in notebook
If you run the notebook with the code of the first three steps a couple of times with changed/extra/fewer records, then history will be built up in the Delta table. For debugging purposes you can add an extra code cell to view the data and the various versions of the data.

To check the current version of the data you can use the following code:
display(spark.read.format('delta').load(delta_table_path))
Get current version of data













And with this code you can investigate the historical versions of the data. In this case there are two versions:
# Get all versions
delta_table = DeltaTable.forPath(spark, delta_table_path)
display(delta_table.history())
Get versions of data








To retrieve one specific version you could use something like this (where the 0 is the version from the above picture):
# Get one specific version
display(spark.read.format("delta").option("versionAsOf", "0").load(delta_table_path))
Get specific version of data












You can also use a datetime to retrieve data from the Delta Lake by using timestampAsOf instead of versionAsOf:
# Get one specific version with timestamp filter
display(spark.read.format("delta").option("timestampAsOf", "2021-12-05 19:07:00.000").load(delta_table_path))
Get specific version of data with datetime filter













To remove the entire Delta Lake table (and all history) you could use something like:
# Delete Delta Table (folder)
mssparkutils.fs.rm(delta_table_path, recurse=True)
Delete Delta Table







5) Viewing the Delta Table in Serverless SQL Pool
At the moment of writing you can query the Delta Lake in a Serverless SQL Pool, but you cannot yet use the 'time travel' feature. Please upvote this feature here.

The first option is to use an OPENROWSET query within a SQL Script in your Synapse Workspace:
-- Query the Delta Lake
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/places/',
    FORMAT = 'delta') as rows
ORDER BY Id;
Query the Delta Lake via an OPENROWSET query

A second option is using PolyBase by creating an External Table on the Delta Lake. This does require you to create a database within the Serverless SQL Pool, because you can't do that in the master database.
-- Query the Delta Lake

-- Create database because it won't work on the master database
CREATE DATABASE MyDwh;

-- Create External Data Source
CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
WITH ( location = 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/' );

-- Create External File Format
CREATE EXTERNAL FILE FORMAT DeltaLakeFormat
WITH ( FORMAT_TYPE = DELTA );

-- Create External Table
CREATE EXTERNAL TABLE Residence (
     Id int,
     Residence VARCHAR(50)
) WITH (
        LOCATION = 'places', --> the root folder containing the Delta Lake files
        DATA_SOURCE = DeltaLakeStorage,
        FILE_FORMAT = DeltaLakeFormat
);

-- Get Data from your Delta Lake Table
SELECT          TOP 10 * 
FROM            Residence
ORDER BY        Id

Conclusion
In this post you learned how to create and query a Delta Lake within your Synapse Analytics Workspace. The main advantage is of course that you no longer need Azure Databricks if you are already using Synapse, making your Data Platform architecture slightly cleaner and easier to maintain. 

A disadvantage, at the moment of writing, is the lack of time travel within the Serverless SQL Pool environment. This means you're currently forced to use notebooks to create your Data Warehouse when the latest version of your data is just not enough. So please upvote this feature here. There are some more limitations and known issues in the current version, but we think at least some of them will be solved in future updates.

Thanks to colleague Jeroen Meidam for helping!


Saturday 27 November 2021

Synapse pipeline pass parameter to notebook

Case
I have a Synapse workspace notebook that I call from a Synapse pipeline, but I want to make it more flexible by adding parameters. How do you add parameters to a notebook and fill them via a pipeline?
Adding Parameters to your Synapse Notebook

Solution
You can add variables to a special Code cell in the notebook and then use those as parameters within the Notebook activity. At the moment there is no real GUI for retrieving the parameters from the notebook, so you have to copy the names from the notebook to the Notebook activity in the pipeline.

1) Add Code cell for parameters
We need to add a Code cell and change it into a parameter cell. Note that you can have only one parameter cell in your notebook. You want to add it somewhere at the top so that you can use its variables/parameters in the cells below this parameter cell.
  • Go to your notebook and add a new Code cell
  • Move it up. It should probably be your top code cell allowing you to use it in the cells below.
  • Click in the cell and then on the ellipsis button of that cell (button up right with three dots)
  • Choose Toggle parameter cell and you will see the word Parameters appear in the bottom right corner
Toggle parameter cell










2) Add variables to parameters cell
Next we need to add some code to the parameter cell. Here you just need to add some variables; each variable can then be overridden by the pipeline and used in the cells below. For debugging it is useful to give the variables a default value. For this example we used Python code.
Adding variables
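As an illustration, a parameter cell for the Delta Lake post above could look something like the sketch below (the variable names and default values are just examples):
# Parameters cell: default values that can be overridden by the Notebook activity
data_lake_container = 'abfss://yourbronzecontainer@yourdatalake.dfs.core.windows.net'
ingest_folder = 'parquetstage'
bronze_folder = 'bronze'
table_name = 'residences'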








3) Adjust Synapse Notebook activity
The last step is to edit the Synapse Notebook activity and add the parameters. For each variable you added to the parameter cell you can add a parameter in the Notebook activity. At the moment there is no smart interface that lets you select a parameter and set its value; you have to set the name and data type manually.
Adding parameters














4) Testing
Now run the pipeline to see the result. For this example we added a second Code cell with a print function to show that the default values have changed. Trigger the pipeline and go to the Monitor. Then click on your pipeline and within that pipeline on the Notebook activity. If you click on the pencil icon, the notebook will open and allow you to see the result.
Click on the pencil to open the Notebook













Note the extra cell and the result of the third cell

Conclusion
In this short post you learned how to add parameters to your notebook and fill them via the pipeline. And as an additional bonus you saw how to check the result of the changes. A next step is for example to add the notebook to a Foreach loop that ingests data into the data lake and then executes the notebook to create a Delta Lake table for each item in the Foreach loop.


Monday 1 March 2021

Pausing/resuming Dedicated SQL Pools from Synapse

Case
In a previous post we described how we can pause and resume our Azure Dedicated SQL pools via Azure Data Factory. The question we are going to solve now: "How can we pause and resume from within an Azure Synapse workspace"?


Pause / Resume SQL dedicated Pools

Solution
Pausing and resuming from within an Azure Synapse workspace is very similar, but there are some differences because there are two types of Azure Dedicated SQL Pools. When you create one from within the Azure Synapse Workspace the provider will be 'Microsoft.Synapse' (hosted in Synapse), but when you create one outside the Azure Synapse Workspace the provider will be 'Microsoft.Sql' (hosted in SQL Server). Both have different Rest APIs and different security roles. In this example we will demo the Synapse-hosted version, but we will also mention the differences. For the SQL version you could also take a look at our previous post, which handles it from within Data Factory.


1) Give Azure Synapse Access to SQL Dedicated Pools
To call the Rest API we need to grant our Azure Synapse workspace access to the Dedicated SQL Pool, or more specifically to the Synapse workspace hosting it. Perhaps a bit odd, but we need to grant Synapse access to its own resources (for the SQL version you need to grant access to the SQL Server hosting the SQL Pool). 

For the security role you always need to avoid 'Owner' and use 'Contributor'. This is because the Owner role can also change security-related items (for the SQL version you can use 'SQL Server Contributor').
  • In the Azure portal go to the Azure Synapse Workspace hosting the SQL Pool that you want to pause or resume
  • In the left menu click on Access control (IAM)
  • Click on Add, Add role assignment
  • In the 'Role' drop down select 'Contributor'
  • In the 'Assign access to' drop down select user, group or service principal.
  • Search for your Synapse workspace name (in our example 'gansdorp'), select it and click on Save

Grant Synapse Contributor role to SQL Server

If you forget this step then you will receive an authorization error while executing your Synapse pipeline.
2108 Authorization Failed

 












{"error":
{"code":"AuthorizationFailed"
,"message":"The client 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Synapse/workspace/resume/action' over scope '/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/RG_bitools/providers/Microsoft.Synapse/workspace/gansdorp/dedsqlpools' or the scope is invalid. If access was recently granted, please refresh your credentials."}
}

2) Determine URL
Now it is almost time to edit your Synapse pipeline. The first step will be adding a Web activity to call the Rest API (for the SQL version you must use this Rest API). Find the operation you want to perform and then find the example URL to construct the new URL.

Pause compute
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Synapse/workspaces/{workspacename}/sqlPools/{DedicatedSQLPoolName}/pause?api-version=2019-06-01-preview

Resume compute
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Synapse/workspaces/{workspacename}/sqlPools/{DedicatedSQLPoolName}/resume?api-version=2019-06-01-preview

Within these URLs you need to replace all parts that start and end with a curly bracket: {subscription-id}, {resource-group-name}, {workspacename} and {DedicatedSQLPoolName} (including the brackets themselves).

Example URL
https://management.azure.com/subscriptions/aaaa-bbbb-1234-cccc/resourceGroups/RG-DEV/providers/Microsoft.Synapse/workspaces/gansdorp/sqlPools/dedsqlpool/pause?api-version=2019-06-01-preview
Example Rest API URL










3) Add Web Activity
To call the Rest API we will use the Web Activity in the Synapse pipeline. The actual Rest API call is synchronous, which means it waits for the pause or resume action to finish and then returns a message. This also means that you don't have to build any checks to make sure the pool is already online.
  • Add a Web activity to your pipeline and give it a suitable name
  • Go to the Settings tab and use the URL from the previous step in the URL property
  • Choose POST as method
  • Fill in {} as body (we don't need it, but it is required)
  • Choose MSI as authentication method
  • As the last step enter this URL https://management.azure.com/ as Resource

web activity

To first check the current state of the SQL Pool you can use the Get Rest API and then use an If Condition to check its output with an expression like @Activity('Get Current State').Output.Properties.State
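The Get call itself is a plain GET request on the SQL Pool resource; based on the pause and resume URLs above, the URL pattern would look something like this (assumed from the same Rest API):
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Synapse/workspaces/{workspacename}/sqlPools/{DedicatedSQLPoolName}?api-version=2019-06-01-preview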

Summary
In this post you learned how to pause and resume your Dedicated SQL Pool within Azure Synapse Studio; the method is very similar to the ADF version. There are some differences, but those are mostly caused by the service that hosts the SQL Pool, which can be either a SQL Server or your own Synapse Workspace. It is expected that one of the two will probably disappear, because it isn't likely that Microsoft will keep maintaining two different services, each with their own set of Rest APIs.


Monday 11 January 2021

Scaling Azure Dedicated SQL Pools from ADF

Case
Is there a solution to upscale and downscale my Azure Dedicated SQL Pool from the Azure Data Factory pipeline without scripting? I know there are PowerShell solutions, but I would rather use a no-code solution. What are my options?
Scaling Azure Dedicated SQL Pools

Solution
Fortunately you can now use the Rest APIs of Azure Dedicated SQL Pools (formerly known as Azure SQL Data Warehouse and for a short period as Azure Synapse Analytics) to scale the compute up or down, so no coding is required.

1) Give ADF Access to SQL Pool
To call the Rest API we need to give ADF access to the SQL Pool, or more specifically to the SQL Server hosting that SQL Pool. We need a role that can only change the database settings, but nothing security related: Contributor, SQL DB Contributor or SQL Server Contributor.
  • Go to the Azure SQL Server of the SQL Pool that you want to scale up or down with ADF
  • In the left menu click on Access control (IAM)
  • Click on Add, Add role assignment
  • In the 'Role' drop down select 'SQL DB Contributor'
  • In the 'Assign access to' drop down select Data Factory
  • Search for your Data Factory, select it and click on Save
Grant data factory SQL DB Contributor role to SQL Server














If you forget this step then you will receive an authorization error while executing your ADF pipeline.
2108 Authorization Failed

 












{"error":
{"code":"AuthorizationFailed"
,"message":"The client 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Sql/servers/databases/resume/action' over scope '/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/RG_bitools/providers/Microsoft.Sql/servers/SQL_bitools/databases/bitools' or the scope is invalid. If access was recently granted, please refresh your credentials."}
}

2) Determine URL
Now it is almost time to edit your ADF pipeline. The first step will be adding a Web activity to call the Rest API, but before we can do that we need to determine the URL of this API which you can find here.

Scaling
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}?api-version=2014-04-01-preview

Within this URL you need to replace all parts that start and end with a curly bracket: {subscription-id}, {resource-group-name}, {server-name} and {database-name} (including the brackets themselves). Don't use a URL (bitools.database.windows.net) for the database server name, but use only the name: bitools.

Example URL
https://management.azure.com/subscriptions/aaaa-bbbb-1234-cccc/resourceGroups/RG_Bitools/providers/Microsoft.Sql/servers/bitools2/databases/bitools?api-version=2014-04-01-preview
Example URL









3) JSON message for Rest API
The Rest API above expects a JSON message with the pricing tier. A list of all pricing tiers can be found here in the column 'Data warehouse units'. Here are two examples which you need to adjust to your requirements:
{
    "properties": {
        "requestedServiceObjectiveName": "DW200c"
    }
}
or
{
    "properties": {
        "requestedServiceObjectiveName": "DW1000c"
    }
}

Note: just in case you get one of the errors below: the JSON example in the documentation is incorrect at the moment of writing:
  • Quotation marks around the data warehouse units are missing, which returns the following error:{"error":{"code":"InvalidRequestContent","message":"The request content was invalid and could not be deserialized: 'Unexpected character encountered while parsing value: D. Path 'properties.requestedServiceObjectiveName', line 3, position 41.'."}}
  • The c is missing after the data warehouse units (gen1 vs gen2) which returns the following error:
    {"code":"45122","message":"\u0027Azure SQL Data Warehouse Gen1 has been deprecated in this region. Please use SQL Analytics in Azure Synapse.\u0027","target":null,"details":[{"code":"45122","message":"\u0027Azure SQL Data Warehouse Gen1 has been deprecated in this region. Please use SQL Analytics in Azure Synapse.\u0027","target":null,"severity":"16"}],"innererror":[]}



4) Add Web Activity
To call the Rest API we will use the Web Activity in the ADF pipeline. The actual Rest API call is synchronous, which means it waits for the scaling action to finish and then returns a message. This also means that you don't have to build any checks to make sure scaling has completed.
  • Add a Web activity to your pipeline and give it a suitable name
  • Go to the Settings tab and use the URL from step 2 in the URL property
  • Choose PATCH as method
  • Add a new header with the name 'Content-Type' and the value 'application/json'
  • Fill in the JSON message from step 3 as body
  • Choose MSI as authentication method
  • As the last step enter this URL https://management.azure.com/ as Resource
Use a Web activity to call the Rest API

If you want to scale up before your ETL/ELT process and scale down afterwards then you need two separate Web activities or one clever child pipeline with parameters that you execute from your main pipeline.
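For the parameterized child pipeline option, the body of the Web activity could be built with dynamic content. A minimal sketch, assuming a hypothetical string parameter named DWU on the child pipeline:
@concat('{"properties":{"requestedServiceObjectiveName":"', pipeline().parameters.DWU, '"}}')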


Summary
In this post you learned how to upscale and downscale your Dedicated SQL Pool to save some money on your Azure bill without writing any code. Note that at the moment of writing live scaling is not yet available and that you will lose the connection to your Dedicated SQL Pool for a couple of minutes.

Also note that ADF pipelines slightly differ from Azure Synapse Analytics pipelines. So if you consider switching to Synapse workspaces, because you apparently already use Dedicated SQL Pools, then you have to make some small adjustments to this specific task, which will be described in a next post. In another post we already showed how to pause and resume your Azure SQL Pools from within ADF.



Wednesday 6 January 2021

Pausing and resuming Dedicated SQL Pools from ADF

Case
Is there a solution to pause and resume an Azure Dedicated SQL Pool from the Azure Data Factory pipeline without scripting? I know there are PowerShell solutions, but I would rather use a no-code solution. What are my options?
Pause and resume Azure Dedicated SQL Pools

Solution
Luckily you can now use the Rest APIs of Azure Dedicated SQL Pools (formerly known as Azure SQL Data Warehouse and for a short period as Azure Synapse Analytics) to pause or resume the compute.

1) Give ADF Access to SQL Pool
To call the Rest API we need to give ADF access to the SQL Pool, or more specifically to the SQL Server hosting that SQL Pool. We need a role that can only change the database settings, but nothing security related: Contributor, SQL Server Contributor or SQL DB Contributor. Choose the role with just enough permissions to perform the task and avoid the Owner role.
  • Go to the Azure SQL Server of the SQL Pool that you want to pause or resume with ADF
  • In the left menu click on Access control (IAM)
  • Click on Add, Add role assignment
  • In the 'Role' drop down select 'SQL DB Contributor'
  • In the 'Assign access to' drop down select Data Factory
  • Search for your Data Factory, select it and click on Save
Grant data factory SQL DB Contributor role to SQL Server














If you forget this step then you will receive an authorization error while executing your ADF pipeline.
2108 Authorization Failed

 












{"error":
{"code":"AuthorizationFailed"
,"message":"The client 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Sql/servers/databases/resume/action' over scope '/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/RG_bitools/providers/Microsoft.Sql/servers/SQL_bitools/databases/bitools' or the scope is invalid. If access was recently granted, please refresh your credentials."}
}

2) Determine URL
Now it is almost time to edit your ADF pipeline. The first step will be adding a Web activity to call the Rest API, but before we can do that we need to determine the URL of this API which you can find here.

Pause compute
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}/pause?api-version=2014-04-01-preview

Resume compute
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}/resume?api-version=2014-04-01-preview

Within these URLs you need to replace all parts that start and end with a curly bracket: {subscription-id}, {resource-group-name}, {server-name} and {database-name} (including the brackets themselves). Don't use a URL (bitools.database.windows.net) for the database server name, but use only the name: bitools.

Example URL
https://management.azure.com/subscriptions/aaaa-bbbb-1234-cccc/resourceGroups/RG_Bitools/providers/Microsoft.Sql/servers/bitools2/databases/bitools/resume?api-version=2014-04-01-preview
Example URL










3) Add Web Activity
To call the Rest API we will use the Web Activity in the ADF pipeline. The actual Rest API call is synchronous, which means it waits for the pause or resume action to finish and then returns a message. This also means that you don't have to build any checks to make sure it is already online.
  • Add a Web activity to your pipeline and give it a suitable name
  • Go to the Settings tab and use the URL from the previous step in the URL property
  • Choose POST as method
  • Fill in {} as body (we don't need it, but it is required)
  • Choose MSI as authentication method
  • As the last step enter this URL https://management.azure.com/ as Resource
Use a Web activity to call the Rest API

You could also first check the current status via the Check database state Rest API and then use an expression like @activity('Get Status').output.properties.status to retrieve the current state.
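That check is a plain GET request on the database resource; based on the URLs from step 2, the pattern would look something like this (assumed from the same Rest API):
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}?api-version=2014-04-01-preview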

Summary
In this post you learned how to pause and resume your Dedicated SQL Pool to save some money on your Azure bill without writing any code. Note that you only pause the compute and that you still have to pay for the storage of the SQL Pool.

Also note that ADF pipelines slightly differ from Azure Synapse Analytics pipelines. So if you consider switching to Synapse workspaces, because you apparently already use Dedicated SQL Pools, then you have to make some small adjustments to this specific task, which will be described in a next post. In another post we will also show how to scale your Azure SQL Pools from within ADF.


Friday 2 October 2020

Pause all/an Azure Synapse SQL Pool(s) (Az)

Case
I want to schedule a pause of my Azure Synapse SQL Pools to save some money on my Azure bills. Back in 2017, when Synapse was still called SQL Data Warehouse, we wrote a post with the AzureRM modules, which are now outdated. How does it work with Az?
Pause and resume Synapse SQL Pools

Solution
For this example we will have two scripts. The first stops all Azure Synapse SQL Pools within an Azure subscription. This is especially handy for Development, Test or Acceptance environments where you only turn on the Synapse SQL Pools when you need to develop or test something; when you forget to pause them afterwards, this script will pause them all at a scheduled moment. The second script is for pausing (or resuming) one specific Synapse SQL Pool, which is probably more suitable for a production environment where you don't want to ruthlessly pause all Synapse SQL Pools.

Note: All basics to create your first Azure Automation Runbook can be found here. Combine that with the code below.

1) Modules
If you want to use this script in an Azure Automation Runbook you first need to add the module Az.Sql, and that module requires the module Az.Accounts to be installed first. If you already added AzureRM modules to your Azure Automation account then that is no problem, as long as you don't mix them within the same runbook. On your own PC you cannot have both installed: you first need to remove all AzureRM modules before you can add Az modules. 
  • Go to your Azure Automation account in the Azure Portal
  • Click on Modules in the left menu
  • Click on Browse Gallery (not on Add a module)
  • Search for Az.Sql and click on it
  • Next click on Import and then on the Ok button
Note 1: If you haven't installed Az.Accounts then it will ask you to do that first.
Note 2: It takes a few minutes to import a module
Adding a module

2) Login
If you want to run the script in PowerShell ISE then you first need to log in and, if you have multiple subscriptions, select your subscription.
# PowerShell code
# Login to Azure (browser popup will appear)
Connect-AzAccount -Confirm

# Optional: select your subscription
Set-AzContext -SubscriptionName "mysubscription"
Login to Azure with PowerShell ISE

In your Runbook you need to add the following code instead, which uses your Run As account to log in. Please read this blog post for all details.
# PowerShell code
########################################################
# Log in to Azure with AZ (standard code)
########################################################
Write-Verbose -Message 'Connecting to Azure'

# Name of the Azure Run As connection
$ConnectionName = 'AzureRunAsConnection'
try
{
    # Get the connection properties
    $ServicePrincipalConnection = Get-AutomationConnection -Name $ConnectionName        
 
    'Log in to Azure...'
    $null = Connect-AzAccount `
        -ServicePrincipal `
        -TenantId $ServicePrincipalConnection.TenantId `
        -ApplicationId $ServicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $ServicePrincipalConnection.CertificateThumbprint 
}
catch 
{
    if (!$ServicePrincipalConnection)
    {
        # You forgot to turn on 'Create Azure Run As account' 
        $ErrorMessage = "Connection $ConnectionName not found."
        throw $ErrorMessage
    }
    else
    {
        # Something else went wrong
        Write-Error -Message $_.Exception.Message
        throw $_.Exception
    }
}
########################################################

3) The pause all script
Below the login code we need to add the following script, which first gets all Azure SQL Servers and then checks whether they also host a Synapse SQL Pool. When one is found it checks its status; if it is still Online (active) then it will pause it, which takes several minutes. Afterwards it rechecks the status and shows how long the pause took. Feel free to add more checks and let us know in the comments what you added.
# PowerShell code
# Get all SQL Servers to check whether they host a Synapse SQL Pool
$allSqlServers = Get-AzSqlServer

# Loop through all SQL Servers
foreach ($sqlServer in $allSqlServers)
{
    # Log which SQL Servers are checked and in which resource group
    Write-Output "Checking SQL Server [$($sqlServer.ServerName)] in Resource Group [$($sqlServer.ResourceGroupName)] for Synapse SQL Pools"
    
    # Get all databases from a SQL Server, but filter on Edition = "DataWarehouse"
    $allSynapseSqlPools = Get-AzSqlDatabase -ResourceGroupName $sqlServer.ResourceGroupName `
                                           -ServerName $sqlServer.ServerName `
                                           | Where-Object {$_.Edition -eq "DataWarehouse"}
    # Loop through each found Synapse SQL Pool
    foreach ($synapseSqlPool in $allSynapseSqlPools)
    {
        # Show status of found Synapse SQL Pool
        # Available statuses: Online Paused Pausing Resuming
        Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] found with status [$($synapseSqlPool.Status)]"
        
        # If status is online then pause Synapse SQL Pool
        if ($synapseSqlPool.Status -eq "Online")
        {
            # Pause Synapse SQL Pool
            $startTimePause = Get-Date
            Write-Output "Pausing Synapse SQL Pool [$($synapseSqlPool.DatabaseName)]"
            $resultsynapseSqlPool = $synapseSqlPool | Suspend-AzSqlDatabase

            # Show that the Synapse SQL Pool has been paused and how long it took
            $endTimePause = Get-Date
            $durationPause = New-TimeSpan -Start $startTimePause -End $endTimePause
            $synapseSqlPool = Get-AzSqlDatabase -ResourceGroupName $sqlServer.ResourceGroupName `
                                                -ServerName $sqlServer.ServerName `
                                                -DatabaseName $synapseSqlPool.DatabaseName
            Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] paused in $($durationPause.Hours) hours, $($durationPause.Minutes) minutes and  $($durationPause.Seconds) seconds. Current status [$($synapseSqlPool.Status)]"
        }
    }
}
In PowerShell ISE the result will look like this. To see the result of your runbook you need to check the jobs and then the output.
Running the script in PowerShell ISE







4) The pause one script - parameters
To only select one Synapse SQL Pool we need to provide three parameters: the Resource Group name, the SQL Server name and the name of the SQL Pool. For this we will add parameter code at the beginning of the complete script (above the login code). You can add parameter validations to make it more monkey proof; a small validation sketch follows after the parameter code below.
# PowerShell code
<#
    .SYNOPSIS
        Pause an Azure Synapse SQL Pool
    .DESCRIPTION
        By providing the following parameters you can pause one
        specific Azure Synapse SQL Pool. It will only pause when
        the status is 'online'

    .PARAMETER resourceGroupName
        This is the Resource group where Azure Synapse Analytics
        SQL Pool is located

    .PARAMETER sqlServerName
        This is the name of the Azure SQL Server hosting the Azure
        Synapse Analytics SQL Pool
    
    .PARAMETER SynapseSqlPoolName
        This is the name of the Azure Synapse Analytics SQL Pool

#>
Param(
    # This is the Resource group where Azure Synapse Analytics SQL Pool is located   
    [Parameter(Mandatory=$True)]  
    [String] $resourceGroupName
    ,
    # This is the name of the Azure SQL Server hosting the Azure Synapse Analytics SQL Pool
    [Parameter(Mandatory=$True)]  
    [String] $sqlServerName
    ,
    # This is the name of the Azure Synapse Analytics SQL Pool
    [Parameter(Mandatory=$True)]  
    [String] $SynapseSqlPoolName
) 
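A minimal sketch of such a validation, assuming you only want to block empty values (the same attribute can be added to the other parameters):
# PowerShell code
# Example: reject empty values for the SQL Pool name parameter
Param(
    [Parameter(Mandatory=$True)]
    [ValidateNotNullOrEmpty()]
    [String] $SynapseSqlPoolName
)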

5) The pause one script
Now the actual script to pause only one SQL Pool; this replaces the script of step 3. If you also want a resume script you just have to replace Suspend-AzSqlDatabase with Resume-AzSqlDatabase and of course change some of the texts and the if statement (Online => Paused). You could also merge the pause and resume scripts into one by adding an extra parameter to indicate what you want to do (see the short sketch after the script below).
# PowerShell code
# Get one specific Synapse SQL Pool
$synapseSqlPool = Get-AzSqlDatabase -ResourceGroupName $resourceGroupName `
                                    -ServerName $sqlServerName `
                                    -DatabaseName $SynapseSqlPoolName `
                                    | Where-Object {$_.Edition -eq "DataWarehouse"}

# Check if the Synapse SQL Pool can be found with the provided parameters
if ($synapseSqlPool)
{
    # Show status of found Synapse SQL Pool
    # Available statuses: Online Paused Pausing Resuming
    Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] found with status [$($synapseSqlPool.Status)]"

    # If status is online then pause Synapse SQL Pool
    if ($synapseSqlPool.Status -eq "Online")
    {
        # Pause Synapse SQL Pool
        $startTimePause = Get-Date
        Write-Output "Pausing Synapse SQL Pool [$($synapseSqlPool.DatabaseName)]"
        $resultsynapseSqlPool = $synapseSqlPool | Suspend-AzSqlDatabase

        # Show that the Synapse SQL Pool has been paused and how long it took
        $endTimePause = Get-Date
        $durationPause = New-TimeSpan -Start $startTimePause -End $endTimePause
        $synapseSqlPool = Get-AzSqlDatabase -ResourceGroupName $resourceGroupName `
                                            -ServerName $sqlServerName `
                                            -DatabaseName $SynapseSqlPoolName
        Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] paused in $($durationPause.Hours) hours, $($durationPause.Minutes) minutes and  $($durationPause.Seconds) seconds. Current status [$($synapseSqlPool.Status)]"
    }
}
else
{
    Throw "Synapse SQL Pool [$($SynapseSqlPoolName)] not found. Check parameter values."
}
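A minimal sketch of that merged approach, assuming an extra mandatory parameter $action with the values 'Pause' and 'Resume':
# PowerShell code
# Example: one script for both actions, driven by an extra $action parameter
if ($action -eq "Pause" -and $synapseSqlPool.Status -eq "Online")
{
    $synapseSqlPool | Suspend-AzSqlDatabase
}
elseif ($action -eq "Resume" -and $synapseSqlPool.Status -eq "Paused")
{
    $synapseSqlPool | Resume-AzSqlDatabase
}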

Running the script in PowerShell ISE





Summary
In this post you saw how to pause one or more Synapse SQL Pools to save some money on your Azure bill, but note that the storage costs will continue when you pause Synapse. The next step is to either schedule it within your Azure Automation account or to add a Webhook and execute it from another service like Azure Data Factory.

Besides scripting you can also use the Rest API of Synapse, which is particularly easy if you want to pause or resume from within an ADF pipeline, but that will be explored in another blog post.

Sunday 3 May 2020

Use Managed Service Identity for Synapse PolyBase

Case
In a previous PolyBase example we stored the storage account secret in Synapse to read data from the storage account. Is it possible to use a Managed Identity instead of storing secrets in Synapse? However, I cannot find the Managed Identity of my SQL Server.
Azure Synapse Analytics with PolyBase reading Azure Storage Account














Solution
You can use a Managed Identity, but there are two requirements. First, this only works with 'StorageV2 (general purpose v2)'; 'BlobStorage' or 'Storage (general purpose v1)' will not work! Second, you need to register the SQL Server that hosts Synapse in your Azure Active Directory. This will allow you to select your SQL Server within Access control (IAM).

1) Create Storage Account
Create an Azure Storage Account and make sure the type is StorageV2 (general purpose v2). The storage account of this example is called 'bitoolsstorage' and it has a container called 'mycontainer'. You can choose your own names, but these names will be used in the example code.

2) Create Synapse
Create a Synapse Data Warehouse including a SQL Server to host it. Our SQL Server is called 'bitoolssynapseserver' and our Synapse SQL Pool (data warehouse) is called 'synapsedwh'. Again choose your own names and change those in the example code below.
Synapse SQL Pool (data warehouse)












3) Register SQL Server in AD
Next step is to register the SQL Server that hosts your Synapse DWH in the Active Directory. This will allow you to find your SQL Server in the next step as a Managed Identity. At the moment of writing this needs to be done via PowerShell and cannot be done via the portal.

We will be using Cloud Shell (PowerShell in the portal), but you can also use PowerShell (ISE) on your Windows device; then you have to execute two extra commands (login and select subscription).

  • Click on the Cloud Shell icon in the upper right corner (next to the search box). This will start PowerShell in the portal. If this is the first time you use it, you first need to connect it to an Azure Storage Account.
  • Then execute the Set-AzSqlServer command. The first parameter is the resource group where the SQL Server is located, the second parameter is the name of the SQL Server (without .database.windows.net) and the last parameter will assign the Managed Identity.
# PowerShell
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity
Register SQL Server as Managed Identity

If you are using PowerShell on your Windows device instead of Cloud Shell then use this code
# PowerShell
# Login to Azure (popup will appear)
Connect-AzAccount

# Select your subscription
Select-AzSubscription -SubscriptionId "2c67b23a-4ba2-4273-bc82-274a743b43af"

# Assign Managed Identity
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity

4) Storage Blob Data Contributor
Now it's time to give your SQL Server access to the Azure Storage Account. The role we need for this according to the documentation is 'Storage Blob Data Contributor', but I also tested it with 'Storage Blob Data Reader' and that works fine as well (since we are only reading data). Note: you need to be owner of the resource (group) to delegate access to others.
  • Go to your Storage Account from step 1
  • Click on Access control (IAM) in the left menu
  • Click on the + Add icon and choose Add role assignment
  • In the Role drop down select 'Storage Blob Data Contributor'
  • Leave the Assign access to drop down unchanged
  • In the Select box start typing the name of your SQL Server
  • Select your SQL Server and click on the Save button
Delegate Access to the Managed Identity of SQL Server

5) Master Key
We are finished in the Azure portal and now it's time to start with the actual PolyBase code. Start SQL Server Management Studio (SSMS), but make sure your Synapse is not paused.

The first step is to create a master key to encrypt any secrets, but only if you do not already have one (although we will not use any secrets). You can check that in the view sys.symmetric_keys: if a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key; otherwise we need to create one. For Synapse a master key password is optional and for this example we will not use one.
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END 


6) Credentials
Next step is to create a credential which will be used to access the Storage Account. For a Managed Identity you don't use secrets:
--Credential
CREATE DATABASE SCOPED CREDENTIAL bitools_msi
WITH
    IDENTITY = 'Managed Service Identity'
;

Tip:
Give the credential a descriptive name so that you know what it is used for. You can find all credentials in the table sys.database_credentials:
--Find all credentials
SELECT * FROM sys.database_credentials


7) External data source
With the credential from the previous step we will create an External data source that points to the Storage Account and container where your file is located. Execute the code below where:
  • TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
  • LOCATION = the connection string to the container in your Storage Account starting with abfss.
  • CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE bitoolsstorage_abfss
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://mycontainer@bitoolsstorage.dfs.core.windows.net',
    CREDENTIAL = bitools_msi
);

Tip:
Give the external data source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.


8) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip:
Give the format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats

9) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, data types and the filename that you want to read. The filename starts with a forward slash. You also need the data source from step 7 and the file format from step 8.
--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] nvarchar(50) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss, -- from step 7
    FILE_FORMAT=TextFile              -- from step 8
);
Note:
PolyBase does not like column name headers. It will handle the header like a regular data row and throw an error when the data type doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE; however, this only works when the data type of the header is different from the data types of the actual rows. Otherwise you have to filter out the header row in a subsequent step.
--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata2 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);
You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables
However, you can also find the External Table (as well as the External Data Source and the External File Format) in the Object Explorer of SSMS.
SSMS Object Explorer

10) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file then the data in this external table also changes instantly, because the file is read when you query the table.
--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
Querying an external table

Conclusion
In this post you learned how to give the Managed Identity of your SQL Server access to your Storage Account, which saves you the maintenance of secrets. You also learned how to use PolyBase to read files from that Storage Account using the Managed Identity.