Sunday, 3 May 2020

Use Managed Service Identity for Synapse PolyBase

Case
In a previous PolyBase example we stored the secret of the storage account in Synapse to read its data. Is it possible to use a Managed Identity instead of storing secrets in Synapse? However, I cannot find the Managed Identity of my SQL Server.
[Image: Azure Synapse Analytics with PolyBase reading Azure Storage Account]

Solution
You can use a Managed Identity, but there are two requirements. First, this only works with 'StorageV2 (general purpose v2)'; 'BlobStorage' or 'Storage (general purpose v1)' will not work! Secondly, you need to register the SQL Server that hosts Synapse in your Azure Active Directory. This will allow you to select your SQL Server within the Access control (IAM) of the storage account.

1) Create Storage Account
Create an Azure Storage Account and make sure the type is StorageV2 (general purpose v2). The storage account of this example is called 'bitoolsstorage' and it has a container called 'mycontainer'. You can choose your own names, but these names will be used in the example code.
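
If you prefer scripting over clicking, a PowerShell sketch to create this account could look like the code below. The resource group, location and SKU are assumptions for this example, and the hierarchical namespace is enabled because we will use the abfss/dfs endpoint later on.
# PowerShell
# Sketch: create a StorageV2 (general purpose v2) account and the container
# (assumed resource group, location and SKU)
$storageAccount = New-AzStorageAccount -ResourceGroupName "Joost_van_Rossum" `
                                       -Name "bitoolsstorage" `
                                       -Location "West Europe" `
                                       -SkuName "Standard_LRS" `
                                       -Kind "StorageV2" `
                                       -EnableHierarchicalNamespace $true

# Create the container in the new storage account
New-AzStorageContainer -Name "mycontainer" -Context $storageAccount.Context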
2) Create Synapse
Create a Synapse Data Warehouse including a SQL Server to host it. Our SQL Server is called 'bitoolssynapseserver' and our Synapse SQL Pool (data warehouse) is called 'synapsedwh'. Again choose your own names and change those in the example code below.
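
If you want to script this step as well, a PowerShell sketch could look like the code below. The location, the admin credentials and the DW100c performance level are assumptions for this example.
# PowerShell
# Sketch: create the logical SQL Server and the Synapse SQL Pool
# (assumed location, admin credentials and performance level)
$adminCredential = Get-Credential -Message "SQL admin login and password"

New-AzSqlServer -ResourceGroupName "Joost_van_Rossum" `
                -ServerName "bitoolssynapseserver" `
                -Location "West Europe" `
                -SqlAdministratorCredentials $adminCredential

New-AzSqlDatabase -ResourceGroupName "Joost_van_Rossum" `
                  -ServerName "bitoolssynapseserver" `
                  -DatabaseName "synapsedwh" `
                  -Edition "DataWarehouse" `
                  -RequestedServiceObjectiveName "DW100c"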
[Image: Synapse SQL Pool (data warehouse)]

3) Register SQL Server in AD
The next step is to register the SQL Server that hosts your Synapse DWH in the Azure Active Directory. This will allow you to find your SQL Server as a Managed Identity in the next step. At the moment of writing this needs to be done via PowerShell and cannot be done via the portal.

We will be using Cloud Shell (PowerShell in the portal). You can also use PowerShell (ISE) on your Windows device, but then you have to execute two extra commands (login and select subscription).

  • Click on the Cloud Shell icon in the upper right corner (next to the search box). This will start PowerShell in the portal. If this is the first time you use it, you need to connect it to an Azure Storage Account first.
  • Then execute the Set-AzSqlServer command. The first parameter is the resource group where the SQL Server is located, the second parameter is the name of the SQL Server (without .database.windows.net) and the last parameter will assign the Managed Identity.
# PowerShell
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity
[Image: Register SQL Server as Managed Identity]

If you are using PowerShell on your Windows device instead of Cloud Shell, then use this code:
# PowerShell
# Login to Azure (popup will appear)
Connect-AzAccount

# Select your subscription
Select-AzSubscription -SubscriptionId "2c67b23a-4ba2-4273-bc82-274a743b43af"

# Assign Managed Identity
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity

4) Storage Blob Data Contributor
Now it's time to give your SQL Server access to the Azure Storage Account. The role we need for this according to the documentation is 'Storage Blob Data Contributor', but I also tested it with 'Storage Blob Data Reader' and that works fine as well (since we are only reading data). Note: you need to be owner of the resource (group) to delegate access to others. A PowerShell alternative for these portal steps is sketched after this list.
  • Go to your Storage Account from step 1
  • Click on Access control (IAM) in the left menu
  • Click on the + Add icon and choose Add role assignment
  • In the Role drop down select 'Storage Blob Data Contributor'
  • Leave the Assign access to drop down unchanged
  • In the Select box start typing the name of your SQL Server
  • Select your SQL Server and click on the Save button
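
If you rather script this role assignment, a PowerShell sketch could look like the code below. It assumes the Az.Resources and Az.Storage modules and reuses the names from the previous steps.
# PowerShell
# Sketch: assign the role to the Managed Identity of the SQL Server
# Get the Managed Identity (principal id) of the SQL Server from step 3
$sqlServer = Get-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver"

# Get the resource id of the storage account from step 1 to use as the scope
$storageAccount = Get-AzStorageAccount -ResourceGroupName "Joost_van_Rossum" -Name "bitoolsstorage"

# Assign the 'Storage Blob Data Contributor' role on the storage account
New-AzRoleAssignment -ObjectId $sqlServer.Identity.PrincipalId `
                     -RoleDefinitionName "Storage Blob Data Contributor" `
                     -Scope $storageAccount.Id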
[Image: Delegate Access to Managed Identity of SQL Server]

5) Master Key
We are finished in the Azure portal and now it's time to start with the actual PolyBase code. Start SQL Server Management Studio (SSMS), but make sure your Synapse SQL Pool is not paused.

The first step is to create a master key to encrypt any secrets, but only if you do not already have one (although we will not use any secrets). You can check that in the table sys.symmetric_keys. If a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Otherwise we need to create one. For Synapse a master key password is optional. For this example we will not use the password.
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END 


6) Credentials
Next step is to create a credential which will be used to access the Storage Account. For a Managed Identity you don't use secrets:
--Credential
CREATE DATABASE SCOPED CREDENTIAL bitools_msi
WITH
    IDENTITY = 'Managed Service Identity'
;

Tip:
Give the credential a descriptive name so that you know what it is used for. You can find all database scoped credentials in the table sys.database_scoped_credentials:
--Find all credentials
SELECT * FROM sys.database_scoped_credentials


7) External data source
With the credential from the previous step we will create an External data source that points to the Storage Account and container where your file is located. Execute the code below where:
  • TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
  • LOCATION = the connection string to the container in your Storage Account starting with abfss.
  • CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE bitoolsstorage_abfss
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://mycontainer@bitoolsstorage.dfs.core.windows.net',
    CREDENTIAL = bitools_msi
);

Tip:
Give the external data source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.


8) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip:
Give the file format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats

9) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, data types and the filename that you want to read. The filename starts with a forward slash. You also need the data source from step 7 and the file format from step 8.
--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] nvarchar(50) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss, -- from step 7
    FILE_FORMAT=TextFile              -- from step 8
);
Note:
PolyBase does not like column name headers. It will handle the header like a regular data row and throw an error when the data type doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However this only works when the data type of the header is different from the data types of the actual rows. Otherwise you have to filter out the header row in a subsequent step.
--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata2 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);
You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables
However, you can also find the External Table (as well as the External Data Source and the External File Format) in the Object Explorer of SSMS.
[Image: SSMS Object Explorer]

10) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file then the data in this external table changes instantly as well, because the data is read from the file at query time.
--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
[Image: Querying an external table]

Conclusion
In this post you learned how to give the Managed Identity of your SQL Server access to your Storage Account, which saves you the maintenance of secrets. You also learned how to use PolyBase to read files from that Storage Account using the Managed Identity.




Friday, 1 May 2020

Azure Data Factory - Use Key Vault Secret in pipeline

Case
I want to use secrets from Azure Key Vault in my Azure Data Factory (ADF) pipeline, but only certain properties of the Linked Services can be filled by secrets of a Key Vault. Is it possible to retrieve Key Vault secrets in the ADF pipeline itself?

[Image: Using Azure Key Vault for Azure Data Factory]

Solution
Yes, the easiest way is to use a Web activity with a REST API call to retrieve the secret from Key Vault. The documentation is a little limited and only shows how to retrieve a specific version of the secret via the secret identifier (a GUID). This is not a workable solution for two reasons:
  1. The GUID changes each time you change the secret. In probably 99% of the cases you just want to get the latest version of the secret. This means you would also need to change that GUID in ADF each time you change a secret.
  2. The GUID differs for each environment of your Key Vault (dev, test, acceptance and production). This makes it hard to use this solution in a multi-ADF environment.
For this example we will misuse Key Vault a little bit as a configuration table and retrieve the REST API URL of a database from the Key Vault. The example assumes you already have a Key Vault filled with secrets. If you don't have that then execute the first two steps of this post.

1) Access policies
The first step is to give ADF access to the Key Vault to read its content. You can now find ADF by its name, so you don't have to search for its Managed Identity GUID, but using that GUID is also still possible. A PowerShell alternative for these portal steps is sketched after this list.
  • Go to Access policies in the left menu of your Key Vault
  • Click on the blue + Add Access Policy link
  • Leave Configure from template empty
  • Leave Key permissions unselected (we will only use a Secret for this example)
  • Select Get for Secret permissions
  • Leave Certificate permissions unselected (we will only use a Secret for this example)
  • Click on the field of Select principal to find the name of your Azure Data Factory
  • Leave Authorized application unchanged
  • Click on Add and a new Application will appear in the list of Current Access Policies
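
If you prefer scripting, the same Get permission on secrets can be granted with a PowerShell sketch like the one below. The Key Vault name 'bitoolsvault', the Data Factory name 'bitoolsadf' and the resource group are placeholder names for this example; replace them with your own.
# PowerShell
# Sketch: grant the Data Factory Managed Identity Get permission on secrets
# (placeholder names, assumes the Az.KeyVault and Az.DataFactory modules)
$adf = Get-AzDataFactoryV2 -ResourceGroupName "Joost_van_Rossum" -Name "bitoolsadf"

Set-AzKeyVaultAccessPolicy -VaultName "bitoolsvault" `
                           -ObjectId $adf.Identity.PrincipalId `
                           -PermissionsToSecrets Get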
[Image: Add Access policy]

Note: for this specific example we do not need to create a Key Vault Linked Service in ADF.

2) Determine URL of secret
To retrieve a secret we need the REST API URL of that secret. This URL is constructed as follows:
https://{Name Keyvault}.vault.azure.net/secrets/{SecretName}?api-version=7.0

{Name Keyvault} : the name of the Key Vault you are using
{SecretName} : the name of the secret

In this example the secret name is "SQLServerURL" and the URL should look like this: https://{Name Keyvault}.vault.azure.net/secrets/SQLServerURL?api-version=7.0
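
To see what the Web activity of the next step will do behind the scenes, you can test this REST call yourself with a short PowerShell sketch. This is only a test under your own Azure login (ADF will use its Managed Identity instead), it assumes the Az.Accounts module, and you need to replace {Name Keyvault} with your own Key Vault name.
# PowerShell
# Request a token for Key Vault (here with your own login; ADF uses its Managed Identity)
$token = (Get-AzAccessToken -ResourceUrl "https://vault.azure.net").Token

# Call the secret URL from step 2
$response = Invoke-RestMethod -Method Get `
    -Uri "https://{Name Keyvault}.vault.azure.net/secrets/SQLServerURL?api-version=7.0" `
    -Headers @{ Authorization = "Bearer $token" }

# The secret itself is returned in the 'value' property of the JSON response
$response.value

The JSON response contains the secret in its value property, which is why the expression in step 4 ends with .output.value.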
[Image: Get the SecretName from Key Vault]

3) Web activity
Next we have to add a Web activity to the ADF pipeline. Use the following settings in the Settings tab.
  • Set URL to https://{Name Keyvault}.vault.azure.net/secrets/SQLServerURL?api-version=7.0
  • Set Method to Get
  • Under Advanced select MSI
  • And set the resource to https://vault.azure.net

[Image: Configuring the Web activity to retrieve the secret]

4) Retrieve value
Now we want to use the secret from the Key Vault in a subsequent activity, in this case another Web activity to upscale a database. In the URL property of this activity we now use the output value from the previous Web activity:
@activity('GetURL').output.value
[Image: Retrieve output value via expression]

5) The result
To check the result of the changes we need to execute the pipeline.
[Image: Execute the pipeline to check the result]

Note: if you are using this to retrieve real secrets like passwords and you don't want them to show up in the logging of Azure Data Factory then check the Secure output property on the general tab of your activity.
[Image: Don't show secret in logging]

Conclusion
In this blog post you learned how to retrieve Key Vault secrets in ADF. The trick is to retrieve them by their name instead of by their version GUID. This will always give you the latest version and allows you to use this construction in multiple environments.

Update: ADF now supports Global parameters to store parameters that can be used by all pipelines

Thursday, 30 April 2020

Azure Data Factory - Parameters event based triggers

Case
My files arrive at various moments during the day and they need to be processed immediately on arrival in the blob storage container. At the moment each file has its own pipeline with its own event based trigger. Is there a more sustainable way where I don't have to create a new pipeline for each new file?
[Image: Event based triggers can pass through parameters]

Solution
Yes, if your source files have names that can easily be matched to table names then it is very easy and you only need a couple of parameters and a single Copy Data activity. If you cannot match a source file by its name then you might want to add a Lookup activity that can find the match in a configuration table. Or if unexpected files can also arrive in that same blob storage container then you also might want to add the Lookup activity to check whether the file is expected.

The solution uses parameters that can be filled by the event based trigger. With a simple expression you can pass on the filename and folderpath to the pipeline.

1) Pipeline parameters
Start with a simple pipeline that only contains a single Copy Data activity that copies a specific file to a specific SQL Server table. Then add two parameters to this pipeline, one for the filename and one for the folder path: SourceFileName and SourceFolderPath. The folder path parameter is optional for this basic example and will contain the container name and the folder name combined: "container/folder".
[Image: Add 2 pipeline parameters]

2) Dataset sourcefile
Go to the Dataset of your source file from the blob storage container. In the Parameters tab add a new string parameter called FileName. As default value you can use the name of the file. Optionally you can add a second string parameter for the folder path.
[Image: Add parameter FileName to dataset]

The next step is to use this newly created dataset parameter in the connection to the file. Go to the Connection tab of your dataset and replace the filename with some dynamic content: @dataset().FileName
[Image: Add dynamic content for filename]

If you want to use the folder path parameter as well, then you can add it to the Container textbox and leave the Directory textbox empty (the container name is mandatory, but after the container name you can also add the folder path).

3) Dataset stage table
Now repeat the same actions for the dataset that points to your stage table. First add a parameter called TableName in the Parameters tab.
[Image: Add parameter TableName to dataset]

And then replace the Table property with some dynamic content: @dataset().TableName (check the Edit checkbox below the table before adding the expression).
[Image: Use dataset parameter to configure the tablename]

4) Copy Data activity - Source
Now go back to your pipeline and edit the Source of your Copy Data activity. Notice that you now see a new Dataset property called FileName that points to the parameter of the dataset. This is where you fill the dataset parameter with the value of the pipeline parameter. The value should be set with some dynamic content: @pipeline().parameters.SourceFileName. This is the pipeline parameter created in step 1.
[Image: Copy Data activity - Dynamic content in FileName]

5) Copy Data activity - Sink
Now repeat the same for the Sink, where you now see a new Dataset property called TableName. This is where you need to match the filename to the table name. In our example the table name is equal to the filename without the extension. Therefore we removed the extension with the replace expression: @replace(pipeline().parameters.SourceFileName,'.csv', '')
[Image: Copy Data activity - Dynamic content in TableName]


6) Edit Trigger
The last step of the solution is passing the filename and folder path from the trigger to the pipeline parameters with a triggerBody() expression, which you also see in Azure Logic Apps and Power Automate:
Returns the filename of the file that caused the trigger:
@triggerBody().fileName
Returns the folderpath of the file that caused the trigger:
@triggerBody().folderPath

When you edit or create a new trigger for the pipeline, the last step is filling in the pipeline parameters. By using the above expressions they are automatically filled with information about the file that caused the trigger.
[Image: triggerBody() expressions]

When two or more files are uploaded, each of them will trigger the same pipeline separately, but with different values in the pipeline parameters.

7) Testing
After publishing all changes we can upload several files, one at a time or multiple at once, to test the new trigger pipeline. The last two rows indicate that the files were uploaded at the same time, but both triggered the pipeline separately. You can check the parameters by clicking on [@].
[Image: Monitoring the pipeline runs]

Conclusion
In this post you learned how to use the triggerBody() expression of event based triggers. You can pass its values to the pipeline parameters and then pass those pipeline parameters on to the dataset parameters.

With this very basic construction you can handle as many files as you like. If you do not need to process them the moment they arrive, then we suggest using the ForEach loop construction instead.

Friday, 24 April 2020

Power Apps Snack: Don't repeat yourself

Case
I have some pieces of code in my Power App that I use for several buttons, but I don't want to create multiple copies of that code. Is there a way to create a method or function with custom code that I can call from various buttons?
[Image: Power Apps and custom functions]

Solution
No, Power Apps does not support methods or functions like real programming languages do. However, you can use the Select function. This function allows you to execute code from other objects on your screen, but only from objects that have an OnSelect event.

Example 1: basics
For this example add two buttons to the screen: Button1 and Button2. In the OnSelect of Button1 add a simple Notify expression:
Notify(
    "Hello",
    NotificationType.Success,
    1000
)
[Image: Notify in OnSelect of Button1]

In the OnSelect of Button2 add the Select expression:
Select(Button1)
[Image: Select in OnSelect of Button2]

The Result
Now hit the play button to see the preview of your app. Then click both buttons and see the notification showing for both buttons. You can even make Button1 invisible if you don't want to execute that code by itself, but only via other buttons.
[Image: Testing simple Select solution]

Example 2: parameters
By adding variables to the game you can even have some parameters for your new 'function'. Now change the OnSelect code of Button2 to:
UpdateContext({myParam: "Joost"});
Select(Button1)

And then add a third (and optionally a fourth) button with the following OnSelect code:
UpdateContext({myParam: "Mark"});
Select(Button1)

Then change the OnSelect code of Button1 to:
Notify(
    "Hello " & myParam,
    NotificationType.Success,
    1000
)
And make Button1 invisible by setting the Visible property to false.

The Result
Now hit the play button again to see the preview of your changes. Click both visible buttons and see the notification that now shows a different name for each button.
[Image: Testing Select with 'parameter']

Conclusion
In this post you learned how NOT to repeat yourself with the Select function in Power Apps. Perhaps not the same features as in JavaScript, C# or any other language, but very useful to keep your code cleaner. By hiding the buttons and giving them a descriptive, function-like name, they can act like real programming functions.

Another great trick is to use the With function, which is explained in another post.