Microsoft BI Tools: BLOB

Showing posts with label BLOB_STORAGE. Show all posts

Monday, 7 November 2022

Create Data Lake containers and folders via DevOps

Case
I need a process to create Azure Data Lake containers throughout my DTAP environment of my Azure Data Platform. Manually is not an option because we want to minimize owner and contributor access to the Data Lake of acceptance and production, but Synapse and Data Factory don't have a standard activity to create ADL containers. How to automatically create Azure Data Lake Containers (and folders) ?

Storage Account (datalake) containers

Solution
An option is to use a PowerShell script that is executed by the Custom activity in combination with an Azure Batch service. Or an Azure Automation runbook with the same PowerShell script that is executed by a Web(hook) activity.

However since you probably don't need create new containers during every (ADF/Synapse) pipeline run, we suggest to do this via an Azure Devops Pipeline as part of your CICD proces with the same PowerShell script. You could either create a separte CICD pipeline for it or integrate it in your Synapse or ADF pipeline.

The example below creates containers and optionaly also folders and subfolders within these container. Synapse and Data Factory will create folders with forexample the Copy Data activity

1) Repos folder structure

For this example we use a CICD folder in the repos with subfolders for PowerShell, YAML and Json.

Repos folder structure

2) JSON config

Because we don't want to hardcode the containers and folders we use a JSON file as input for the PowerShell script. This JSON file is stored within the JSON folder of the DevOps Repository. We use the same JSON file for the entire environment, but you can ofcourse create a separate file for each environment if you need for example different containers on production. Our file is called config_storage_account.json

The folder array in this example is optional and when left empty no folders will be created. You can create subfolders within folders by separating them with a forwardslash.

{
"containers":   {
                "dataplatform":["folder1","folder2/subfolder1","folder2/subfolder2"]
                , "SourceX":["Actual","History"]
                , "SourceY":["Actual","History"]
                , "SourceZ":[]
                }
}

3) PowerShell code
The PowerShell script called SetStorageAccounts.ps1 is stored in the PowerShell folder and contains three parameters:

ResourceGroupName - The name of the resource group where the storage account is located.
StorageAccountName - The name the storage account
JsonLocation - The location of the json config file in the repos (see previous step)

It checks the existance of both the config file and the storage account. Then first loop through the containers from the config and within the container loop it loops through the folders of that specific container. For container names and folderpaths it does some small corrections for often made mistakes.

Note that the script will not delete containers and folders (or set authorizations to them). This is of course possible, but make sure to test this very thoroughly and even with testing a human error in configuring the config file is easy to make and could cause lots of data lose!

# This PowerShell will create the containers provided in the JSON file
# It does not delete of update containers and folders or set authorizations
param (
    [Parameter (Mandatory = $true, HelpMessage = 'Resource group name of the storage account.')]
    [ValidateNotNullOrEmpty()]
    [string] $ResourceGroupName,

    [Parameter (Mandatory = $true, HelpMessage = 'Storage account name.')]
    [ValidateNotNullOrEmpty()]
    [string] $StorageAccountName,

    [Parameter (Mandatory = $true, HelpMessage = 'Location of config_storage_account.json on agent.')]
    [ValidateNotNullOrEmpty()]
    [string] $JsonLocation
 )

# Combine path and file name for JSON file. The file name is hardcoded and the
# same for each environment. Create an extra parameters for the filename if
# you need different files/configurations per environment.
$path = Join-Path -Path $JsonLocation -ChildPath "config_storage_account.json"
Write-output "Extracting containernames from $($path)"


# Check existance of file path on the agent
if (Test-Path $path -PathType leaf) {
    
    # Get all container objects from JSON file
    $Config = Get-Content -Raw -Path $path | ConvertFrom-Json

    # Create containers array for looping
    $Config | ForEach-Object { 
        $Containers = $($_.containers) | Get-Member -MemberType NoteProperty | Select-Object -ExpandProperty Name
    }
    
    # Check Storage Account existance and get the context of it
    $StorageCheck = Get-AzStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -ErrorAction SilentlyContinue        
    $context = $StorageCheck.Context

    # If Storage Account found
    if ($StorageCheck) {
        # Storage Account found
        Write-output "Storage account $($StorageAccountName) found"

        # Loop through container array and create containers if the don't exist
        foreach ($container in $containers) {
            # First a little cleanup of the container

            # 1) Change to lowercase
            $container = $container.ToLower()

            # 2) Trim accidental spaces
            $container = $container.Trim()


            # Check if container already exists
            Write-output "Checking existence of container $($container)"
            $ContainerCheck = Get-AzStorageContainer -Context $context -Name $container -ErrorAction SilentlyContinue 

            # If container exists
            if ($ContainerCheck) {
                Write-Output "Container $($container) already exists"
            }
            else {
                Write-Output "Creating container $($container)"
                New-AzStorageContainer -Name $container -Context $context | Out-Null
            }

            # Get container folders from JSON
            Write-Output "Retrieving folders from config"
            $folders = $Config.containers.$container
            Write-Output "Found $($folders.Count) folders in config for container $($container)"

            # Loop through container folders
            foreach ($folder in $folders) {
                # First a little cleanup of the folders

                # 1) Replace backslashes by a forward slash
                $path = $folder.Replace("\","/")

                # 3) Remove unwanted spaces
                $path = $path.Trim()
                $path = $path.Replace("/ ","/")
                $path = $path.Replace(" /","/")

                # 3) Check if path ends with a forward slash
                if (!$path.EndsWith("/")) {
                    $path = $path + "/"
                }
                    
                # Check if folder path exists
                $FolderCheck = Get-AzDataLakeGen2Item -FileSystem $container -Context $context -Path $path  -ErrorAction SilentlyContinue
                if ($FolderCheck) {
                    Write-Output "Path $($folder) exists in container $($container)"
                } else {
                    New-AzDataLakeGen2Item -Context $context -FileSystem $container -Path $path -Directory | Out-Null
                    Write-Output "Path $($folder) created in container $($container)"
                }
            }

        }
    } else {
        # Provided storage account not corrrect
        Write-Output "Storageaccount: $($StorageAccountName) not available, containers not setup."
    }
} else {
    # Path to JSON file incorrect
    Write-output "File $($path) not found, containers not setup."
}

4) YAML file.

If you integrate this in your existing Data Factory or Synapse YAML pipeline then you only need to add one PowerShell step. Make sure you have a checkout step to copy the config and powershell file from the repository to the agent. You may also want to add a (temporary) treeview step to check the paths on your agent. This makes it easier to configure paths within your YAML code.

parameters:
  - name: SerCon
    displayName: Service Connection
    type: string
  - name: Env
    displayName: Environment
    type: string
    values: 
    - DEV
    - ACC
    - PRD
  - name: ResourceGroupName
    displayName:
    type: string
  - name: StorageAccountName
    displayName:
    type: string

jobs:
    - deployment: deploymentjob${{ parameters.Env }}
      displayName: Deployment Job ${{ parameters.Env }} 
      environment: Deploy to ${{ parameters.Env }}

      strategy:
        runOnce:
          deploy:
            steps:
            ###################################
            # 1 Check out repository to agent
            ###################################
            - checkout: self
              displayName: '1 Retrieve Repository'
              clean: true 

            ###################################
            # 3 Show environment and treeview
            ###################################
            - powershell: |
                Write-Output "Deploying Synapse in the ${{ parameters.Env }} environment"
                tree "$(Pipeline.Workspace)" /F
              displayName: '2 Show environment and treeview Pipeline_Workspace'

            ###################################
            # 3 Create containers in datalake
            ###################################
            - task: AzurePowerShell@5
              displayName: '3 Create data lake containers'
              inputs:
                azureSubscription: ${{ parameters.SerCon }}
                scriptType: filePath
                scriptPath: $(Pipeline.Workspace)\s\CICD\PowerShell\SetStorageAccounts.ps1
                scriptArguments:
                  -ResourceGroupName ${{ parameters.ResourceGroupName }} `
                  -StorageAccountName ${{ parameters.StorageAccountName }} `
                  -JsonLocation $(Pipeline.Workspace)\s\CICD\Json\
                azurePowerShellVersion: latestVersion
                pwsh: true

5) The result

Now it's time to run the YAML pipeline and check the Storage Account to see wether the containers and folders are created.

DevOps logs of creating containers and folders

Created data lake folders in container

Conclusion

In this post you learned how to create containers and folders in the Storage Account / Data Lake via a little PowerShell script and a DevOps pipeline, but you can also reuse this PowerShell script in for the mentioned alternative solutions.

Again the note about also deleting containers and folders. Make sure to double check the code, but also the procedures to avoid human errors and potenially loose a lot of data. You might want to setup soft deletes in your storage account to have a fallback scenario for screwups.

Sunday, 3 May 2020

Use Managed Service Identity for Synapse PolyBase

Case
I a previous PolyBase example, to read data from the storage account, we stored the secret in Synapse. Is it possible to use Managed Identity instead of storing secrets in Synapse? However I can not find the Managed Identity of my SQL Server.

Azure Synapse Analytics with PolyBase reading Azure Storage Account

Solution
You can use a Managed Identity, but there are two requirements. First this only works with 'StorageV2 (general purpose v2)'. 'BlobStorage' or 'Storage (general purpose v1)' will not work! Secondly you need to register your SQL Server that hosts Synapse in your Active Directory. This will allow you to select your SQL Server within the Access control (IAM).

1) Create Storage Account
Create an Azure Storage Account and make sure the type is StorageV2 (general purpose v2). The storage account of this example is called 'bitoolsstorage' and it has a container called 'mycontainer'. You can choose your own names, but these names will be used in the example code.

Bijschrift toevoegen

2) Create Synapse
Create a Synapse Data Warehouse including a SQL Server to host it. Our SQL Server is called 'bitoolssynapseserver' and our Synapse SQL Pool (data warehouse) is called 'synapsedwh'. Again choose your own names and change those in the example code below.

Synapse SQL Pool (data warehouse)

3) Register SQL Server in AD
Next step is to register the SQL Server that hosts your Synapse DWH in the Active Directory. This will allow you to find your SQL Server in the next step as a Managed Identity. At the moment of writing this needs to be done via PowerShell and cannot be done via the portal.

We will be using Cloud Shell (PowerShell in the portal), but you can also use PowerShell (ISE) on your Windows device but then you have to execute two extra commands (login and select subscription).

Click on the Cloud Shell icon in the upper right corner (next to the searchbox). This will start PowerShell in the portal. If this is the first time using it you first need to connect it to an Azure Storage Account.
Then execute the Set-AzSqlServer command. The first parameter is the resource group where SQL Server is located. The second parameter is the name of SQL Server (without .database.windows.net) and the last parameter will assign the Managed Identity.

# PowerShell
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity

If you are using PowerShell on your Windows device instead of Cloud Shell then use this code

# PowerShell
# Login to Azure (popup will appear)
Connect-AzAccount

# Select your subscription
Select-AzSubscription -SubscriptionId "2c67b23a-4ba2-4273-bc82-274a743b43af"

# Assign Managed Identity
Set-AzSqlServer -ResourceGroupName "Joost_van_Rossum" -ServerName "bitoolssynapseserver" -AssignIdentity

4) Storage Blob Data Contributor
Now it's time to give your SQL Server access to the Azure Storage Account. The role we need for this according the documentation is 'Storage Blob Data Contributor', but I also tested it with 'Storage Blob Data Reader' and that works fine as well (since we are only reading data). Note: You need to be owner of the resource (group) to delegate access to others.

Go to your Storage Account from step 1
Click on Access control (IAM) in the left menu
Click on the + Add icon and choose Add role assignment
In the Role drop down select 'Storage Blob Data Contributor'
Leave the Assign access to drop down unchanged
In the Select box start typing the name of your SQL Server
Select your SQL Server and click on the Save button

Deligate Access to Managed Identity of SQL Server

5) Master Key
We are finished in the Azure portal and now its time to start with the actual PolyBase code. Start SQL Server Managed Studio (SSMS), but make sure your Synapse is not paused.

First step is to create a master key to encrypt any secrets, but only if you do not already have one (although we will not use any secrets). You can check that in the table sys.symmetric_keys. If a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Otherwise we need to create one. For Synapse a masterkey password is optional. For this example we will not use the password.

--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END

6) Credentials
Next step is to create a credential which will be used to access the Storage Account. For a Managed Identity you don't use secrets:

--Credential
CREATE DATABASE SCOPED CREDENTIAL bitools_msi
WITH
    IDENTITY = 'Managed Service Identity'
;

Tip:
Give the credential a descriptive name so that you know where it is used for. You can find all credentials in the table sys.database_credentials:

--Find all credential
SELECT * FROM sys.database_credentials

7) External data source
With the credential from the previous step we will create an External data source that points to the Storage Account and container where your file is located. Execute the code below where:

TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
LOCATION = the connection string to the container in your Storage Account starting with abfss.
CREDENTIAL = the name of the credentials created in the previous step.

--Create External Data Source
CREATE EXTERNAL DATA SOURCE bitoolsstorage_abfss
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://mycontainer@bitoolsstorage.dfs.core.windows.net',
    CREDENTIAL = bitools_msi
);

Tip:
Give the external source a descriptive name so that you know where it is used for. You can find all external data sources in the table sys.external_data_sources:

--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.

8) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.

--Create External Data Source
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip:
Give the format a descriptive name so that you know where it is used for. You can find all external file formats in the table sys.external_file_formats:

--Find all external file formats
SELECT * FROM sys.external_file_formats

9) External Table
The last step before we can start quering, is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the datasource from step 7 and the file format from step 8.

--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] nvarchar(50) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss, -- from step 7
    FILE_FORMAT=TextFile              -- from step 8
);

Note:
PolyBase does not like columnname headers. It will handle it like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However this only works when the datatype of the header is different than the datatypes of the actual rows. Otherwise you have to filter the header row in a subsequent step.

--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata2 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=bitoolsstorage_abfss,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);

You can find all external tables in the table sys.external_tables.

--Find all external tables
SELECT * FROM sys.external_tables

However you can also find the External Table (/the External Data Source/the External File Format) in the Object Explorer of SSMS.

SSMS Object Explorer

10) Query external table
Now you can query the external table like any other regular table. However the table is read-only so you can not delete, update or insert records. If you update the source file then the data in this external table also changes instantly because the file is used to get the data.

--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;

Quering an external table

Conclusion
In this post you learned how to give the Managed Identity of SQL Server access to your Storage Account. This saves you some maintenance for the secrets. And you learned how to use PolyBase to read files from that Storage Account using the Managed Identity.

Tuesday, 26 November 2019

Archiving within a Blob Storage container

Case
New files are delivered to my Blob Storage container each day. After processing them I want to archive them into folders with a timestamp in the name and after x days delete them. Which Azure resource can I use for this task?

Container with archive folders

Solution
One of the services you could use is an Azure Automation runbook with some PowerShell code. Note that blob storage containers only have virtual folders which means that the foldername is stored in the filename. Microsoft Azure Storage Explorer will show it as if it are real folders.

1) Create Automation Account
First we need to create an Azure Automation account to host the runbook with PowerShell code.

Go to the Azure portal and create a new resource
Search for automation
Select Automation Account
Choose a useful name for the Automation Account
Select your Subscription, Resource Group and the Region
Decide whether you need an Azure Run As account and click on the Create button.
Note: We don't need it for this specific Runbook, because we will use the access key to access the Blob Storage Container. If you intend to add more runbooks to manage other Azure Resources you should probably enable it.

Create Azure Automation Account

2) Add Module Az.Storage
Before we start writing code we need to add a PowerShell module called Az.Storage. This module contains methods we need in our code.

If you forget this step you will get error messages while running your code that state that some of your commands are nog recognized:

New-AzStorageContext : The term 'New-AzStorageContext' is not recognized as the name of a cmdlet, function, script 
file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct 
and try again.

Go to the newly created Azure Automation Account
Go to Modules in the left menu
Click on the Browse Gallery button
Search for Az.Storage
Click on Az.Storage in the result and import this module

Adding a new module to your Automation Account

3) Create Runbook
Now we need to add a PowerShell Runbook to this new Automation Account.

Go to the overview page of your newly created Azure Automation Account
Click on Runbooks in the left menu
Click on the + Create a runbook button to create a new Runbook
Enter a descriptive name for the Runbook
Select PowerShell as Runbook type
Optionally add a description and click on the Create button

Create Runbook

4) Edit Runbook code
Now first look up the Storage Account Name and key (1 or 2) from your storage account that you want to cleanup with this Archive script. In the first example we will hardcode these two parameters in the PowerShell code it self. The second example will show you an alternative.

Lookup Account Name and Key

Next open the new Runbook if it wasn't already opened by the previous step. Copy the code below and paste it in the editor. Then study the code and its comments to understand the code and to make sure we don't steal your data. If you never want to delete files then just remove the cleanup part starting on row 74.

Since the storage account container doesn't have real (sub)folders, we will use virtual folders which means the foldername is actually stored in the filename. Therefore we cannot move files to a subfolder, but we can rename them instead. Unfortunately renaming a blob is not a standard method available in the Azure Storage module. Therefore we will use a custom function from Martin Brandl that makes a copy with the new name and deletes the original.

# PowerShell code

##########################################################################
############################### PARAMETERS ###############################
##########################################################################
$StorageAccountName = "bitools"
$StorageAccountKey = "XTZSqCcF7q43SwX3FAKEgG5ezCC3l5jor5gUajoTWnnk4qFAKEk52UuR3lYqw1eaFAKEbQe3M4CpSbGDVnSCKg=="
$ContainerName = "sensordata"
$DaysHistory = 31


##########################################################################
########################## Rename-AzStorageBlob ##########################
##########################################################################
# Custom function because Rename method does not exist in Storage modules
# Function copied from about-azure.com and adjusted from AzureRM to Az
# https://about-azure.com/2018/02/13/rename-azure-storage-blob-using-powershell/

function Rename-AzStorageBlob
{
    [CmdletBinding()]
    Param
    (
        [Parameter(Mandatory=$true, ValueFromPipeline=$true, Position=0)]
        [Object]$Blob,
        [Parameter(Mandatory=$true, Position=1)]
        [string]$NewName
    )


  Process {
    $blobCopyAction = Start-AzStorageBlobCopy `
        -ICloudBlob $Blob.ICloudBlob `
        -DestBlob $NewName `
        -Context $Blob.Context `
        -DestContainer $Blob.ICloudBlob.Container.Name
 
    $status = $blobCopyAction | Get-AzStorageBlobCopyState
 
    while ($status.Status -ne 'Success')
    {
        $status = $blobCopyAction | Get-AzStorageBlobCopyState
        Start-Sleep -Milliseconds 50
    }
 
    $Blob | Remove-AzStorageBlob -Force
  }
}


##########################################################################
############################### ARCHIVING ################################
##########################################################################
Write-Output "Start archiving $($ContainerName) in $($StorageAccountName)"

# Get 'context' of the source container
$StorageContext = New-AzStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $StorageAccountKey

# Determine prefix of datetime
$Prefix = (get-date).ToString("yyyyMMdd-HHmmssfff")

# Loop through all blobs that are not in a virtual folder
$blobs = Get-AzStorageBlob -Container $ContainerName -Context $StorageContext | where {$_.Name -notlike "*/*" }
foreach ($blob in $blobs)
{
    $NewName = $Prefix + "\" + $blob.name
    Write-Output "Archive $($blob.name) to $($NewName)"
   
    #Rename blob file.csv => 20190226\file.csv
    Get-AzStorageBlob -Container $ContainerName -Context $StorageContext -Blob $blob.name  | Rename-AzStorageBlob -NewName $NewName
}


##########################################################################
################################ CLEANUP #################################
##########################################################################
Write-Output "Start cleanup $($ContainerName) in $($StorageAccountName)"

# Determine date in history based on given days
$HistoryLimit = (get-date).AddDays(-$DaysHistory).ToString("yyyyMMdd-HHmmssfff")

# Get all blobs where the names are at least 19 chararters long
# and from that set get the files that matches 99999999-999999999/
# example: 20191123-171423559/filename.csv
$blobs = Get-AzStorageBlob -Container $ContainerName -Context $StorageContext | where {$_.Name.Length -gt 19} | where {($_.Name).Substring(0,19) -match "^\d{8}[-]\d{9}/$"}

# Loop through the set with blobs that matches our requirements
# and delete those from the container
foreach ($blob in $blobs)
{
    if ( ($blob.name).Substring(0,8) -lt $HistoryLimit)
    {
        Write-Output "Remove old blob $($blob.name)"
        Remove-AzStorageBlob -Context $StorageContext -Blob $blob.name -Container $ContainerName
    }
}

Not very flexible and secure to store your Storage Accountname and Accesskey in the code it self. So in this second example we will replace it by parameters that can be passed from Azure Data Factory. You can find more details in this blog post. Only the parameter part is different and the rest of the code is unchanged.

# PowerShell code

##########################################################################
############################### PARAMETERS ###############################
##########################################################################
Param
(
    # ContainerName is required
    [Parameter(Mandatory=$False,Position=1)]
    [object] $WebhookData
)

# Get all parameters from body (passed from Data Factory Web Activity)
$Parameters = (ConvertFrom-Json -InputObject $WebhookData.RequestBody)

# Get single parameter from set of parameters
$ContainerName = $Parameters.ContainerName

$StorageAccountName = $Parameters.StorageAccountName
$StorageAccountKey = $Parameters.StorageAccountKey
$ContainerName = $Parameters.ContainerName
$DaysHistory = $Parameters.DaysHistory

The parameters that will be provided in Azure Data Factory (ADF) via a JSON message will look like this:

{
"StorageAccountName":"BiTools",
"StorageAccountKey ":"XTZSqCcF7q43SwX3FAKEgG5ezCC3l5jor5gUajoTWnnk4qFAKEk52UuR3lYqw1eaFAKEbQe3M4CpSbGDVnSCKg==",
"ContainerName":"SensorData",
"DaysHistory":"31"
}

Another alternative, for when you don't want to use ADF, is using the Azure Key Vault to store the key and then use some PowerShell code to retrieve it in your runbook. The code is very simple, but see this post for more details.

# PowerShell code snippet
# Retrieve value from Key Vault
$StorageAccountKey = (Get-AzKeyVaultSecret -VaultName "MyKeyVault" -Name "StorageAccountKey").SecretValueText

5) Testing
Testing the functionality of your code is the easiest if you still have the hardcoded parameters. Then you can just use the Test pane in the Runbook editor like the animation below. If you want to test it with the parameters for ADF then you first need to create a webhook and then create and run an ADF pipeline to test your code.

Testing the code

Summary
In this post we explained how to archive files in a container from a Storage Account using PowerShell. We also showed how to pass the parameters like the account key from ADF instead of hard coding it. However we also don't want to store secrets in ADF. In a next post we will show you how to store the Storage account key in an Azure Key Vault and show how to read that secret within ADF to use it as a parameter for this Runbook.

Passing secrets parameters using KeyVault, DataFactory and a Runbook

Thursday, 31 May 2018

Snack: Executing SSIS packages in Azure Logic App

Case
Can I use a Logic App to execute an SSIS package located in the Azure Integration Runtime environment (ADF V2)? Logic App has several triggers that could be interesting as a start moment to execute an SSIS package. For example when a new file is added in a Blob Storage Container, a DropBox, a OneDrive or an (S)FTP folder. How do I do that?

Logic App

Solution
If you want to trigger something to start an SSIS package instead of scheduling it in ADF, then Logic App could be very handy. Of course you can accomplish the same in SSIS with some custom code or perhaps a Third Party component, but Logic App is much easier and probably cheaper as well.

For this example we will start an existing package when a new file is added to a certain Azure Blob Storage container. The Blob Storage Container and the SSIS package in the Integration Runtime environment already exist.

1) Logic App
Create a new Logic App by clicking on the + sign (Create new resource) on the azure portal. It is located under Integration. Give it a descriptive name like "ExecuteSsisWhenBlobFileIsCreated". After the creation of the Logic App choose the Blank Logic App as template to start.

Create new Logic App

2) Azure Blob Storage trigger
For this example we will be using a trigger on a Azure Blob Storage container: "When a blob is added or modified (properties only) (Preview)". Search for Blob and you will find the right trigger. Create a connection to the right Azure Blob Storage (if you already have created one within the same resource group then that one will be reused). Select the correct container and set the interval to your own needs.

Logic App Blob Storage Trigger

3) SQL Server Execute Query
The next step is to create SQL code that executes our package(s). You can easily create the SQL for this in SSMS. Go to your package in the Catalog. Right click it and choose Execute... Now set all options like Logging Level, Environment and 32/64bit. After setting all options hit the Script button instead of the Ok button. This is the code you want to use. You can finetune it with some code to check whether the package finished successfully.

Generating code in SSMS

The code below was generated and finetuned. Copy the code below (or use your own code) to use it in the next step.

-- Variables for execution and error message
DECLARE @execution_id bigint, @err_msg NVARCHAR(150)

-- Create execution and fill @execution_id variable
EXEC [SSISDB].[catalog].[create_execution] @package_name=N'Package.dtsx', @execution_id=@execution_id OUTPUT, @folder_name=N'SSISJoost', @project_name=N'MyAzureProject', @use32bitruntime=False, @reference_id=Null, @useanyworker=True, @runinscaleout=True

-- Set logging level: 0=None, 1=Basic, 2=Performance, 3=Verbose 
EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id,  @object_type=50, @parameter_name=N'LOGGING_LEVEL', @parameter_value=1 

-- Set synchonized option 0=A-SYNCHRONIZED, 1=SYNCHRONIZED 
-- A-SYNCHRONIZED: don't wait for the result
EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id,  @object_type=50, @parameter_name=N'SYNCHRONIZED',  @parameter_value=1 

-- Execute the package with parameters from above
EXEC [SSISDB].[catalog].[start_execution] @execution_id, @retry_count=0

-- Check if the package executed succesfully (only for SYNCHRONIZED execution)
IF(SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id=@execution_id)<>7
BEGIN
 SET @err_msg=N'Your package execution did not succeed for execution ID: ' + CAST(@execution_id AS NVARCHAR(20))
 RAISERROR(@err_msg,15,1)
END

Now back to the Logic App. Add a new action called SQL Server Execute Query (not stored procedure) and create a connection to the SSISDB where your packages are located. Paste your code to execute your package in the query field of this new action.

Action SQL Server Execute Query

Note: When creating your SSISDB in ADF, make sure the option 'Allow Azure services to access' is turned on.

4) Testing
Save your Logic App and add a new file to the selected Blob Storage Container. Then watch the Runs history of the Logic App and the Execution Report in the Integration Services Catalog to view the result.

Summary
This post explains how to execute an SSIS package with a trigger in Logic App instead of scheduling it in Azure Data Factory. If you are using the Stored Procedure Activity in ADF to execute SSIS packages then you can reuse this code. In a next post we will see an alternative for Logic App.

Note: steps to turn your Integration Runtime off or on can be added with an Azure Automation action.

Sunday, 20 August 2017

Use PolyBase to read Blob Storage in Azure SQL DW

Case
I have a file in an Azure Blob Storage container which I want to use in my Azure SQL Data Warehouse. How can I push the content of that file to Azure SQL DW?

One of the options: PolyBase

Solution
You could of course use an ETL product or Azure Data Factory, but you can also use PolyBase technology in Azure SQL DW and use that file as an external table. The data stays in the Azure Blob Storage file, but you can query the data like a regular table.

Starting position
Starting position is a file in an Azure Blob Storage container. This file was created with U-SQL in an other post to quickly process large amounts of files in Azure.

The content of the CSV file

Before we start, make sure your Azure SQL Data Warehouse is started and use SQL Server Management Studio (SSMS) to connect to your Data Warehouse. Notice that the icon of a SQL DW is different than SQL DB.

Icon SQL DW vs SQL DB

1) Master key

In the next step we will use a credential that points to the Azure Blob Storage. To encrypt that credential, we first need to create a master key in our Azure SQL Data Warehouse, but only if you do not already have one. You can check that in the table sys.symmetric_keys. If a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Else we need to create one. For Azure SQL Data Warehouse a password for that master key is optional. For this example we will not use the password.

--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END

2) Credentials
Next step is to create a credential which will be used to access the Azure Blob Storage. Go to the Azure portal and find the Storage Account that contains your blob file. Then go to the Access keys page and copy the key1 (or key2).

Access keys

Then execute the following code where IDENTITY contains a random string and SECRET contains the copied key from your Azure Storage account.

--Credential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = 'JGadV/tAt1npuNwkiH9HnI/wosi8YS********=='
;

Tip: give the credential a descriptive name so that you know where it is used for. You can find all credentials in the table sys.database_credentials:

--Find all credential
SELECT * FROM sys.database_credentials

3) External data source
With the credential from the previous step we will create an External data source that points to the Azure Blob Storage container where your file is located. Execute the code below where:

TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
LOCATION = the connection string to the container (replace [ContainerName] with the name of the container and [StorageAccountName] with the name of your storage account).
CREDENTIAL = the name of the credentials created in the previous step.

--Create External Data Source
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://[ContainerName]@[StorageAccountName].blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

Tip: give the external source a descriptive name so that you know where it is used for. You can find all external data sources in the table sys.external_data_sources:

--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same container as External Tables.

Filename not in External Data Source

4) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.

--Create External Data Source
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip: give the format a descriptive name so that you know where it is used for. You can find all external file formats in the table sys.external_file_formats:

--Find all external file formats
SELECT * FROM sys.external_file_formats

5) External Table
The last step before we can start quering, is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the datasource from step 3 and the file format from step 4.

--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureStorage, -- from step 3
    FILE_FORMAT=TextFile      -- from step 4
);

Note: PolyBase does not like columnname headers. It will handle it like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However this only works when the datatype of the header is different than the datatypes of the actual rows. Else you have to filter the header row in a subsequent step.

--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata5 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureStorage,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);

You can find all external tables in the table sys.external_tables.

--Find all external tables
SELECT * FROM sys.external_tables

However you can also find the External Table (/the External Data Source/the External File Format) in the Object Explorer of SSMS.

SSMS Object Explorer

6) Query external table
Now you can query the external table like any other regular table. However the table is read-only so you can not delete, update or insert records. If you update the source file then the data in this external table also changes instantly because the file is used to get the data.

--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;

Quering an external table

7) What is next?
Most likely you will be using a CTAS query (Create Table As Select) to copy and transform the data to an other table since this is the fasted/preferred way in SQL DW. In a subsequent post we will explain more about CTAS, but here is how a CTAS query looks like.

--CTAS
CREATE TABLE [dbo].[Buildings]
WITH
(
    DISTRIBUTION = ROUND_ROBIN
,   CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
FROM    [dbo].[sensordata]
;

In some cases you could also use an SELECT INTO query as an alternative for CTAS.

Summary
In this post you saw how easy it was to read a file from the Azure Blob Storage and use it as a table in Azure SQL Data Warehouse. The big advantage of PolyBase is that you only have one copy of the data because the data stays in the file. In a next post we will see how to read the same file from the Azure Data Lake Store which does not use the Access keys.
In an other post we will explain the basic usage of the CTAS query which is the preferred way to handle large sets of data in Azure SQL DW and in its on-premises precursor APS (a.k.a. PDW).

Wednesday, 23 November 2016

Azure - Understanding Stream Analytics Blob Destination

Case

How do you use Azure Blob Storage to store stream analytics data and why would you do that?

Solution
In earlier posts we have set up an IoT environment with an IoT Hub and a couple of Stream Analytics Jobs where sensor data is sent to different destinations: Power BI for a real-time dashboard and Azure SQL Database to store the data. In this case we create a new Stream Analytics Job with Blob Storage as output and we use the sensor data as input. The Input and Output are different, but the query is the same as other Stream Analytics Jobs. Configuring the query of the Stream Analytics Job can you find here.

Reasons to use Blob Storage is the diversity of storing in the cloud of text or binary files. The unstructured data can include documents, social data (photos, videos, music and blogs), Big Data (logs, IoT and large datasets) or images and text for web applications. Click here for more pricing details about Blob Storage. You can also store (sensor) data in an Azure Data Lake. Click here for more information about this, along with the major differences between Blob Storage and Data Lake.

For the use of Azure Blob storage you need a Storage account. Then you can add containers to this account, which include the Blob files. In the case of an image, you also find a metadata file. In our case the Blob files are JSON files and contain sensor data.

Overview Azure Blob Storage with images

1) Create a Storage account
To make use of Blob Storage you have to create a Storage account first. After we give it a suitable name, we choose Blob storage and 'RA-GRS' as Replication. This is the default and it contains the most options. Click here for more information about this. Next we choose 'Hot' by Access tier, because we want access the sensor data frequently.

Azure Portal - Create a Storage account

Note:
You can also create a Storage account when setting up the Output of the Stream Analytics Job, but you have less options so it is not recommended.

2) Create the Stream Analytics Job

Before creating the new job, we had already add a new consumer group to our IoT Hub. We called it 'blob'. Using multiple consumer groups makes it possible for several consumer applications to read data from this IoT Hub independently. Click here to see where you can add/manage consumer group(s).

Now we can create the job. We choose the same Resource group as the IoT Hub and Storage account, because these are in the same life cycle. Our Location is the Netherlands, so we choose West-Europe.

Azure Portal - Create Stream Analytics Job

3) Configure the Stream Analytics Job

First we must add a new Input to the job. The default Source Type is 'Data stream'. We choose this because the sensor data is an ongoing stream and is derived from the IoT Hub. The Source is 'IoT hub' and then the IoT Hub that you have created automatically appears. If you have more then one IoT Hub, you can choose one from the drop-down list. After this you must choose the right Consumer group. This is the new group (blob) we have created earlier. Finally you choose 'JSON' as Event serialization format.

Next we add a new Output. First choose 'Blob storage' in Sink and 'Use blob storage from current subscription' as Subscription, because you have configured the storage account earlier. Otherwise you can edit the storage account settings here by choosing 'Provide blob storage settings manually'. Then you create a new container. Optionally, you can define one or more instances (subfolders) within the container. With this option you can change the date and time format so you can have multiple instances including different dates and/or time folders. This makes it more clear (just like your own local File Explorer) and you have the choice to select data from a specific day/time. We made a instance called 'sensor'. At last you will choose the 'JSON' format.

As we said, we discuss only the configuration of the Input and Output of the job in this post. After configured the job we will run the job with a valid query.

Azure Portal - Configure the Stream Analytics Job for Blob

Result

Now the data is stored, we want to see what's in our Blob. Therefore you have to go to your Azure Storage account in the portal. Every object, in our case a Blob, that you store in Azure Storage has a unique URL address. For the Blob service with the storage account name (bitoolsblobstorage) you have created and the container name (sensordata) with the instance (sensor) the URL/endpoint is:

http://bitoolsblobstorage.blob.core.windows.net/sensordata/sensor

More information about the Azure Storage endpoints here.

In the portal go to the Storage account you have made earlier. Click on the container URL and then you see the instance (subfolder). Now you can drill down further on the specific month, day and hour of the incoming sensor data.

Azure Portal - Your Blob file

Conclusion

It looks a lot like the other Stream Analytics posts, but in this case the data is stored in a Blob file. From this point you have several options to do something with this data. First you can do nothing off course and in that case you use Blob purely for storage (backup). You will find the other options in the Cortana Intelligence Suite. For example process the data with Azure Data Factory (this can also be done with traditional SSIS, which is off course not a part of the CIS), analyse the data with Machine Learning or visualize the data in Power BI.