Sunday 5 December 2021

Delta Lake support in Azure Synapse Analytics

Case
Delta Lake is already widely used in Azure Data Bricks, but now it is also available in Synapse Analytics. How can I use it there to store history and do 'time travel' in my historical data?
Synapse now supports Delta Lake











Solution
Delta Lake is now so called General Available (GA) in Synapse Analytics, but at the time of writing Microsoft is still implementing new Delta Lake features in Synapse. 

This example uses a Synapse Pipeline with a Copy Data Activity to ingest data from the source and then calls a Python Delta Lake script (other languages are possible) via the Notebook activity.












Prerequisites
You need to make sure that you (for debugging) and the Managed Service Identity (MSI) of your Synapse Analytics workspace have access to the Azure Data Lake with the Role Storage Blob Data Contributor.
  • In the Azure Portal go to the Storage Account used by the Synapse Analytics workspace
  • In the left menu click on Access Control (IAM)
  • Click on + Add and choose Add role assignment
  • Search for Storage Blob Data Contributor, select the role and click on Next
  • Click on + Select members and find your Synapse workspace and find yourself and click Select
  • Optionally add an description about the why. Then click on Review + assign (twice)
At time of writing our Apache Spark Pool uses version 3.1 with Delta Lake 1.0. If you are using an older version (2.4) of Spark then you get Delta Lake version 0.6 which is slightly different. If newer versions appear then just try the newest Spark Pool.
Apache Spark pool Additional Settings




















1) Code cell 1: parameters
The first code cell is for the parameters that can be overridden by parameters from the Notebook activity in the pipeline. For more details see our post about notebook parameters. For debugging within the notebook we used real values.
# path of the data lake container
data_lake_container = 'abfss://yourbronzecontainer@yourdatalake.dfs.core.windows.net'

# The ingestion folder where your parquet file are located
ingest_folder = 'parquetstage'

# The bronze folder where your Delta Tables will be stored
bronze_folder = 'bronze'

# The name of the table
table_name = 'residences'

# The wildcard filter used within the bronze folder to find files
source_wildcard = 'residences*.parquet'

# A comma separated string of one or more key columns (for the merge)
key_columns_str = 'Id'
Parameters









2) Code cell 2: import modules and functions
The second code cell is for importing all required/useful modules. For this basic example we two import s:
  • DeltaTable.delta.tables for handling delta tables
  • notebookutils for file system utilities (removing delta table folder)
# Import modules
from delta.tables import DeltaTable
from notebookutils import mssparkutils
Imports







3) Code cell 3: filling delta lake
Now the actual code for filling the delta lake tables with parquet files from the data lake. Note: code is very basic. It checks whether the Delta Lake table already exists. If not it creates the Delta Lake table and if it already exists it merges the new data into the existing table. If you have transactional data then you could also do an append instead of a merge.

# Convert comma separated string with keys to array
key_columns = key_columns_str.split(',')  
 
# Convert array with keys to where-clause for merge statement
conditions_list = [f"existing.{key}=updates.{key}" for key in key_columns]
 
# Determine path of source files from ingest layer
source_path = os.path.join(data_lake_container_bronze, ingest_folder, source_wildcard)
 
# Determine path of Delta Lake Table 
delta_table_path = os.path.join(data_lake_container_bronze, bronze_folder, table_name)

# Read file(s) in spark data frame
sdf = spark.read.format('parquet').option("recursiveFileLookup", "true").load(source_path)
 
# Check if the Delta Table exists
if (DeltaTable.isDeltaTable(spark, delta_table_path)):
    print('Existing delta table')
    # Read the existing Delta Table
    delta_table = DeltaTable.forPath(spark, delta_table_path)
 
    # Merge new data into existing table
    delta_table.alias("existing").merge(
        source = sdf.alias("updates"),
        condition = " AND ".join(conditions_list)
         
    ).whenMatchedUpdateAll(
    ).whenNotMatchedInsertAll(
    ).execute()
 
    # For transactions you could do an append instead of a merge
    # sdf.write.format('delta').mode('append').save(delta_table_path)
 
else:
    print('New delta table')
    # Create new delta table with new data
    sdf.write.format('delta').save(delta_table_path)
Adding file to Delta Lake
















4) Viewing the Delta Table in notebook
If you run the notebook with the code of the first three steps a couple of times with changed/extra/less records then history will be build in the delta table. For debugging purposes you can add an extra code cell to view the data and the various versions of the data.

To check the current version of the data you can use the following code:
display(spark.read.format('delta').load(delta_table_path))
Get current version of data













And with this code you can investigage the history versions of the data. In this case there are two versions:
# Get all versions
delta_table = DeltaTable.forPath(spark, delta_table_path)
display(delta_table.history())
Get versions of data








To retrieve one specific version you could use something like this (where the 0 is the version from the above picture):
# Get one specific version
display(spark.read.format("delta").option("versionAsOf", "0").load(delta_table_path))
Get specific version of data












You can also use a datetime to retrieve data from the Delta Lake by using timestampAsOf instead of versionAsOf:
# Get one specific version with timestamp filter
display(spark.read.format("delta").option("timestampAsOf", "2021-12-05 19:07:00.000").load(delta_table_path))
Get specific version of data with datetime filter













To remove the entire Delta Lake table (and all history) you could use something like:
# Delete Delta Table (folder)
mssparkutils.fs.rm(delta_table_path, recurse=True)
Delete Delta Table







4) Viewing the Delta Table in Serverless SQL Pool
At the moment of writing you can query the Detla Lake in a Serverless SQL Pool, but you cannot yet use the 'time-travel' feature. Please upvote this feature here.

The first option is to use an OPENROWSET query within a SQL Script in your Synapse Workspace:
-- Query the Delta Lake
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/places/',
    FORMAT = 'delta') as rows
ORDER BY Id;
Query the Delta Lake via an OPENROWSET query



















A second option is using Polybase by creating an External Table on the Delta Lake. This does requery you to create a database within the Serverless SQL Pool because you can't do that on the master database.
-- Query the Delta Lake

-- Create database because it wont work on the master database
CREATE DATABASE MyDwh;

-- Create External Data Source
CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
WITH ( location = 'abfss://yourcontainer@yourdatalake.dfs.core.windows.net/deltalake/' );

-- Create External File Format
CREATE EXTERNAL FILE FORMAT DeltaLakeFormat
WITH ( FORMAT_TYPE = DELTA );

-- Create External Table
CREATE EXTERNAL TABLE Residence (
     Id int,
     Residence VARCHAR(50)
) WITH (
        LOCATION = 'places', --> the root folder containing the Delta Lake files
        data_source = DeltaLakeStorage,
        FILE_FORMAT = DeltaLakeFormat
);

-- Get Data from your Delta Lake Table
SELECT          TOP 10 * 
FROM            Residence
ORDER BY        Id

























Conclusion
In this post you learned how to create and query a Delta Lake within your Synapse Analytics Workspace. The main advantage is of course that you now don't need Azure Data Bricks if you are already using Synapse. Making your Data Platform architecture just slightly more clearer and easier. 

A disadvantage, at the moment of writing, is the lack of time-traveling withing the Serverless SQL Pool environment. This means you're now forced to use notebooks to create your Data Warehouse when the latest version of your data is just not enough. So please upvote this feature here. There are some more limitations and know issues in the current version, but we think at least some of them will be solved in feature updates.

Thanks to colleague Jeroen Meidam for helping!


Wednesday 1 December 2021

ADF: Looping through pipelines and execute them

Case
I want to loop through my ADF pipelines and then execute them in a foreach loop, but the pipeline property of the Execute Pipeline activity doesn't support dynamic content. Is this possible?
Execute Pipeline





















Solution
The standard Execute Pipeline activity is pretty much useless for this specific case, but with another activity it is possible. However don't use this workaround to execute a whole bunch of similar pipelines like we used to do in the SSIS era. In that case it is just better to invest your time in creating a more flexible/configurable pipeline that can handle multiple tables or files.

Now the workaround:
Looping and executing pipelines













It uses a Web activity to get all pipelines via a Rest API. Then there is a Filter to get only a selection of all those pipelines. After that the Foreach loop with another Web activity in it will execute the pipelines via a Rest API call.

1) Access control (IAM)
This solution uses Rest APIs from Azure Data Factory. This means we need to give this ADF access to its own resources so that is can call those Rest APIs.
  • Go to your ADF in the Azure Portal
  • Click on the ellipsis button (three dots) to copy the ADF name
  • Click on Access control (IAM) in the left menu
  • Click on +Add and choose Add role assignment
  • Select the role with just enough access (less is more). Data Factory Contributor is perfect for this example
  • Then select members. In the search window you can paste your ADF name. Click on your ADF and then push the Select button
  • Now review and assign the role to your ADF
Give your ADF access to its own resources












2) Web activity - Get all pipelines
We need a collection of pipelines for the Foreach loop. The Rest API list-by-factory retrieves all pipelines from a single Data Factory. You need to prepare the Rest API url by replacing all parts between the curly braces with the info of your own ADF (also remove the curly braces):

https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/pipelines?api-version=2018-06-01

Tip: If you copy the URL of the ADF overview page in the Azure portal then you have all the information you need.
  • Add the Web activity to your pipeline and give it a suitable name (you need it in the next activity)
  • On the Settings tab enter the adjusted URL from above in the URL field
  • Select GET as method
  • Set Authenication to Managed Identity
  • Use https://management.azure.com/ in the Resource field
Web activity - Get all pipelines

























3) Filter pipelines
Now we have all pipelines available in our collection, but we need to add a filter to only get the required pipelines. The easiest way to do this is by putting them all in a specific folder or giving them all the same prefix. If you're using a folder then you can use the first expression. It first checks whether the pipeline contains a property named 'folder' because pipelines in the root will not have this property. Then it checks whether that property is filled with the value 'demo' (the name of our folder).
@and(
   contains(item().properties, 'folder'),
   equals(item().properties.folder.name,'demo')
)
If you want to use the prefix then the expression is less complex with only a startswith expression: @startswith(item().name, 'Sub_')

  • Add the Filter activity to the pipeline and give it a suitable name. We need it in the Foreach. Connect it to the Web activity.
  • For Items add the following expression @activity('Get All Pipelines').output.value (enter your own activity name)
  • For Condition add one of the above expressions
Filter activity






















4) Foreach loop
The foreach loop is very straightforward. Use the filter activity in the items field with an expression like this @activity('Filter on folder demo').output.value (replace the activity name).
Foreach activity

















5) Web activity - Execute pipeline
Within the foreach loop we need to add an other Web activity. This one will call the create-run Rest API which will execute an ADF pipeline. Just like in step 2 we need to adjust the example URL from the documentation:

https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/pipelines/{pipelineName}/createRun?api-version=2018-06-01

However we need to make it change every iteration of the Foreach because the pipeline name is part of the URL. You could create a expression where you only change that pipeline name and leave the rest hardcoded, but besides the pipeline name that collection of pipelines also contains a ID property that contains 90% of the URL we need. If you debug the pipeline and check the output of the first Web activity or the Filter activity you can check their output. The ID property is filled with something like:

/subscriptions/7q618f21-ad62-4e1d-9019-4a23beda7777/resourceGroups/RG_ADF_DEV/providers/Microsoft.DataFactory/factories/DataFactory2-DEV/pipelines/Sub_pipeline1

We only need to add something in front of it and behind of it and then you have the correct URL:
@concat('https://management.azure.com',
replace(item().id, ' ', '%20'),
'/createRun?api-version=2018-06-01'
)
The replace is to replace spaces, which are not allowed in a URL, by %20. So if a pipeline name contains a space then it will be replaced.
  • Add the Web activity to the Foreach and give it a suitable name 
  • On the Settings tab enter the expression above as URL by first clicking on the 'Add dynamic content' link below the field
  • Select POST as method
  • Now we don't need a Body for the Rest API but the Web activity requires it when the method is POST. Enter a dummy JSON message like: {dummy:"dummy"}
  • Set Authenication to Managed Identity
  • Use https://management.azure.com/ in the Resource field
The Web activity calls the pipelines asynchronous (execute and don't wait for an answer). If you want a synchronous call and perhaps get some feedback if the pipeline fails then you need to replace the Web activity by a Webhook activity.

6) The result
Now run the pipeline and check the result. Make sure to check the monitor. Then you will see one big disadvantage. Each execution will become a separate run with each its own Run ID. This means it will be a little bit more work to connect these in your logging, but it is possible because the output of the Web activity will return the Run ID.
ADF Monitor






Conclusion
In this post you learned how to execute pipelines in a loop via the Web activity and Rest APIs. Because each pipeline will get its own Run ID the logging needs some extra attention. You can also use the Web activity contruction to execute pipelines from an other Data Factory.

Unfortunately you cannot use this same trick for Data Flows because at te moment there is no Rest API to execute a pipeline (only in debug mode).




Saturday 27 November 2021

Synapse pipeline pass parameter to notebook

Case
I have a Synapse workspace notebook that I call from a Synapse pipeline, but I want to make it more flexible by adding parameters. How do you add parameters to a notebook and fill them via a pipeline?
Adding Parameters to your Synapse Notebook
























Solution
You can add variables to a special Code cell in the notebook and then use those as parameters within the Notebook activity. At the moment there is no real gui retrieving the parameters from the Notebook so you have to copy the names from the notebook to the Notebook activity in the pipeline.

1) Add Code cell for parameters
We need to add a Code cell and change it in to a parameter cell. Note that you can have only one parameter cell in your notebook. You want to add it somewhere at the top so that you can use its variables/parameters in the cells below this parameter cell.
  • Go to your notebook and add a new Code cell
  • Move it up. It should probably be your top code cell allowing you to use it in the cells below.
  • Click in the cell and then on the ellipsis button of that cell (button up right with three dots)
  • Choose Toggle parameter cell and you will see the word Parameters appear in the bottom right corner
Toggle parameter cell










2) Add variables to parameters cell
Next we need to add some code to the parameter cell. Here you just need to add some variables and then each variable can be overridden by the pipeline and be used in the cells below. For debugging it is usefull to give the variables a value. For this example we used python code.
Adding variables








3) Adjust Synapse Notebook activity
Last step is to edit the Synapse Notebook activity and add the parameters. For each variable you added to the parameters cell you can add a paramater in the notebook activity. At the moment there is no smart interface that lets you select a parameter and set its value. You have to set the name and datatype manually.
Adding parameters














4) Testing
Now run the pipeline to see the result. For this example we added a second Code cell with a print function to show that the default values have changed. Trigger the pipeline and go to the Monitor. Then click on your pipeline and within that pipeline on the Notebook activity. If you click on the pencil icon the notebook will open and allow you to see the result.
Click on pencil te open the Notebook













Note the extra cell and the result of the third cell















Conclusion
In this short post you learned how to add parameters to your notebook and fill them via the pipeline. And as an additional bonus you saw how to check the result of the changes. Next step is forexample to add the Notebook to a Foreach loop that ingest data to the datalake and then execute the notebook to create a Delta Lake table for each item in the Foreach loop.


Sunday 21 November 2021

ADF Release - Use parameters to enable Triggers

Case
During deployment of Azure Data Factory (ADF) via Azure DevOps pipelines I want to make sure that a certain trigger is only executed on Production and not on the lower environments. How can we do this without writing code (low-code)?

ADF Trigger



















Solution
This is possible by changing the ARM template parameter definition which on its turn will switch certain properties into overridable parameters during deployment. However, the triggers are not included by default in the parameter file. There is also a limitation that you cannot override every property, for example runtimeState to activate and deactivate the trigger. The workaround for this is to use the endTime property. 

More information about which properties are parameterized can be found here.

1) Understand parameters in ADF
Before we start overriding properties in the ARM template, it is good to understand the parameters in general. As you know, when start building your ADF, one of the first things you do is creating a Linked Service. By default, ADF knows that for example a connection string or using a Key Vault in a Linked Service should be parameterize, because the database server or the URL will be different per environment in a DTAP. The result is always two ARM template files: the content itself (ARMTemplateForFactory.json) and the parameters that can be overwritten (ARMTemplateParametersForFactory.json). Another file holds the definition of the parameters (arm-template-parameters-definition.json). 

When you start developing in a new ADF, the ARM template parameters file (result) only contains the ADF name that can be overwritten. When you have created a Linked Service, for example Azure Blob Storage, the file should look something like below. 
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {
            "value": "bitools-d-adf-dwh"
        },
        "LS_ABLB_bitools_connectionString": {
            "value": ""
        }
    }
}
You can check this via "Manage - ARM template - Export ARM template".

ADF Portal - Check the parameters

















2) Check Trigger code
Now back to our trigger. Based on the documentation, we know which properties we can parameterize for a trigger. Lets have a look at the code of the trigger itself.
  • In the ADF portal go to Manage (toolbox icon in left menu) and then to Triggers
  • Find your trigger and hover your mouse on it and click on the code icon {}
ADF portal - Check the code of your trigger

See below the JSON code of the trigger. You can override everything that is related to typeProperties. Unfortunately the runtimeState property is not one of them.
{
    "name": "Trigger_Master",
    "properties": {
        "description": "Test",
        "annotations": [],
        "runtimeState": "Started",
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "PL_Master",
                    "type": "PipelineReference"
                }
            }
        ],
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-01-01T00:00:00Z",
                "endTime": "2021-01-02T00:00:00Z",
                "timeZone": "UTC",
                "schedule": {
                    "minutes": [
                        10
                    ],
                    "hours": [
                        0
                    ]
                }
            }
        }
    }
}
Now that we have identified which properties can be parameterized, we need to know which property we want to override for our use case. As you know, we need to make sure the trigger should not be executed on every environment. One way to do this is to set the end date (and time) of a trigger. This property is called endTime. For example: a trigger with an end date on "01/02/2021 12:00 AM" will not be executed because this is in the past. When the end date is "12/31/9999 12:00 AM", the trigger will be executed because it is in the future.

Go to your trigger and set an end date and time in the future, for example 12/31/9999 12:00 AM.

ADF portal - Specify end date for trigger


































3) ARM template
Next step is to override the endTime property in the ARM template parameter definition. Unlike integration runtime or linked services properties, we need to add this property first. 
  • In the ADF portal go to Manage (same as step 2) and then to ARM template
  • Click on Edit parameter configuration
  • Search for "Microsoft.DataFactory/factories/triggers" and add the endTime property (that is part of recurrence) within typeProperties, set the value to "=:-endTime" and click the OK button. See below how your JSON should look like for the trigger part.
    "Microsoft.DataFactory/factories/triggers": {
        "properties": {
            "pipelines": [
                {
                    "parameters": {
                        "*": "="
                    }
                },
                "pipelineReference.referenceName"
            ],
            "pipeline": {
                "parameters": {
                    "*": "="
                }
            },
            "typeProperties": {
                "scope": "=",
                "recurrence": {
                    "endTime": "=:-endTime"
                }
            }
        }
    },
Now check the ARM template parameters via "Manage - ARM template - Export ARM template" (see step 1) and the result should look like this.
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {
            "value": "bitools-d-adf-dwh"
        },
        "LS_ABLB_bitools_connectionString": {
            "value": ""
        },
        "Trigger_Master_endTime": {
            "value": "9999-12-31T00:00:00Z"
        }
    }
}
In this case, we have set the default by using "=" in front of the value. Adding a minus - in front of parameter name (endTime) will remove "_properties_typeProperties" from the parameter name. More information here.

Note:
The global parameters are also not set by default. Click here how to include them in the ARM template parameters file as well.

4) Adjust release pipeline
If you are using YAML to publish the changes then the only thing you have to change is the overrideParameters property by adding the new parameter Trigger_Master_endTime and adding either a variable or a hardcoded value. The > behind the property helps you to break the string over multiple lines and keep the YAML code more readable.
          ###################################
          # Deploy ADF Artifact
          ###################################
          - task: AzureResourceManagerTemplateDeployment@3
            displayName: '4 Deploy ADF Artifact'
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: 'sc_mcacc-adf-devopssp'
              subscriptionId: $(DataFactorySubscriptionId)
              action: 'Create Or Update Resource Group'
              resourceGroupName: $(DataFactoryResourceGroupName)
              location: 'West Europe'
              templateLocation: 'Linked artifact'
              csmFile: '$(Pipeline.Workspace)/ArmTemplatesArtifact/ARMTemplateForFactory.json'
              csmParametersFile: '$(Pipeline.Workspace)/ArmTemplatesArtifact/ARMTemplateParametersForFactory.json'
              overrideParameters: > 
                -factoryName $(DataFactoryName)
                -LS_ABLB_bitools_connectionString $(AzureBlobConnectionString)
                -Trigger_Master_endTime $(AzureDataFactoryTriggerEndTimeActive)
              deploymentMode: 'Incremental'

            env: 
                SYSTEM_ACCESSTOKEN: $(System.AccessToken)
And if you're using the Release pipelines with the ARM template deployment task then you can just go to the Override template parameters property, click on the edit button and replace the value with a new value or a variable from a variable group.

ARM template deployment - Override template parameters


















Conclusion
In this post you learned how to add and override properties of a trigger during deployment via Azure DevOps. This allows you to activate or deactivate a trigger for that environment during deployment using ARM templates without writing any code

In a previous post we showed you how to accomplish this for a Linked Service in combination with Azure Key Vault.

Thursday 18 November 2021

ADF Release - Use script to enable certain Triggers

Case
During deployment of Azure Data Factory (ADF) via Azure DevOps pipelines I want to make sure that a certain trigger is only executed on Production and not on the lower environments like acceptance or test. How can we accomplish that without any manual operations? 
ADF Trigger
























Solution
This is possible with an extra PowerShell step. The standard deployment stages consists of three steps:
  • a pre-deployment script that stops all triggers.
  • the actual deployment
  • a post-deployment script that starts all triggers and cleans up old parts.
You could adjust the standard pre- and post deployment PowerShell script from Microsoft or create an additional PowerShell script if you don't want to mess around with the standard script from Microsoft. 

1) PowerShell
Below that additional script. Feel free to merge it with the standard script. The PowerShell file should be stored in the repository in the \CICD\PowerShell folder (see setup post).
PowerShell file for setting trigger status
























The new PowerShell script has five parameters which will be provided by the YAML pipeline (or release pipeline):
  1. DataFactoryName
    [string] Name of your Data Factory
  2. DataFactoryResourceGroup
    [string] Name of the Resource Group holding your ADF
  3. DataFactorySubscriptionId
    [string] Guid of the Azure Subscription hosting your ADF
  4. DisableAllTriggers
    [boolean] True or false indicating whether all triggers should be disabled (except triggers mentioned in next parameter)
  5. EnabledTriggers
    [string] Comma separated list with triggernames that should be enabled: "trigger1,trigger2"
The script consists of three parts. The first part checks all parameters. If one of them is incorrect then the scripts fails and stops. The second part is the optional disabling of all triggers (except the ones that we need enabled) and the last part of the script checks the list of triggers that should be enabled. If they are still disabled they will be enabled.
param
(
    [parameter(Mandatory = $true)] [String] $DataFactoryName,
    [parameter(Mandatory = $true)] [String] $DataFactoryResourceGroup,
    [parameter(Mandatory = $true)] [String] $DataFactorySubscriptionId,
    [parameter(Mandatory = $false)] [Bool] $DisableAllTriggers = $true,
    [parameter(Mandatory = $true)] [String] $EnabledTriggers # comma separated list
)



##############################################
# Check provided information
##############################################
$ErrorActionPreference = "Stop"

# Setting one subscription on active (fails with non existing)
Write-Host "Checking existance Subscription Id [$($DataFactorySubscriptionId)]."
$Subscription = Get-AzSubscription -SubscriptionId $DataFactorySubscriptionId `
                                   -WarningAction Ignore
Write-Host "- Subscription [$($Subscription.Name)] found."
Set-AzContext -Subscription $DataFactorySubscriptionId `
              -WarningAction Ignore > $null
Write-Host "- Subscription [$($Subscription.Name)] is active."


# Checking whether resource group exists (fails with non existing)
Write-Host "Checking existance Resource Group [$($DataFactoryResourceGroup)]."
Get-AzResourceGroup -Name $DataFactoryResourceGroup > $null
Write-Host "- Resource Group [$($DataFactoryResourceGroup)] found."


# Checking whether provided data factory exists (fails with non existing)
Write-Host "Checking existance Data Factory [$($DataFactoryName)]."
Get-AzDataFactoryV2 -ResourceGroupName $DataFactoryResourceGroup `
                    -Name $DataFactoryName > $null
Write-Host "- Data Factory [$($DataFactoryName)] found."


# Checking provided triggernames, first split into array
$EnabledTriggersArray = $EnabledTriggers.Split(",")
Write-Host "Checking existance of ($($EnabledTriggersArray.Count)) provided triggernames."


# Loop through all provided triggernames
foreach ($EnabledTrigger in $EnabledTriggersArray)
{ 
    # Get Trigger by name
    $CheckTrigger = Get-AzDataFactoryV2Trigger -ResourceGroupName $DataFactoryResourceGroup `
                                               -DataFactoryName $DataFactoryName `
                                               -Name $EnabledTrigger `
                                               -ErrorAction Ignore # To be able to provide more detailed error

    # Check if trigger was found
    if (!$CheckTrigger)
    {
        throw "Trigger $($EnabledTrigger) not found in data dactory $($DataFactoryName) within resource group $($DataFactoryResourceGroup)"
    }
}
Write-Host "- All ($($EnabledTriggersArray.Count)) provided triggernames found in data dactory $($DataFactoryName) within resource group $($DataFactoryResourceGroup)"



##############################################
# Disable triggers
##############################################
# Check if all trigger should be disabled
if ($DisableAllTriggers)
{
    # Get all enabled triggers and stop them (unless they should be enabled)
    Write-Host "Getting all enabled triggers that should be disabled."
    $CurrentTriggers = Get-AzDataFactoryV2Trigger -ResourceGroupName $DataFactoryResourceGroup `
                                                   -DataFactoryName $DataFactoryName `
                       | Where-Object {$_.RuntimeState -ne 'Stopped'} `
                       | Where-Object {$EnabledTriggersArray.Contains($_.Name) -eq $false}

    # Loop through all found triggers
    Write-Host "- Number of triggers to disable: $($CurrentTriggers.Count)."
    foreach ($CurrentTrigger in $CurrentTriggers)
    {
        # Stop trigger
        Write-Host "- Stopping trigger [$($CurrentTrigger.Name)]."
        Stop-AzDataFactoryV2Trigger -ResourceGroupName $DataFactoryResourceGroup -DataFactoryName $DataFactoryName -Name $CurrentTrigger.Name -Force > $null
    }
}



##############################################
# Enable triggers
##############################################
# Loop through provided triggernames and enable them
Write-Host "Enable all ($($EnabledTriggersArray.Count)) provided triggers."
foreach ($EnabledTrigger in $EnabledTriggersArray)
{                   
    # Get trigger details
    $CheckTrigger = Get-AzDataFactoryV2Trigger -ResourceGroupName $DataFactoryResourceGroup `
                                               -DataFactoryName $DataFactoryName `
                                               -Name $EnabledTrigger

    # Check status of trigger
    if ($CheckTrigger.RuntimeState -ne "Started")
    {
        Write-Host "- Trigger [$($EnabledTrigger)] starting"
        Start-AzDataFactoryV2Trigger -ResourceGroupName $DataFactoryResourceGroup `
                                     -DataFactoryName $DataFactoryName `
                                     -Name $EnabledTrigger `
                                     -Force > $null
    }
    else
    {
        Write-Host "- Trigger [$($EnabledTrigger)] already started"
    }
}

2) YAML Pipeline
You can now extend the existing YAML pipeline with an extra step. Make sure that all parameters for this script are available as variables in the variable group (under Pipelines, Library) and make sure to pass them to the second YAML pipeline as parameters. If you followed the previous blogs then you only need to add EnabledTriggers as variable and a YAML parameter.
          ###################################
          # Enable certain triggers and disable rest
          ###################################
          - task: AzurePowerShell@5
            displayName: '6 Enable certain triggers and disable rest'
            inputs:
              azureSubscription: 'sc_adf-devopssp'
              pwsh: true
              azurePowerShellVersion: LatestVersion
              scriptType: filePath
              scriptPath: '$(Pipeline.Workspace)\s\CICD\powershell\SetTriggers.ps1'
              scriptArguments: > # Use this to avoid newline characters in multiline string
                -DataFactoryName $(DataFactoryName)
                -DataFactoryResourceGroup $(DataFactoryResourceGroupName)
                -DataFactorySubscriptionId $(DataFactorySubscriptionId)
                -DisableAllTriggers $true
                -EnabledTriggers $(EnabledTriggers) # format: "prd_daily_4am,prd_daily_1pm"
The result of running the pipeline










Conclusion

In this post you learned how to enable only certain triggers for a specific environment. This makes it easy to generate a trigger in development for the production environment. The downside (for some) is ofcourse that you get an extra piece of code to maintain. In a next post we will show that you can also accomplish this without writing code via the ARM template. However the trigger property runtimeState cannot be set via the ARM template, so a workaroumd is necessary for the nocode variant.


Sunday 14 November 2021

ADF Release - Update Linked Service while deploying

Case
I'm deploying Azure Data Factory via DevOps pipelines through my DTAP environment. During the deployment I want to change the URL of the Linked Service from Azure Key Vault to point it to the Key Vault of that specific environment. How do I change that Linked Service in DevOps?
ADF Linked Service






















Solution
This is possible by changing the ARM template parameter definition which on its turn will switch certain properties into overridable parameters during deployment. There is one downside: you cannot create a parameter for one specific Linked Service, because it will work for all Linked Services with that same property. However you can narrow it down to one particular type of Linked Service (for all Key Vaults in this example). 

1) Check Linked Service
For this example we will override the URL of the Azure Key Vault Linked Service, but first we need to find the actual property name that we want to override.
  • In ADF Studio go to Manage (toolbox icon in left menu) and then to Linked Services.
  • Now find your Key Vault Linked Service and hover your mouse over .it and click on the code icon {}.
  • Now check which property identifies a specific Key Vault. It should be within the typeProperties tag. In this case the baseUrl property contains a URL that points to one specific Key Vault.
baseURL property











Also notice the type property which will be used further on: AzureKeyVault


2) ARM template
Next step is to make this property overridable in the ARM template parameter definition (arm-template-parameters-definition.json). First the easiest way:
  • Under the same Manage menu item as step 1 go to ARM template
  • Click on Edit parameter configuration
  • Now find the baseUrl property within the Linked Services tag and change its value from "=" to "-" and then click on the OK button
Making baseUrl overridable
























This will create a new parameter with the name: [LinkedServiceName]_properties_typeProperties_baseUrl

As mentioned before, this will now work for all Linked Services that have a (filled) property called baseUrl. A bit nicer is to instead create a Key Vault specific parameter by adding a piece JSON code below the general tag with the *. The name 'AzureKeyVault' from the code below can be found in the code of step 1.
 
        "AzureKeyVault": {
            "properties": {
                "typeProperties": {
                    "baseUrl": "-"
                }
            }
        },
























This will result in the same (long) parameter name, but now only for Linked Services pointing to Azure Key Vault. We can shorten that very long parameter name by adding -:-BaseUrl where the colon : is the separator for the next part of the property. This is the name of the parameter. Adding a minus - in front of that name will remove _properties_typeProperties from the parametername and shorten it to:
[LinkedServiceName]_baseUrl 
        "AzureKeyVault": {
            "properties": {
                "typeProperties": {
                    "baseUrl": "-:-BaseUrl"
                }
            }
        }
This is much nicer. More info about this can be found in the documentation.

3) Adjust release pipeline
If you are using YAML to publish the changes then the only thing you have to change is the overrideParameters property by adding the new parameter ls_kv_bitools_baseUrl and adding either a variable or a hardcoded value. The > behind the property helps you to break the string over multiple lines and keep the YAML code more readable.
          ###################################
          # Deploy ADF Artifact
          ###################################
          - task: AzureResourceManagerTemplateDeployment@3
            displayName: '4 Deploy ADF Artifact'
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: 'sc_mcacc-adf-devopssp'
              subscriptionId: $(DataFactorySubscriptionId)
              action: 'Create Or Update Resource Group'
              resourceGroupName: $(DataFactoryResourceGroupName)
              location: 'West Europe'
              templateLocation: 'Linked artifact'
              csmFile: '$(Pipeline.Workspace)/ArmTemplatesArtifact/ARMTemplateForFactory.json'
              csmParametersFile: '$(Pipeline.Workspace)/ArmTemplatesArtifact/ARMTemplateParametersForFactory.json'
              overrideParameters: > 
                -factoryName $(DataFactoryName) 
                -ls_kv_bitools_baseUrl "https://bitools-prd.vault.azure.net/"
              deploymentMode: 'Incremental'

            env: 
                SYSTEM_ACCESSTOKEN: $(System.AccessToken)
If you are not sure about which parameternames you can use in the YAML, then you can lookup that name by exporting the ARM template under ARM template. Then check arm_template.json or arm_template_parameters.json
export template to find parametername













And if you're using the Release pipelines with the ARM template deployment task then you can just go to the Override template parameters property, click on the edit button and replace the value with a new value or a variable from a variable group.
ARM template deployment - Override template parameters

















Conclusion
In this post you learned how to override properties of a Linked Service during deployment via Azure DevOps. This allows you to point your Linked Service to the correct service for that specific environment. The most likely candidate for this is probably the Linked Service pointing to Azure Key Vault where you store all other connection details.

In a previous post we also showed you how to change Global Parameters during deployment and in a next post we will show you how to change triggers during deployment because you probably don't want to use the same triggers on development, test, acceptance and production.



Thursday 4 November 2021

ADF Release - Set global params during deployment

Case
I'm using global parameters in my Azure Data Factory project and I want to change their values during deployment through my DTAP environments. How do I do that?
ADF Global Parameters


















Solution
In this post we are only focusing on the deployment part of the DevOps YAML pipeline. If you do not have a pipeline yet then please read the complete story of configuring the Development ADF and releasing it to other Factories first.

1) Include in ARM template
First make sure that in the pane of the Global Parameters the checkbox "Include in ARM template" is checked. This will add the parameters to the ARM template which will make them available as a parameters during deployment.
Check Include in ARM template













2) Getting name of parameter
The global parameter will get a slightly different name in the ARM template. For example: myGlobParam becomes dataFactory_properties_globalParameters_myGlobParam_value. You can check that name by exporting the ARM template under ARM template. Then check arm_template.json or arm_template_parameters.json
Find the parameter name










An other option is to hit the publish button and then check the (not used) adf_publish branch in the repository.
Find the parameter name













3) Adjust release pipeline
If you are using YAML to publish the changes then the only thing you have to change is the overrideParameters property by adding the new parameter dataFactory_properties_globalParameters_myGlobParam_value and adding either a variable or a hardcoded value. The > behind the property helps you to break the string over multiple lines and keep the YAML code more readable.
          ###################################
          # Deploy ADF Artifact
          ###################################
          - task: AzureResourceManagerTemplateDeployment@3
            displayName: '4 Deploy ADF Artifact'
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: 'sc_mcacc-adf-devopssp'
              subscriptionId: $(DataFactorySubscriptionId)
              action: 'Create Or Update Resource Group'
              resourceGroupName: $(DataFactoryResourceGroupName)
              location: 'West Europe'
              templateLocation: 'Linked artifact'
              csmFile: '$(Pipeline.Workspace)/ArmTemplatesArtifact/ARMTemplateForFactory.json'
              csmParametersFile: '$(Pipeline.Workspace)/ArmTemplatesArtifact/ARMTemplateParametersForFactory.json'
              overrideParameters: > 
                -factoryName $(DataFactoryName) 
                -dataFactory_properties_globalParameters_myGlobParam_value "Test123"
              deploymentMode: 'Incremental'

            env: 
                SYSTEM_ACCESSTOKEN: $(System.AccessToken)
After deployment you can see the new parameter value











And if you're using the Release pipelines with the ARM template deployment task then you can just go to the Override template parameters property, click on the edit button and replace the value with a new value or a variable from a variable group.
ARM template deployment - Override template parameters





















Conclusion
In this post you learned how to use Global Parameters from ADF as ARM template parameters for Azure Data Factory. This allows you to use different settings in the various factories in you DTAP environments. In a next post we will show you how to override properties from for example a Linked Service or a Trigger that should get a different value in acceptance or production.