Sunday 19 March 2023

Synapse - Using Spark SQL to time travel Delta Tables

Case
In a previous blog post you showed how to create and query Delta Tables with PySpark for a Lake House. However, most Data Warehouse people are more familiar with the SQL language. How can you query a Delta Table with the good old SQL language?
Using Spark SQL to time travel Delta Tables

Solution
In that previous blog post we showed you that you can query the Delta Tables in, for example, a Serverless SQL pool by creating External Tables on those Delta Tables. This allows you to use T-SQL to query Delta Tables, but it doesn't allow you to time travel: you always get the latest version of the data.
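As a reminder, such an external table in the Serverless SQL pool looks roughly like the sketch below. This is a minimal, hedged example: the data source and file format names are made up and the column list of the cities table is only an assumption.
-- Minimal sketch for a Serverless SQL pool (names and columns are assumptions)
CREATE EXTERNAL DATA SOURCE eds_mydatalake
WITH (LOCATION = 'abfss://mysource@mydatalake.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT eff_delta
WITH (FORMAT_TYPE = DELTA);

CREATE EXTERNAL TABLE dbo.Cities
(
    CityName VARCHAR(100),
    Country  VARCHAR(100)
)
WITH (
    LOCATION    = 'silver/cities',
    DATA_SOURCE = eds_mydatalake,
    FILE_FORMAT = eff_delta
);
Querying this external table always returns the current state of the Delta Table; there is no VERSION AS OF option here.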
External Tables on Delta in Serverless SQL Pool

However, we can use Synapse Notebooks with Spark SQL, a language that is very similar to T-SQL, to query Delta Tables. This allows you to time travel the data in a familiar language.

1) Add Delta Table to Lake Database
To easily query Delta Tables you first need to make them visible in Synapse by adding them to the Lake Database. We explained this in the previous blog post.
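As a short reminder, adding an existing Delta Table to the Lake Database is a single Spark SQL statement (a minimal sketch, using the example path from the query below):
%%sql
-- Register the existing Delta Table in the Lake Database (path is an example)
CREATE TABLE IF NOT EXISTS silver.cities
USING DELTA
LOCATION 'abfss://mysource@mydatalake.dfs.core.windows.net/silver/cities'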
Adding Delta Table to Lake Database

Once the Delta Table is available in the Lake Database you can query it like a regular table. By default you will see the latest version of the data.
%%sql
SELECT * FROM silver.cities
The alternative is to use the entire path:
%%sql
SELECT * FROM delta.`abfss://mysource@mydatalake.dfs.core.windows.net/silver/cities`
2) Show historical versions
You can check which historical versions are available with the DESCRIBE HISTORY command.
%%sql
DESCRIBE HISTORY silver.cities
Show versions of the Delta Table

Besides showing the history you can also check where the Delta Table is stored in your Data Lake with the DESCRIBE EXTENDED command. It will give you various details like the location of the Delta Table.
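For example, for the table used above:
%%sql
-- Shows column, partitioning and detailed table information, including the storage location
DESCRIBE EXTENDED silver.cities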
See details of Delta Table

3) Show a specific version by version number
With the DESCRIBE HISTORY command you get a table with the various versions of your table. The first column shows the version number, which starts with 0 for the initial version of the table.

When you query a Delta Table you can add VERSION AS OF X behind the query, where you replace X with the version number. In this example we take version 2 (the third version of the table).
%%sql
SELECT * FROM silver.cities VERSION AS OF 2
Showing version 2 of the Delta Table

4) Show a specific version by date
Time traveling with a specific version number is cumbersome because you first need to determine the version you need. Luckily you can also get the version that was active on a specific date by adding TIMESTAMP AS OF "2022-01-01" behind the query.
%%sql
SELECT * FROM silver.cities TIMESTAMP AS OF "2022-01-01"
Showing version of a specific date

Conclusions
In this post you learned how to time travel a Delta Table with Spark SQL. These are the same options as with PySpark, but for some people just a little bit more readable. In a next post we will discuss the Change Data Feed to get the data changes between versions.

Synapse - Add existing Delta Table to Lake Database

Case
How can I query my lake house files and tables in Synapse without specifying the entire data lake path for each file or table?
Querying your Lake House in Synapse

Solution
In the Data tab (left menu) of the Synapse Workspace you can create a Lake database and then add your files and tables to it. By default there is already a Lake Database present called 'default', but it won't be visible until you add tables to it or add other databases.

1) Create database
Let's create new databases. We will create one each for Bronze, Silver and Gold, and for the raw files we could create a separate database called Ingest for easy querying. The documentation will mention either a CREATE DATABASE or a CREATE SCHEMA command; they are the same thing.
  • Create a new notebook with either SPARK SQL or PySpark as language.
  • Attach it to a Spark pool.
  • In the first Code block add the code below and execute it. The IF NOT EXISTS and the COMMENT are optional.
%%sql
CREATE DATABASE IF NOT EXISTS Ingest COMMENT 'Raw files';
CREATE DATABASE IF NOT EXISTS Bronze COMMENT 'Raw Layer';
CREATE DATABASE IF NOT EXISTS Silver COMMENT 'Validated Layer';
CREATE DATABASE IF NOT EXISTS Gold COMMENT 'Enriched Layer';
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS Ingest COMMENT 'Raw files'")
spark.sql("CREATE DATABASE IF NOT EXISTS Bronze COMMENT 'Raw Layer'")
spark.sql("CREATE DATABASE IF NOT EXISTS Silver COMMENT 'Validated Layer'")
spark.sql("CREATE DATABASE IF NOT EXISTS Gold COMMENT 'Enriched Layer'")
SPARK SQL Code block

Now go to the Data tab and you will see a Lake database list with the default database plus the four you created.
Synapse Lake databases

2) Create table on Parquet file
For the raw files we will create a table based on the Parquet (or flat) files from your ingestion from the source into the data lake. If the file has a timestamp in the name you could even use a wildcard in the path.
  • Go back to your notebook and create a second Code block.
  • Add the code below and execute it. It will create a table in the Ingest Lake database.
  • Then go to the Data tab and unfold the Ingest database and its tables (if you are too fast you might have to refresh the tables list).
%%sql
CREATE TABLE IF NOT EXISTS Ingest.Cities
USING PARQUET
LOCATION 'abfss://mysource@mydatalake.dfs.core.windows.net/Ingestfolder/Cities*.parquet'
%%pyspark
# table_name and parquet_path are variables you define yourself (e.g. via notebook parameters)
spark.sql(f'CREATE TABLE IF NOT EXISTS Ingest.{table_name} USING PARQUET LOCATION \'{parquet_path}\'')

Lake table based on a Parquet file from the data lake

3) Create table on Delta Table
For the Bronze (or Silver or Gold) layer we will create a table based on an existing Delta Table from the data lake.
  • Go back to your notebook and create a third Code block
  • Add the code below and execute it. It will create a table in the Bronze Lake database
  • Then go to the Data tab and unfold the Bronze (or Silver or Gold) database and its tables (if you are too fast you might have to refresh the tables list).
%%sql
CREATE TABLE IF NOT EXISTS Bronze.Cities
USING DELTA
LOCATION 'abfss://mysource@mydatalake.dfs.core.windows.net/Bronze/Cities'
%%pyspark
# table_name and delta_table_path are variables you define yourself
spark.sql(f'CREATE TABLE IF NOT EXISTS Bronze.{table_name} USING DELTA LOCATION \'{delta_table_path}\'')

Lake table based on a Delta Table from the data lake

4) Query the new tables
Now you can query the files and Delta Tables like regular database tables, either in a notebook with a Spark pool running or in a SQL script. There is one difference between those two: in the SQL script you get the default dbo schema between the database name and the table name. This is mandatory in the SQL script, but not allowed in a notebook.
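A minimal sketch of both variants, assuming the Bronze.Cities table created above. In a notebook (Spark pool) there is no dbo schema:
%%sql
SELECT * FROM Bronze.Cities
And in a SQL script (Serverless SQL pool) the dbo schema is mandatory:
SELECT TOP (10) * FROM Bronze.dbo.Cities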
Query in notebook with SPARK SQL

Query in a SQL Script

5) Create views on new tables
You can also create views on those new tables in which you already add some business logic that is useful for other colleagues working with this data. You could even create views for dimensions and facts on your Silver tables and then serve them to Power BI.
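A minimal sketch of such a view, assuming hypothetical column names on the Silver.Cities table:
%%sql
-- Example dimension view with some business logic (column names are assumptions)
CREATE OR REPLACE VIEW Silver.DimCity AS
SELECT  CityName
,       Country
,       CASE WHEN Population >= 100000 THEN 'Large' ELSE 'Small' END AS CitySize
FROM    Silver.Cities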
Create view on new tables

Conclusion
In this post you learned how to make your data lake files and Delta Lake tables easier to query. It takes a few extra steps for each table, but after that querying is much easier. In a future post we will show you how to do this directly during the INGEST and DELTA steps. Then you don't have any manual steps for each new file or table.

In the next post we will show you how to use time travel on those Delta Tables with Spark SQL. This post is the starting point for that post.

Special thanks to colleague Martijn Broeks for helping out.

Sunday 5 March 2023

Synapse snack - Get child pipeline value to parent

Case
We can pass values from the parent pipeline to the child pipeline via parameters, but how do we get return values from the child pipeline to the parent pipeline?
Pipeline return value

Solution
In the past we used a Webhook activity to call a child pipeline via the REST API and then a Web activity in the child pipeline to return a value via the callbackuri. This workaround was way too complex and you ended up with two different executions that are not related to each other.

Last month Microsoft introduced, for both the Synapse Workspace and Azure Data Factory, the new Pipeline return value option in the Set variable activity. This allows you to return one or more (hardcoded) values from the child pipeline to the parent pipeline.

1) Child pipeline - Set Variable
First we need to create a child pipeline that we will be calling from a parent pipeline in the next step. The only required activity is the Set Variable activity.
  • So first create a new pipeline. We called it PL_Child
  • Add a Set Variable activity to the canvas of your child pipeline. Ours is called Return Value
  • In the Settings tab of the activity set the Variable type to Pipeline return value (preview)
  • Now you can create a new string variable with a hardcoded return value. For expressions you need to change the type to, for example, Expression. Note that compared to pipeline variables you have way more types to choose from.
Return value

Note that this is not an existing pipeline variable, so other activities cannot change its value. Also note that the Value field cannot be overruled with an expression. So only hardcoded values.

2) Parent pipeline - Execute Pipeline
Now we need the parent pipeline that will be calling the child pipeline to get the return value.
  • Create a new pipeline. Ours is called PL_Parent
  • Add an Execute Pipeline activity to the canvas of your new pipeline
  • Set it to execute the child pipeline of step 1
  • Make sure Wait on completion is checked!
Execute Pipeline to retrieve return value

Note that if you execute this parent pipeline you won't see the return value in the Output window. This means we need another activity to see that return value.
Output with no visible return value

3) Get return value
To get the return value you can use an expression with pipelineReturnValue after the output of your Execute Pipeline activity, followed by the name of your return value: @activity('EPL_GetAnswer').output.pipelineReturnValue.MyAnswer

In this example we will store the return value in a pipeline variable. Note that if you used anything other than String, Boolean or Array you need to add a type conversion in the expression:
@string(activity('EPL_GetAnswer').output.pipelineReturnValue.MyAnswer)
Read return value from output

Output showing return value

Note that you cannot have multiple Set Variable activities running and all returning values (even if they use different names). Then it will only return one of those, causing an error in your expression. You can have multiple Set Variable activities returning values if you put them in, for example, an If Condition so that only one of them will run.

Conclusions
In this little snack you learned about the new preview(!) feature in ADF and Synapse to return values from the child to the parent pipeline. A very nice new feature that we were waiting for a long time. It could use some small improvements, like the expressions (why do we need an Expression type?) and an option to see the output of the Execute Pipeline activity.

Monday 13 February 2023

Show dataset parameters in Azure Synapse Dataflow

Case
I have a parameterized dataset. The parameter is showing in the pipeline's Copy Data Activity, but it's not showing in the Synapse (or ADF) dataflow under source or sink. When going to the Data preview it shows me an error message: No value provided for Parameter 'MyFileName'.
No value provided for Parameter 'abc'

Solution
You added a parameter to your dataset so you can for example use it in a foreach construction in your pipeline.
Dataset with parameter

In your pipeline's Copy Data Activity you can see the dataset parameter as soon as you select your dataset with the parameter.
Parameter showing in Copy Data Activity

However, when you select that same dataset as a source (or sink) you won't see the parameter appearing in the editor. On this screen there is no way to provide the parameter, and when you debug the source you will get an error stating that you need to provide a value for that parameter: No value provided for Parameter 'MyFileName'.
Dataset selected, but no parameter

The first option is to provide a default value for your dataset parameter in the dataset itself. Now you won't get that error. This is perhaps suitable in some cases, but for most cases it isn't.
Default value for parameter in dataset

Now go to your pipeline and add a Dataflow Activity for your Dataflow. You will see the parameter appear in the Settings tab (not in the Parameters tab, which is for Dataflow parameters only). This is handy for when you want to debug the Pipeline and Dataflow at the same time, but not for when you just want to debug your Dataflow to see the data preview.

Parameter is back again

Now go to your Dataflow and click on Debug Settings. Within the Debug Settings go to the Parameters tab. Find your source under Dataset parameters and provide a value for debugging.
Provide value for parameter in Dataflow Debug Settings

Go to your source (or sink) in the Dataflow and then go to Data preview to see the actual data.
There is data in our preview

Conclusion
In this post you learned how to debug your Dataflow when using a parameterized dataset. Compared to other parts of Synapse (or ADF) it would probably make more sense to first create a Dataflow parameter (with a default value for debugging) and then show the Dataset parameter in the Source (or Sink) settings page, where you override it with the Dataflow parameter.

Saturday 31 December 2022

Cleanup Synapse before deployment

Case
We are using the Synapse workspace deployment add on for deploying Synapse to the next environment, but when I remove a pipeline/dataset/linked service/trigger in development it doesn't get deleted in Test, Acceptance or Production. There is a pre- and post-deployment script for Data Factory by Microsoft, but where is the Synapse version of it?
Clean Synapse Workspace before deployment

Update: use DeleteArtifactsNotInTemplate: true in the deployment task to avoid PowerShell.
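A minimal sketch of that option in the deployment task (the other properties are the same as in the deployment step shown later in this post):
            - task: Synapse workspace deployment@2
              displayName: 'Validate and deploy Synapse'
              inputs:
                operation: validateDeploy
                ArtifactsFolder: '$(Pipeline.Workspace)/SynapseArtifact'
                azureSubscription: ${{ parameters.ServiceConnection }}
                ResourceGroupName: ${{ parameters.ResourceGroupName }}
                TargetWorkspaceName: ${{ parameters.TargetWorkspaceName }}
                DeleteArtifactsNotInTemplate: true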

Solution
The deployment add-on just deploys a new version over an existing version, so deleted parts will remain in your workspace. For most parts this is ugly and annoying, but if obsolete triggers are still executing pipelines it could screw up your ETL process.

Below you will find a first version of a Powershell script that first removes all pipelines, datasets, linked services and triggers before deploying a new version of your Synapse workspace from the repository.

Note: Pipelines that are called by other pipelines can't be deleted. So you first need to delete the parent pipeline before you can delete the child pipeline. The script skips those child pipelines and continues with the rest. After this first delete iteration a lot of parent pipelines won't exist anymore, which allows you to remove the child pipelines in a second iteration. This is done in a loop and that loop stops after 100 iterations. So don't create a monstrous tree of pipelines calling each other (and especially don't create loops of pipelines calling each other). The same trick is used for Linked Services. If you have for example a Key Vault Linked Service that is used in another Linked Service, then you first need to delete that second Linked Service before you can delete the Key Vault Linked Service.

param (

   [Parameter (Mandatory = $true, HelpMessage = 'Synapse name')]
   [ValidateNotNullOrEmpty()]
   [string] $WorkspaceName,
   
   [Parameter (Mandatory = $true, HelpMessage = 'Resourcegroup name')]
   [ValidateNotNullOrEmpty()]
   [string] $ResourceGroupName 
)

[string] $WorkspaceDefaultSqlServer = "$($WorkspaceName)-WorkspaceDefaultSqlServer"
[string] $WorkspaceDefaultSqlStorage = "$($WorkspaceName)-WorkspaceDefaultStorage"


#######################################################
# 1) Checking for resource locks and removing them
#######################################################
Write-Output "========================================"
Write-Output "1) Getting resource locks"
# Getting all locks on the Azure Synapse Workspace
$lock = Get-AzResourceLock -ResourceGroupName $ResourceGroupName -ResourceName $WorkspaceName -ResourceType "Microsoft.Synapse/workspaces"

# Looping through all locks to remove them one by one
Write-Output "========================================"
Write-Output "Remove resource locks"
if($null -ne $lock)
{
    $lock | ForEach-Object -process {
        Write-Output "Removing Lock Id: $($lock.LockId)"
        # Remove lock
        Remove-AzResourceLock -LockId $_.LockId -Force
    }
}


#######################################################
# 2) Stopping and removing all triggers
#######################################################
Write-Output "========================================"
Write-Output "2) Remove triggers"
# Getting all triggers from Synapse
$triggers = Get-AzSynapseTrigger -WorkspaceName $WorkspaceName
Write-Output "Found $($triggers.Count) triggers"

# Stopping all triggers before deleting them
$triggers | ForEach-Object -process { 
    Write-Output "Stopping trigger $($_.name)"
    try {
        # Trying to stop each trigger
        Stop-AzSynapseTrigger -WorkspaceName $WorkspaceName -Name $($_.name) -ErrorAction Stop
    }
    catch {
        if ($_.Exception.Message -eq "{}") {
            Write-Output "Trigger stopped"
           # $_.Exception
        }
        else {
            Write-Output "Throw"
            Throw $_
        }
    }
    # Remove trigger
    Remove-AzSynapseTrigger -Name $_.name -WorkspaceName $WorkspaceName -Force
}


#######################################################
# 3) Removing all pipelines
#######################################################
Write-Output "========================================" 
Write-Output "3) Remove pipelines"
# Getting all pipelines from Synapse
$pipelines = Get-AzSynapsePipeline -WorkspaceName $WorkspaceName | Sort-Object -Property id
Write-Output "Found $($pipelines.Count) pipelines"

# Trying to delete all pipelines. If a pipeline is still referenced
# by another pipeline it will continue to remove other pipelines
# before trying to remove it again... max 100 times. So don't create
# chains of pipelines that are too long
[int] $depthCount = 0
while ($pipelines.Count -gt 0 -and $depthCount -lt 100)
{
    Write-Output "$($pipelines.Count) pipelines left"
    $pipelines | ForEach-Object -process { 
        Write-Output "Trying to delete pipeline $($_.name)"
        Remove-AzSynapsePipeline -Name $_.name -WorkspaceName $WorkspaceName -Force -ErrorAction SilentlyContinue
    }
    Start-Sleep 2 
    $depthCount += 1
    $pipelines = Get-AzSynapsePipeline -WorkspaceName $WorkspaceName
}
Write-Output "Depthcount: $depthCount"
if ($depthCount -eq 100)
{
    throw "Depthcount is to high!"
}


#######################################################
# 4) Removing all notebooks
#######################################################
Write-Output "========================================"
Write-Output "4) Remove notebooks"
# Getting all notebooks from Synapse
$notebooks = Get-AzSynapseNotebook -WorkspaceName $WorkspaceName
Write-Output "Found $($notebooks.Count) notebooks"

# Loop through all notebooks to delete them
$notebooks | ForEach-Object -process {
    Write-Output "Deleting notebooks $($_.Name)"
    Remove-AzSynapseNotebook -Name $($_.Name) -WorkspaceName $WorkspaceName -Force
}


#######################################################
# 5) Removing all SQL scripts
#######################################################
Write-Output "========================================"
Write-Output "5) Remove SQL scripts"
# Getting all scripts from Synapse
$sqlscripts = Get-AzSynapseSqlScript -WorkspaceName $WorkspaceName
Write-Output "Found $($sqlscripts.count) SQL-scripts"

# Loop through all SQL scripts to delete them
$sqlscripts | ForEach-Object -Process {
    Write-Output "Deleting SQL-script $($_.Name)"
    Remove-AzSynapseSqlScript -Name $($_.Name) -WorkspaceName $WorkspaceName -Force
}


#######################################################
# 6) Removing all datasets
#######################################################
Write-Output "========================================"
Write-Output "6) Remove datasets"
# Getting all datasets from Synapse
$datasets = Get-AzSynapseDataset -WorkspaceName $WorkspaceName
Write-Output "Found $($datasets.Count) datasets"

# Loop through all datasets to delete them
$datasets | ForEach-Object -process { 
    Write-Output "Deleting dataset $($_.name)"
    Remove-AzSynapseDataset -Name $_.name -WorkspaceName $WorkspaceName -Force
}


#######################################################
# 7) Removing all linked services
#######################################################
Write-Output "========================================"
Write-Output "7) Collecting Linked services"
# Getting all linked services from Synapse, except the two default ones
$lservices = Get-AzSynapseLinkedService -WorkspaceName $WorkspaceName | Where-Object {($_.Name -ne $WorkspaceDefaultSqlServer -and  $_.Name -ne $WorkspaceDefaultSqlStorage) } 
Write-Output "Found $($lservices.Count) linked services"

# Trying to delete all linked services. If a linked service is still
# referenced by another linked service it will continue to remove
# other linked services before trying to remove it again... 
# max 100 times. Example: KeyVault linked services
$depthCount = 0
while ($lservices.Count -gt 0 -and $depthCount -lt 100)
{
    Write-Output "$($lservices.Count) linked services left"
    $lservices | ForEach-Object -process { 
        Write-Output "Trying to delete linked service $($_.name)"
        Remove-AzSynapseLinkedService -Name $_.name -WorkspaceName $WorkspaceName -Force -ErrorAction Continue
    }

    Start-Sleep 2 
    $depthCount += 1
    $lservices = Get-AzSynapseLinkedService -WorkspaceName $WorkspaceName | Where-Object {($_.Name -ne $WorkspaceDefaultSqlServer -and  $_.Name -ne $WorkspaceDefaultSqlStorage) }
}
Write-Output "Depthcount: $depthCount"
if ($depthCount -eq 100)
{
    throw "Depthcount is to high!"
}
Write-Output "========================================"


You need to store this PowerShell file in the repository as ClearSynapse.ps1. In this case we created a CICD folder in the root of the repository and within that folder a PowerShell subfolder for all our PowerShell files. Then you can call this script in your YAML pipeline just before the deployment part. Make sure your service connection (Service Principal) has enough rights within the workspace. For the first post we used the Synapse Artifact Publisher role to minimize access. For running this script your Service Principal needs more: Synapse Administrator.
           ###################################
            # 4 Cleanup Synapse
            ###################################
            - task: AzurePowerShell@5
              displayName: '4 Cleanup Synapse'
              inputs:
                azureSubscription: ${{ parameters.ServiceConnection }}
                scriptType: filePath
                scriptPath: $(Pipeline.Workspace)\s\CICD\Powershell\ClearSynapse.ps1
                scriptArguments:
                  -WorkspaceName ${{ parameters.TargetWorkspaceName }} `
                  -ResourceGroupName ${{ parameters.ResourceGroupName }}
                azurePowerShellVersion: latestVersion
                pwsh: true
                
           ###################################
            # 5 Validate and Deploy Synapse
            ###################################
            - task: Synapse workspace deployment@2
              displayName: '5 Validate and deploy Synapse'
              inputs:
                operation: validateDeploy
                ArtifactsFolder: '$(Pipeline.Workspace)/SynapseArtifact'
                azureSubscription: ${{ parameters.ServiceConnection }}
                ResourceGroupName: ${{ parameters.ResourceGroupName }}
                TargetWorkspaceName: ${{ parameters.TargetWorkspaceName }}
                OverrideArmParameters: '
                  -LS_AKV_MyKeyVault_properties_typeProperties_baseUrl                  https://${{ parameters.KeyVaultName }}.vault.azure.net/
                  '
Conclusion
In this post you learned how to clean up Synapse with a little PowerShell script. This script works perfectly, but it is a little rough by just deleting all basic parts of your workspace (pipelines, datasets, linked services and triggers). A next, nicer version will just delete everything that is in the Synapse Workspace but isn't in the repository (after the deployment).

Deleting stuff before deploying new stuff also makes it almost mandatory to use at least three environments, because when your deployment fails you are left with an almost empty Synapse workspace. So an extra environment between development and production will prevent most deployment screw-ups.

Thank you Walter ter Maten for improving the script with the delete iterations.

Monday 7 November 2022

Create Data Lake containers and folders via DevOps

Case
I need a process to create Azure Data Lake containers throughout the DTAP environments of my Azure Data Platform. Doing it manually is not an option because we want to minimize owner and contributor access to the Data Lake of acceptance and production, but Synapse and Data Factory don't have a standard activity to create ADL containers. How can we automatically create Azure Data Lake containers (and folders)?
Storage Account (data lake) containers

Solution
An option is to use a PowerShell script that is executed by the Custom activity in combination with an Azure Batch service, or an Azure Automation runbook with the same PowerShell script that is executed by a Web(hook) activity.

However, since you probably don't need to create new containers during every (ADF/Synapse) pipeline run, we suggest doing this via an Azure DevOps pipeline as part of your CICD process with the same PowerShell script. You could either create a separate CICD pipeline for it or integrate it in your Synapse or ADF pipeline.

The example below creates containers and optionally also folders and subfolders within these containers. Synapse and Data Factory will create folders themselves with, for example, the Copy Data activity.

1) Repos folder structure
For this example we use a CICD folder in the repos with subfolders for PowerShell, YAML and Json.
Repos folder structure

2) JSON config
Because we don't want to hardcode the containers and folders, we use a JSON file as input for the PowerShell script. This JSON file is stored in the Json folder of the DevOps repository. We use the same JSON file for the entire environment, but you can of course create a separate file for each environment if you, for example, need different containers on production. Our file is called config_storage_account.json

The folder array in this example is optional and when left empty no folders will be created. You can create subfolders within folders by separating them with a forward slash.
{
    "containers": {
        "dataplatform": ["folder1", "folder2/subfolder1", "folder2/subfolder2"],
        "SourceX": ["Actual", "History"],
        "SourceY": ["Actual", "History"],
        "SourceZ": []
    }
}

3) PowerShell code
The PowerShell script called SetStorageAccounts.ps1 is stored in the PowerShell folder and contains three parameters:
  • ResourceGroupName - The name of the resource group where the storage account is located.
  • StorageAccountName - The name of the storage account
  • JsonLocation - The location of the json config file in the repos (see previous step)
It checks the existence of both the config file and the storage account. Then it loops through the containers from the config, and within that container loop it loops through the folders of that specific container. For container names and folder paths it does some small corrections for often-made mistakes.

Note that the script will not delete containers and folders (or set authorizations on them). This is of course possible, but make sure to test it very thoroughly; even with testing, a human error in the config file is easy to make and could cause lots of data loss!
# This PowerShell will create the containers provided in the JSON file
# It does not delete or update containers and folders or set authorizations
param (
    [Parameter (Mandatory = $true, HelpMessage = 'Resource group name of the storage account.')]
    [ValidateNotNullOrEmpty()]
    [string] $ResourceGroupName,

    [Parameter (Mandatory = $true, HelpMessage = 'Storage account name.')]
    [ValidateNotNullOrEmpty()]
    [string] $StorageAccountName,

    [Parameter (Mandatory = $true, HelpMessage = 'Location of config_storage_account.json on agent.')]
    [ValidateNotNullOrEmpty()]
    [string] $JsonLocation
 )

# Combine path and file name for JSON file. The file name is hardcoded and the
# same for each environment. Create an extra parameters for the filename if
# you need different files/configurations per environment.
$path = Join-Path -Path $JsonLocation -ChildPath "config_storage_account.json"
Write-output "Extracting containernames from $($path)"


# Check existence of file path on the agent
if (Test-Path $path -PathType leaf) {
    
    # Get all container objects from JSON file
    $Config = Get-Content -Raw -Path $path | ConvertFrom-Json

    # Create containers array for looping
    $Config | ForEach-Object { 
        $Containers = $($_.containers) | Get-Member -MemberType NoteProperty | Select-Object -ExpandProperty Name
    }
    
    # Check Storage Account existence and get its context
    $StorageCheck = Get-AzStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -ErrorAction SilentlyContinue        
    $context = $StorageCheck.Context

    # If Storage Account found
    if ($StorageCheck) {
        # Storage Account found
        Write-output "Storage account $($StorageAccountName) found"

        # Loop through container array and create containers if they don't exist
        foreach ($container in $containers) {
            # First a little cleanup of the container

            # 1) Change to lowercase
            $container = $container.ToLower()

            # 2) Trim accidental spaces
            $container = $container.Trim()


            # Check if container already exists
            Write-output "Checking existence of container $($container)"
            $ContainerCheck = Get-AzStorageContainer -Context $context -Name $container -ErrorAction SilentlyContinue 

            # If container exists
            if ($ContainerCheck) {
                Write-Output "Container $($container) already exists"
            }
            else {
                Write-Output "Creating container $($container)"
                New-AzStorageContainer -Name $container -Context $context | Out-Null
            }

            # Get container folders from JSON
            Write-Output "Retrieving folders from config"
            $folders = $Config.containers.$container
            Write-Output "Found $($folders.Count) folders in config for container $($container)"

            # Loop through container folders
            foreach ($folder in $folders) {
                # First a little cleanup of the folders

                # 1) Replace backslashes by a forward slash
                $path = $folder.Replace("\","/")

                # 2) Remove unwanted spaces
                $path = $path.Trim()
                $path = $path.Replace("/ ","/")
                $path = $path.Replace(" /","/")

                # 3) Check if path ends with a forward slash
                if (!$path.EndsWith("/")) {
                    $path = $path + "/"
                }
                    
                # Check if folder path exists
                $FolderCheck = Get-AzDataLakeGen2Item -FileSystem $container -Context $context -Path $path  -ErrorAction SilentlyContinue
                if ($FolderCheck) {
                    Write-Output "Path $($folder) exists in container $($container)"
                } else {
                    New-AzDataLakeGen2Item -Context $context -FileSystem $container -Path $path -Directory | Out-Null
                    Write-Output "Path $($folder) created in container $($container)"
                }
            }

        }
    } else {
        # Provided storage account not correct
        Write-Output "Storageaccount: $($StorageAccountName) not available, containers not setup."
    }
} else {
    # Path to JSON file incorrect
    Write-output "File $($path) not found, containers not setup."
}
4) YAML file
If you integrate this in your existing Data Factory or Synapse YAML pipeline, you only need to add one PowerShell step. Make sure you have a checkout step to copy the config and PowerShell files from the repository to the agent. You may also want to add a (temporary) treeview step to check the paths on your agent. This makes it easier to configure paths within your YAML code.
parameters:
  - name: SerCon
    displayName: Service Connection
    type: string
  - name: Env
    displayName: Environment
    type: string
    values: 
    - DEV
    - ACC
    - PRD
  - name: ResourceGroupName
    displayName: Resource Group Name
    type: string
  - name: StorageAccountName
    displayName: Storage Account Name
    type: string

jobs:
    - deployment: deploymentjob${{ parameters.Env }}
      displayName: Deployment Job ${{ parameters.Env }} 
      environment: Deploy to ${{ parameters.Env }}

      strategy:
        runOnce:
          deploy:
            steps:
            ###################################
            # 1 Check out repository to agent
            ###################################
            - checkout: self
              displayName: '1 Retrieve Repository'
              clean: true 

            ###################################
            # 2 Show environment and treeview
            ###################################
            - powershell: |
                Write-Output "Deploying Synapse in the ${{ parameters.Env }} environment"
                tree "$(Pipeline.Workspace)" /F
              displayName: '2 Show environment and treeview Pipeline_Workspace'

            ###################################
            # 3 Create containers in datalake
            ###################################
            - task: AzurePowerShell@5
              displayName: '3 Create data lake containers'
              inputs:
                azureSubscription: ${{ parameters.SerCon }}
                scriptType: filePath
                scriptPath: $(Pipeline.Workspace)\s\CICD\PowerShell\SetStorageAccounts.ps1
                scriptArguments:
                  -ResourceGroupName ${{ parameters.ResourceGroupName }} `
                  -StorageAccountName ${{ parameters.StorageAccountName }} `
                  -JsonLocation $(Pipeline.Workspace)\s\CICD\Json\
                azurePowerShellVersion: latestVersion
                pwsh: true

5) The result
Now it's time to run the YAML pipeline and check the Storage Account to see whether the containers and folders are created.
DevOps logs of creating containers and folders

Created data lake folders in container

Conclusion
In this post you learned how to create containers and folders in the Storage Account / Data Lake via a little PowerShell script and a DevOps pipeline, but you can also reuse this PowerShell script for the mentioned alternative solutions.

Again, a note about also deleting containers and folders: make sure to double-check the code, but also the procedures, to avoid human errors and potentially losing a lot of data. You might want to set up soft deletes in your storage account to have a fallback scenario for screw-ups.

Monday 3 October 2022

Deploy Synapse workspaces via DevOps - Pipeline

Case
I want to deploy my development Synapse workspace to the next environment (test, acceptance or production). What is the easiest way to automate this process via DevOps? And is it possible to ignore the publish button, just like in Data Factory?
Release Synapse Workspace via DevOps

Solution
With the new (updated) Synapse add-on for DevOps it is much easier to release Synapse than it was to release Data Factory. And if you use the validateDeploy operation (instead of deploy), you don't need the workspace_publish branch. It can read directly from the collaboration branch, so you don't have to use the publish button to initiate the CICD process.

This solution consists of two separate main posts and a couple of side posts.
Additional posts

5) Preparation
Make sure to do the preparations described in the previous post. Add two empty files to the CICD\YAML folder (or your own folder setup):
  • Synapse.yml
  • DeploySynapse.yml
Add two YAML files

6) The YAML pipeline
In this example we will create an artifact first and then deploy that artifact to the test/acceptance/production environment of Synapse, but depending on the branch strategy you could skip that step and directly publish from a branch in the repository. The artifact build and artifact publish are separated in two different YAML files.

Synapse.yml

The first step is the trigger: when should the pipeline start? In this case if the branch name is equal to 'main' and the change happens in the 'Synapse' folder. So changes in the 'CICD' folder will not trigger the pipeline.
###################################
# When to create a pipeline run
###################################
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - Synapse/*
The second step is creating the first stage. It will become the first blue, green or red circle in the overview. It also cleans the workspace, which is handy for self-hosted agents. And it mentions the agent pool, in this case a Microsoft-hosted agent.
stages:
###################################
# Create Artifact of Synapse files
###################################
- stage: CreateSynapseArtifact
  displayName: Create Synapse Artifact

  jobs:
  - job: CreateArtifact
    displayName: 'Create Artifact'
    workspace:
      clean: all
    pool:
      vmImage: 'windows-latest' #'ubuntu-latest'
    steps:
The third block (step 1) retrieves the content of the repository to the agent. This allows us to create an artifact of the Synapse files that are stored in the repository.
    ###################################
    # 1 Retrieve Repository
    ###################################
    - checkout: self
      displayName: '1 Retrieve Repository'
      clean: true
The fourth block (step 2) is optional. It just shows a treeview of the agent which is very handy for debugging your YAML pipelines to make sure you are mentioning the right folder or file in any of the tasks. This is explained in detail in a previous post.
    ###################################
    # 2 Show treeview of agent
    ###################################
    - powershell: |
        tree "$(Pipeline.Workspace)" /F
      displayName: '2 Treeview of Pipeline.Workspace'
The fifth block (step 3) is the copying of all Synapse files to an artifact staging folder. Optionally you could skip this part and publish directly from the Synapse folder.
    ###################################
    # 3 Stage artifact
    ###################################
    - task: CopyFiles@2
      displayName: '3. Copy Artifact'
      inputs:
        contents: |
          **\*.*
        SourceFolder: 'Synapse'
        TargetFolder: '$(build.artifactstagingdirectory)'
The sixth block (step 4) is publishing all the files located in the artifact staging folder.
    ###################################
    # 4 Publish artifact
    ###################################
    - task: PublishPipelineArtifact@1
      displayName: '4 Publish template as artifact'
      inputs:
        targetPath: $(Build.ArtifactStagingDirectory)
        artifact: 'SynapseArtifact'
        publishLocation: 'pipeline'
The last block in this YAML file calls the next YAML file with parameters, so that you can reuse this step for all environments (Test/Acceptance/Production). The values are hardcoded in this example, but you should of course try to use a variable group from the Pipeline Library. This makes it much easier to change those parameter values.

This example contains four parameters. The first is just to show the environment when writing values to the screen (debugging). The ServiceConnection is the name of your ARM Service Connection that you created in the preparation post. The last two point to the correct environment of Synapse.
###################################
# Deploy Acceptance environment
###################################
- stage: DeployAcc
  displayName: Deploy ACC
#   variables:
#     - group: SynapseParametersAcc
  pool:
    vmImage: 'windows-latest'
  jobs:
    - template: DeploySynapse.yml
      parameters:
        Environment: ACC
        ServiceConnection: SynapseServiceConnection
        ResourceGroupName: rg_dwh_acc
        TargetWorkspaceName: dwhacc


DeploySynapse.yml

The first code block are the parameters. In this example the 4 string parameters with the values that are provided in the first YAML file.
###################################
# Parameters
###################################
parameters:
  - name: Environment
    displayName: Environment
    type: string
    values:
    - TST
    - ACC
    - PRD
  - name: ServiceConnection
    displayName: Service Connection
    type: string
  - name: ResourceGroupName
    displayName: Resource Group Name
    type: string
  - name: TargetWorkspaceName
    displayName: Target Workspace Name
    type: string

The second block consists of some starter code, but the environment piece is important if you want to add rules like approvals. So make sure to create an environment 'Deploy Synapse to ACC', or choose your own name.
###################################
# Start
###################################
jobs:
    - deployment: deploymentjob${{ parameters.Environment }}
      displayName: Deployment Job ${{ parameters.Environment }} 
      environment: Deploy Synapse to ${{ parameters.Environment }}

      strategy:
        runOnce:
          deploy:
            steps:

The third block (step 1) is about getting the data from the repository. If you use the artifact then you could skip this code, because the job will automatically start with a download artifact step. If you want to publish directly from the collaboration branch, or you need to execute for example some extra PowerShell scripts from the repos as well, then you need this step.

If you want to use the publish branch then you will find some example code for that as well. This allows you to keep the YAML files in the collaboration branch instead of the publish branch. You will need to change the operation in the last step to 'deploy' and change some of its properties (not described in this post).
            ###################################
            # 1 Check out repository to agent
            ###################################
            # - checkout: 'git://YourProjectName/YourReposName@workspace_publish'
            #   path: workspace_publish
            - checkout: self
              displayName: '1 Retrieve Repository'
              clean: true 
The fourth block (step 2) is again the optional treeview to check the path of folders and files on your agent. Very handy, but once your code works fine then you can comment-out this part.
            ###################################
            # 2 Show environment and treeview
            ###################################
            - powershell: |
                Write-Output "Deploying Synapse in the ${{ parameters.Environment }} environment"
                tree "$(Pipeline.Workspace)" /F
              displayName: '2 Show environment and treeview Pipeline_Workspace'
The fifth and last block (step 3) is the actual deployment of Synapse. The DeleteArtifactsNotInTemplate option removes pipelines, datasets, linked services, etc. from your test/acceptance/production environment that you removed from the development environment. This is also the place where you can override parameters and linked services, which will be explained in a separate post.
            ###################################
            # 3 validateDeploy
            ###################################
            - task: Synapse workspace deployment@2
              displayName: '3 Deploy Synapse Workspace'
              inputs:
                operation: validateDeploy
                ArtifactsFolder: '$(Pipeline.Workspace)/SynapseArtifact'
                azureSubscription: ${{ parameters.ServiceConnection }} 
                ResourceGroupName: ${{ parameters.ResourceGroupName }} 
                TargetWorkspaceName: ${{ parameters.TargetWorkspaceName }} 
                DeleteArtifactsNotInTemplate: true
                # OverrideArmParameters: '
                # -workspaceName $(syn_wrk_name)
                # -ls_akv_mykeyvault_properties_typeProperties_baseUrl $(syn_mykeyvault)
                # '

Note 1: If you get an error Stderr: error: missing required argument 'factoryId', then please check this post

Note 2: If you get an error: Stderr: 'node' is not recognized as an internal or external command, operable program or batch file, then please check this post

7) The result
Now create a pipeline from the existing YAML file in your repository and make sure to run the pipeline (manually or triggered) to see the result.
Successfully deployed Synapse

Conclusion
In this second post we described all the steps of the YAML pipeline and successfully executed the pipeline. In a follow-up post we will explain more details about overriding parameters during the deployment. Also see Microsoft's own documentation for CICD for Synapse, but at the moment of writing it is not up to date with info on task version 2.

To see the available operations and related properties of this task you can also use the 'Show assistant' option in the YAML editor in Azure DevOps. Another option is to use the Release Pipeline editor and then hit the View YAML button.
GUI of the task via Show assistant

View YAML of Release pipeline task

Deploy Synapse workspaces via DevOps - Setup

Case
I want to deploy my development Synapse workspace to the next environment (test, acceptance or production). What is the easiest way to automate this process via DevOps? And is it possible to ignore the publish button, just like in Data Factory?
Release Synapse Workspace via DevOps

Solution
With the new (updated) Synapse add-on for DevOps it is much easier to release Synapse than it was to release Data Factory. And if you use the validateDeploy operation (instead of deploy), you don't need the workspace_publish branch. It can read directly from the collaboration branch, so you don't have to use the publish button to initiate the CICD process.

This solution consists of two separate main posts and a couple of side posts.
  1. Setup Synapse and DevOps in preparation of the pipeline (this post).
  2. Setup the YAML pipeline to do the actual deployment.
Additional posts

1) Setup Git repository
Set up your Synapse Workspace to use a Git repository. You can find this in Synapse under the toolbox icon (Manage) in the left menu. Besides choosing the right collaboration branch (that differs per organization and branch strategy), it is also useful to change the Root folder to for example /Synapse/. This allows you to create a separate folder in the root for your CICD files, like YAML and PowerShell scripts.
Git repository setup in Synapse

In your repository it should look something like this where the Synapse files are separated from the CICD files. Make sure to create a CICD folder and a YAML sub-folder to accommodate the pipeline files from the next post.
Synapse in the (DevOps) Repository

2) Give Service Principal Access
To do the actual deployment of the Synapse Workspace, you want to use a Service Principal. Create one or ask your AAD administrator to provide one if you are not authorized to create one yourself.

We want to give this Service Principal (SP) the minimal rights in the target workspace to do the deployment. For this we will give it the Synapse Artifact Publisher role within Synapse. You can do this in Synapse under the toolbox icon (manage) in the left menu. Then choose Access control and use the +Add button to give the SP the correct role. In the next step we will create a Service Connection in Azure DevOps with this SP. Do this for all target workspaces (tst/acc/prd).
Access control - Make SP Synapse Artifact Publisher

If your Service Principal didn't get the correct authorization then you will get the following error during the deployment in DevOps.
Start deploying artifacts from the template.
Deploy LS_AKV_AAA of type linkedService
For Artifact: LS_AKV_AAA: ArtifactDeploymentTask status: 403; status message: Forbidden
Failed
deploy operation failed
An error occurred during execution: Error: Linked service deployment failed "Failed"
##[error]Encountered with exception:Error: Linked service deployment failed "Failed"
For Artifact: LS_AKV_AAA: Deploy artifact failed: {"error":{"code":"Unauthorized","message":"The principal 'aaaaaaaa-bbbb-cccc-dddd-12345678' does not have the required Synapse RBAC permission to perform this action. Required permission: Action: Microsoft.Synapse/workspaces/linkedServices/write, Scope: workspaces/mySynapseAcc."}}
Unauthorized

3) Setup DevOps Service Connection
The next step is to create a Service Connection in DevOps. In the Project settings of your DevOps project you can find the option Service connections under Pipelines. You need to create a new Service Connection of the type Azure Resource Manager (ARM) for which you need the Service Principal Id (application id), the Service principal key (the secret) and the Tenant Id of your Azure Active Directory. Make sure to give the service connection a useful name. You will need the Service Connection name in the YAML code of the next post.
Add Service Connection

4) Add Synapse workspace deployment Add on
Microsoft made the deployment of a Synapse workspace a little easier than for Data Factory by creating a DevOps add-on for Synapse. You need to add this to your DevOps organization by clicking the green button with Get it free. If you are not a DevOps organization administrator, you need to ask someone else to approve the installation.
Synapse workspace deployment add-on

If you already have this add-on, make sure to update it to at least version 2.3.0. You can find the add-on in the Organization Settings under General - Extensions.
Check version of extension

Conclusion
In this first post we showed some preparations that are not that difficult, but you will need the right access for them, or be able to ask a colleague who does have access to the AAD and the DevOps organization. In the next post we will create a YAML pipeline that consists of two YAML files to do the actual deployment.