Wednesday, 5 April 2023

Synapse snack - Increment counter variable

Case
I want to increment a variable to count some actions, for example to break out of an Until loop after 4 iterations. What is the Synapse (or ADF) equivalent of counter++ or counter = counter + 1? Because you cannot self-reference in a variable expression, you get the following error:
The expression contains self referencing variable.
A variable cannot reference itself in the expression.

Solution
Unfortunately you cannot self-reference a variable in the expression of a Set variable activity. If you want to increment a variable called Counter, then you can't use an expression like this:
@add(int(variables('Counter')), 1)
1) Add extra variable
The trick is to use two variables (and two Set variable activities). In this example, besides the Counter variable, we also have a second 'helper' variable called NewCounterValue. Both are of type String, since there is no Integer type available for variables.
Two variables to increment a variable value

2) Add first Set variable
Now add two Set variable activities to the canvas of your pipeline, in this case inside an Until loop. The first activity sets the value of the helper variable NewCounterValue: it retrieves the value of the original Counter variable, converts it to an integer and adds 1. After that, the result is converted back to a string:
@string(
    add(
        int(variables('Counter')),
        1
    )
)
Increment variable part 1

3) Add second Set variable
The second activity simply retrieves the value of the new helper variable and uses it to set the value of the Counter variable. There is no need to cast the variable to another type.
Increment variable part 2

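In plain Python the two Set variable activities boil down to this two-step pattern (a minimal sketch; the dictionary stands in for the pipeline variables):

```python
def increment_counter(variables):
    """Mimic the two Set variable activities. All values are strings,
    just like the pipeline variables (there is no Integer variable type)."""
    # 1st Set variable: NewCounterValue = @string(add(int(variables('Counter')), 1))
    variables["NewCounterValue"] = str(int(variables["Counter"]) + 1)
    # 2nd Set variable: Counter = @variables('NewCounterValue')
    variables["Counter"] = variables["NewCounterValue"]

variables = {"Counter": "0", "NewCounterValue": ""}
for _ in range(4):
    increment_counter(variables)
print(variables["Counter"])  # "4"
```

The second assignment is allowed because it reads from the helper variable, not from Counter itself, which is exactly why the workaround needs two Set variable activities.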
4) Until loop
Now let's test it in an Until loop that stops after 4 iterations.
The Until loop

The Until loop in action

Conclusions
In this post we showed you how to overcome a little shortcoming in Synapse and ADF. It's a handy construction to do something at most X times, for example checking a status a maximum of 4 times. The Until loop stops whenever the status changes or when the retry counter reaches 4 or higher:
@or(
    greaterOrEquals(int(variables('Counter')),4)
,   not(equals(activity('WEB_Check_Status').output.status,'PENDING'))
)
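A plain-Python sketch of how this Until condition behaves (the web activity itself is not modeled; the status value 'PENDING' comes from the expression above):

```python
def should_stop(counter: str, status: str) -> bool:
    # @or(greaterOrEquals(int(variables('Counter')), 4),
    #     not(equals(activity('WEB_Check_Status').output.status, 'PENDING')))
    return int(counter) >= 4 or status != "PENDING"

# Status never changes, so the loop stops after 4 checks
counter, status = "0", "PENDING"
while not should_stop(counter, status):
    counter = str(int(counter) + 1)  # the two-variable increment
print(counter)  # "4"
```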
Case Example for retry

Saturday, 1 April 2023

Synapse - Cleanup workspace after deployment

Case
A while ago we created a script to clean up the Synapse Workspace before deploying new pipelines, datasets, etc. to Synapse. This helps you remove old items, like for example triggers, because the Synapse deployment is incremental and does not do any deletes. That script works fine, but for workspaces with a lot of items it could take up to 30 minutes to clean up everything. In that post we already mentioned some future improvements...
Cleanup Synapse Workspace after deployment

Update: use DeleteArtifactsNotInTemplate: true in the deployment task to avoid PowerShell

Solution

This version of the cleanup script compares the items in the Synapse Workspace against the Artifact that you just deployed. Everything that is in the workspace but not in the artifact will now be deleted (afterwards). This will significantly shorten the cleanup period during deployment.
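Per artifact type the comparison is essentially a set difference: whatever exists in the workspace but not in the artifact folder is a deletion candidate. A sketch in Python with hypothetical pipeline names:

```python
# Base names of the *.json files in the artifact's pipeline folder
artifact_pipelines = {"PL_Load", "PL_Master"}
# Pipeline names currently present in the Synapse Workspace
workspace_pipelines = {"PL_Load", "PL_Master", "PL_Old"}

# In the workspace but not in the artifact -> remove after deployment
to_delete = workspace_pipelines - artifact_pipelines
print(sorted(to_delete))  # ['PL_Old']
```

The PowerShell script performs the same comparison with Compare-Object for triggers, pipelines, notebooks, SQL scripts, datasets and linked services.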

You need to add this YAML step to your Synapse deployment pipeline (preferably after the deployment step) and change the path of the PowerShell script to match your setup. The first parameter is the Synapse Workspace name of the environment you are deploying to. The second one is the corresponding Resource Group name. The last one is the location of the artifact, in this case the Pipeline.Workspace variable + "\s\Synapse" (see screenshot below of the treeview on the agent).
###################################
# 5 Cleanup Synapse
###################################
- task: AzurePowerShell@5
  displayName: '5 Cleanup Synapse'
  inputs:
    azureSubscription: ${{ parameters.ServiceConnnection }}
    scriptType: filePath
    scriptPath: $(Pipeline.Workspace)\s\CICD\Powershell\ClearSynapse.ps1
    scriptArguments:
      -WorkspaceName ${{ parameters.TargetWorkspaceName }} `
      -ResourceGroupName ${{ parameters.ResourceGroupName }} `
      -ArtifactDirectory $(Pipeline.Workspace)\s\Synapse
    azurePowerShellVersion: latestVersion
    pwsh: true
Showing the path of the Artifact root folder

Save the PowerShell Script below in your repository and change the path in the above YAML script. We like to have a CICD folder in the root to store everything deployment related.
 
Repos folder structure

param (
   [Parameter (Mandatory = $true, HelpMessage = 'Synapse name')]
   [ValidateNotNullOrEmpty()]
   [string] $WorkspaceName,
   
   [Parameter (Mandatory = $true, HelpMessage = 'Resourcegroup name')]
   [ValidateNotNullOrEmpty()]
   [string] $ResourceGroupName,
   
   [Parameter (Mandatory = $true, HelpMessage = 'Artifact Directory')]
   [ValidateNotNullOrEmpty()]
   [string] $ArtifactDirectory
)

# Two default Linked Services we cannot remove
[string] $WorkspaceDefaultSqlServer = "$($WorkspaceName)-WorkspaceDefaultSqlServer"
[string] $WorkspaceDefaultSqlStorage = "$($WorkspaceName)-WorkspaceDefaultStorage"

# A little dummy protection to check whether you have configured the right folder
# If these folders are not available there is probably something wrong and we
# don't want to delete everything in the workspace.
if (!(Test-Path -Path (Join-Path $ArtifactDirectory "integrationRuntime")) -AND
    !(Test-Path -Path (Join-Path $ArtifactDirectory "linkedService")) -AND 
    !(Test-Path -Path (Join-Path $ArtifactDirectory "pipeline")))
    {
        Write-Output "Artifact folder: $($ArtifactDirectory)"
        throw "Dummy protection - Probably not the right folder that stores your artifact"
    }


#######################################################
# 1) Checking for resource locks and removing them
#######################################################
Write-Output "==========================================="
Write-Output "1) Removing resource locks"
Write-Output "==========================================="

# Getting all locks on the Azure Synapse Workspace
$lock = Get-AzResourceLock -ResourceGroupName $ResourceGroupName -ResourceName $WorkspaceName -ResourceType "Microsoft.Synapse/workspaces"
Write-Output "Found $($lock.Count) locks"

# Check if the collection of Azure resource locks is not empty
if($null -ne $lock)
{
    # Looping through all resource locks to remove them one by one
    $lock | ForEach-Object -process {
        # Remove lock
        Write-Output "Removing Lock Id: $($_.LockId)"
        Remove-AzResourceLock -LockId $_.LockId -Force
    }
}
Write-Output "Step 'Removing resource locks' completed`r`n"



#######################################################
# 2) Stop and remove Triggers not in Artifact
#######################################################
Write-Output "==========================================="
Write-Output "2) Stop and remove Triggers not in Artifact"
Write-Output "==========================================="

# Check if the artifact contains any triggers
if (Test-Path -Path (Join-Path $ArtifactDirectory "trigger"))
{
    # Getting all Triggers from Artifact
    $ArtifactTriggers = Get-ChildItem -Path (Join-Path $ArtifactDirectory "trigger") -Filter "*.json" | Select-Object -ExpandProperty BaseName

    # Getting all Triggers from Synapse
    $SynapseTriggers = Get-AzSynapseTrigger -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name

    # Getting Triggers from Synapse that are not in the Artifact
    $Triggers = Compare-Object -ReferenceObject $ArtifactTriggers -DifferenceObject $SynapseTriggers | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
} else {
    # Fill collection with all existing triggers to remove them
    # because artifact doesn't contain any triggers anymore
    Write-Output "Path not found in Artifact, removing all existing Triggers in Synapse Workspace"
    $Triggers = Get-AzSynapseTrigger -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
}
Write-Output "Found $($Triggers.Count) Triggers that are not in the Artifact"

# Stopping all Triggers before deleting them
$Triggers | ForEach-Object -process { 
    Write-Output "Stopping Trigger $($_)"
    try {
        # Trying to stop each Trigger
        Stop-AzSynapseTrigger -WorkspaceName $WorkspaceName -Name $($_) -ErrorAction Stop
    }
    catch {
        if ($_.Exception.Message -eq "{}")
        {
            # Ignore failures for triggers that are already stopped
            Write-Output "Trigger stopped"
        }
        else {
            # Unexpected error
            Write-Output "Something went wrong while stopping trigger!"
            Throw $_
        }
    }
    # Remove trigger
    Write-Output "Removing Trigger $($_)"
    Remove-AzSynapseTrigger -Name $_ -WorkspaceName $WorkspaceName -Force
}
Write-Output "Step 'Stop and remove Triggers not in Artifact' completed`r`n"



#######################################################
# 3) Remove Pipelines not in Artifact
#######################################################
Write-Output "==========================================="
Write-Output "3) Remove Pipelines not in Artifact"
Write-Output "==========================================="

# Check if artifact contains any pipelines
if (Test-Path -Path (Join-Path $ArtifactDirectory "pipeline"))
{
    # Getting all Pipelines from Artifact
    $ArtifactPipelines = Get-ChildItem -Path (Join-Path $ArtifactDirectory "pipeline") -Filter "*.json" | Select-Object -ExpandProperty BaseName

    # Getting all Pipelines from Synapse
    $SynapsePipelines = Get-AzSynapsePipeline -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name

    # Getting Pipelines from Synapse that are not in the Artifact
    $Pipelines = Compare-Object -ReferenceObject $ArtifactPipelines -DifferenceObject $SynapsePipelines | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
} else {
    # Fill collection with all existing pipelines to remove them 
    # because artifact doesn't contain any pipelines anymore
    Write-Output "Path not found in Artifact, removing all existing Pipelines in Synapse Workspace"
    $Pipelines = Get-AzSynapsePipeline -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
}
Write-Output "Found $($Pipelines.Count) Synapse Pipelines that are not in the Artifact"

# Trying to delete all pipelines. If a pipeline is still referenced
# by another pipeline, the script continues removing other pipelines
# before trying to remove it again... max 100 times. So don't create
# pipeline chains that are too long
[int] $depthCount = 0
while ($Pipelines.Count -gt 0 -and $depthCount -lt 100)
{
    # Loop through collection of pipelines and try to remove them
    $Pipelines | ForEach-Object -process { 
        Write-Output "Trying to delete pipeline $($_)"
        Remove-AzSynapsePipeline -Name $_ -WorkspaceName $WorkspaceName -Force -ErrorAction SilentlyContinue
    }

    # Wait 2 seconds before retry and raise retry counter
    Start-Sleep -Seconds 2
    $depthCount += 1

    # Check if artifact contains any pipelines
    if (Test-Path -Path (Join-Path $ArtifactDirectory "pipeline"))
    {   
        # Getting all Pipelines from Synapse
        $SynapsePipelines = Get-AzSynapsePipeline -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
        
        # Getting Pipelines from Synapse that are not in the Artifact
        $Pipelines = Compare-Object -ReferenceObject $ArtifactPipelines -DifferenceObject $SynapsePipelines | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
    }  else {
        # Fill collection with all existing pipelines to remove them 
        # because artifact doesn't contain any pipelines anymore
        $Pipelines = Get-AzSynapsePipeline -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
    }
    
    # Check if there are any pipelines left for a retry
    if ($Pipelines.count -gt 0)
    {
        Write-Output "Still found $($Pipelines.Count) Synapse Pipelines that are not in the Artifact. Starting next iteration."
    } else {
        Write-Output "Deletion of Pipelines not in Artifact completed"
    }
}
# Error when you have circular pipeline references or just way too many levels
if ($depthCount -eq 100)
{
    throw "Too many levels of child pipelines or circular references!"
}
Write-Output "Step 'Remove Pipelines not in Artifact' completed`r`n"



#######################################################
# 4) Remove Notebooks not in Artifact
#######################################################
Write-Output "==========================================="
Write-Output "4) Remove Notebooks not in Artifact"
Write-Output "==========================================="

# Check if artifact contains any notebooks
if (Test-Path -Path (Join-Path $ArtifactDirectory "notebook"))
{
    # Getting all Notebooks from Artifact
    $ArtifactNotebooks = Get-ChildItem -Path (Join-Path $ArtifactDirectory "notebook") -Filter "*.json" | Select-Object -ExpandProperty BaseName

    # Getting all Notebooks from Synapse
    $SynapseNotebooks = Get-AzSynapseNotebook -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name

    # Getting Notebooks from Synapse that are not in the Artifact
    $Notebooks = Compare-Object -ReferenceObject $ArtifactNotebooks -DifferenceObject $SynapseNotebooks | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
} else {
    # Fill collection with all existing notebooks to remove them 
    # because artifact doesn't contain any notebooks anymore
    Write-Output "Path not found in Artifact, removing all existing Notebooks in Synapse Workspace"
    $Notebooks = Get-AzSynapseNotebook -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
}    
Write-Output "Found $($Notebooks.Count) Synapse Notebooks that are not in the Artifact"

# Loop through collection of Notebooks to delete them
$Notebooks | ForEach-Object -process {
    Write-Output "Deleting Notebook $($_)"
    Remove-AzSynapseNotebook -Name $($_) -WorkspaceName $WorkspaceName -Force
}
Write-Output "Step 'Remove Notebooks not in Artifact' completed`r`n"



#######################################################
# 5) Remove SQL-Scripts not in Artifact
#######################################################
Write-Output "==========================================="
Write-Output "5) Remove SQL-Scripts not in Artifact"
Write-Output "==========================================="

# Check if artifact contains any SQL Scripts
if (Test-Path -Path (Join-Path $ArtifactDirectory "sqlscript"))
{
    # Getting all SQL-scripts from Artifact
    $ArtifactSQLScripts = Get-ChildItem -Path (Join-Path $ArtifactDirectory "sqlscript") -Filter "*.json" | Select-Object -ExpandProperty BaseName

    # Getting all SQL-scripts from Synapse
    $SynapseSQLScripts = Get-AzSynapseSqlScript -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name

    # Getting SQL-scripts from Synapse that are not in the Artifact
    $SQLScripts = Compare-Object -ReferenceObject $ArtifactSQLScripts -DifferenceObject $SynapseSQLScripts | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
} else {
    # Fill collection with all existing SQL Scripts to remove them 
    # because artifact doesn't contain any SQL Scripts anymore
    Write-Output "Path not found in Artifact, removing all existing SQL-Scripts in Synapse Workspace"
    $SQLScripts = Get-AzSynapseSqlScript -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
} 
Write-Output "Found $($SQLScripts.Count) Synapse SQL-Scripts that are not in the Artifact"

# Loop through collection of SQL scripts to delete them
$SQLScripts | ForEach-Object -Process {
    Write-Output "Deleting SQL-script $($_)"
    Remove-AzSynapseSqlScript -Name $($_) -WorkspaceName $WorkspaceName -Force
}
Write-Output "Step 'Remove SQL-Scripts not in Artifact' completed`r`n"



#######################################################
# 6) Remove Datasets not in Artifact
#######################################################
Write-Output "==========================================="
Write-Output "6) Remove Datasets not in Artifact"
Write-Output "==========================================="

# Check if artifact contains any datasets
if (Test-Path -Path (Join-Path $ArtifactDirectory "dataset"))
{
    # Getting all Datasets from Artifact
    $ArtifactDatasets = Get-ChildItem -Path (Join-Path $ArtifactDirectory "dataset") -Filter "*.json" | Select-Object -ExpandProperty BaseName

    # Getting all Datasets from Synapse
    $SynapseDatasets = Get-AzSynapseDataset -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name

    # Getting Datasets from Synapse that are not in the Artifact
    $Datasets = Compare-Object -ReferenceObject $ArtifactDatasets -DifferenceObject $SynapseDatasets | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
} else {
    # Fill collection with all existing Datasets to remove them 
    # because artifact doesn't contain any Datasets anymore
    Write-Output "Path not found in Artifact, removing all existing Datasets in Synapse Workspace"
    $Datasets = Get-AzSynapseDataset -WorkspaceName $WorkspaceName | Select-Object -ExpandProperty Name
}
Write-Output "Found $($Datasets.Count) Synapse Datasets that are not in the Artifact"

# Loop through collection of Datasets to delete them
$Datasets | ForEach-Object -process { 
    Write-Output "Deleting Dataset $($_)"
    Remove-AzSynapseDataset -Name $_ -WorkspaceName $WorkspaceName -Force
}
Write-Output "Step 'Remove Datasets not in Artifact' completed`r`n"



#######################################################
# 7) Remove Linked Services not in Artifact
#######################################################
Write-Output "==========================================="
Write-Output "7) Remove Linked Services not in Artifact"
Write-Output "==========================================="

# Check if artifact contains any Linked Services
if (Test-Path -Path (Join-Path $ArtifactDirectory "linkedService"))
{
    # Getting all Linked Services from Artifact
    $ArtifactLinkedServices = Get-ChildItem -Path (Join-Path $ArtifactDirectory "linkedService") -Filter "*.json" | Where-Object {($_ -NotLike "*WorkspaceDefaultSqlServer.json" -and  $_ -NotLike "*WorkspaceDefaultStorage.json") } | Select-Object -ExpandProperty BaseName
    
    # Getting all Linked Services from Synapse
    $SynapseLinkedServices = Get-AzSynapseLinkedService -WorkspaceName $WorkspaceName | Where-Object {($_.Name -ne $WorkspaceDefaultSqlServer -and  $_.Name -ne $WorkspaceDefaultSqlStorage) } | Select-Object -ExpandProperty Name

    # Getting Linked Services from Synapse that are not in the Artifact
    $LinkedServices = Compare-Object -ReferenceObject $ArtifactLinkedServices -DifferenceObject $SynapseLinkedServices | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
} else {
    # Fill collection with all existing Linked Services to remove them 
    # because artifact doesn't contain any Linked Services anymore
    Write-Output "Path not found in Artifact, removing all existing Linked Services in Synapse Workspace"
    $LinkedServices = Get-AzSynapseLinkedService -WorkspaceName $WorkspaceName | Where-Object {($_.Name -ne $WorkspaceDefaultSqlServer -and  $_.Name -ne $WorkspaceDefaultSqlStorage) } | Select-Object -ExpandProperty Name
}
Write-Output "Found $($LinkedServices.Count) Synapse Linked Services that are not in the Artifact"

# Trying to delete all linked services. If a linked service is still
# referenced by another linked service, the script continues removing
# other linked services before trying to remove it again...
# max 100 times. Example: Key Vault linked services
$depthCount = 0
while ($LinkedServices.Count -gt 0 -and $depthCount -lt 100)
{
     # Loop through collection of Linked Services and try to remove them
    $LinkedServices | ForEach-Object -process { 
        Write-Output "Trying to delete Linked Service $($_)"
        Remove-AzSynapseLinkedService -Name $_ -WorkspaceName $WorkspaceName -Force -ErrorAction Continue
    }

    # Wait 2 seconds before retry and raise retry counter
    Start-Sleep -Seconds 2
    $depthCount += 1

    # Check if artifact contains any Linked Services
    if (Test-Path -Path (Join-Path $ArtifactDirectory "linkedService"))
    {
        # Getting all Linked Services from Synapse
        $SynapseLinkedServices = Get-AzSynapseLinkedService -WorkspaceName $WorkspaceName | Where-Object {($_.Name -ne $WorkspaceDefaultSqlServer -and  $_.Name -ne $WorkspaceDefaultSqlStorage) } | Select-Object -ExpandProperty Name

        # Getting Linked Services from Synapse that are not in the Artifact
        $LinkedServices = Compare-Object -ReferenceObject $ArtifactLinkedServices -DifferenceObject $SynapseLinkedServices | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object -ExpandProperty InputObject
    } else {
        # Fill collection with all existing Linked Services to remove them 
        # because artifact doesn't contain any Linked Services anymore
        $LinkedServices = Get-AzSynapseLinkedService -WorkspaceName $WorkspaceName | Where-Object {($_.Name -ne $WorkspaceDefaultSqlServer -and  $_.Name -ne $WorkspaceDefaultSqlStorage) } | Select-Object -ExpandProperty Name
    }
    
    # Check if there are any Linked Services left for a retry
    if ($LinkedServices.count -gt 0)
    {
        Write-Output "Still found $($LinkedServices.Count) Synapse Linked Services that are not in the Artifact. Starting next iteration."
    } else {
        Write-Output "Deletion of Linked Services not in Artifact completed"
    }
}

# Error when you have circular references between Linked Services or just way too many levels
if ($depthCount -eq 100)
{
    throw "Too many levels of references to other Linked Services!"
}
Write-Output "Step 'Remove Linked Services not in Artifact' completed"
Write-Output "==========================================="


There is a check at the start of the script that verifies whether it can find some expected subfolders in the supplied artifact path. Otherwise it will assume that you don't have any datasets, pipelines, etc. and clean up your entire Synapse Workspace.
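That dummy protection can be summarized like this (a Python sketch of the same check; the folder names are the ones used in the script):

```python
import os

def check_artifact(artifact_dir: str) -> None:
    """Abort when none of the expected artifact subfolders exist,
    because then we are probably pointed at the wrong folder and
    must not wipe the whole workspace."""
    expected = ("integrationRuntime", "linkedService", "pipeline")
    if not any(os.path.isdir(os.path.join(artifact_dir, d)) for d in expected):
        raise RuntimeError("Dummy protection - probably not the right artifact folder")
```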

If nothing has to be cleaned up, the script finishes within seconds.
No cleanup necessary

If items do have to be removed, it still finishes in a few minutes instead of half an hour. Notice the iterations for cleaning up the pipelines: the script tries to delete a pipeline, but if it is still used by another pipeline it continues with the next pipeline and tries again afterwards. The same construction is used for Linked Services, where for example a Key Vault Linked Service can still be used by another Linked Service.
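That retry mechanism can be sketched as follows (hypothetical dependency data; the real script simply retries failed deletes until nothing is left or the 100-iteration limit is hit):

```python
# pipeline -> pipelines that still reference it (the callers must go first)
used_by = {"PL_Child": {"PL_Master"}, "PL_Master": set()}
to_delete = {"PL_Child", "PL_Master"}

rounds = 0
while to_delete and rounds < 100:
    for name in sorted(to_delete):
        # a delete only succeeds when no remaining pipeline references it
        if not (used_by[name] & to_delete):
            to_delete.discard(name)
    rounds += 1
print(rounds)  # 2: PL_Master goes in round 1, PL_Child in round 2
```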
Some cleanup was necessary

Conclusions
In this post you learned how to clean your Synapse Workspace much more efficiently by comparing the Workspace and the Artifact. This way you don't have to delete each item in your workspace, but only the ones you deleted during development.

Note that not all parts of Synapse are covered by this cleanup script. For example, KQL scripts and Dataflows are still missing; they will be added later on. If you want to add those yourself, make sure they are in the right place within the script: Dataflows should probably be removed before the pipelines, and the KQL scripts just before or after the SQL Scripts. Feel free to let us know if you have any improvements for this script that you would like to share with the community.

Special thanks to colleague Joan Zandijk for helping out.

Sunday, 26 March 2023

Synapse - Change Data Feed (CDF) on delta tables

Case
I like the data versioning of Delta Tables and I know how to get data from different versions, but how can I combine that in one query to get, for example, all the changes during a year to create a nice fact table of certain changes?
Change Data Feed in Synapse

Solution
Change Data Feed (CDF) is still fairly new. The currently supported Delta Lake version in the Synapse workspace is 2.2.0, which does not yet support CDF for SQL queries. This should become available in Delta Lake 2.3.0 according to the release documentation. Luckily, you can already use PySpark to get this information.
Current Delta Lake version in Synapse

1) Enable Change Data Feed
First you have to enable the Change Data Feed option on your Delta table. From that point in time you can use CDF. The property is called enableChangeDataFeed.

You can alter your existing tables with an Alter statement
%%sql

ALTER TABLE silver.Cities
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
For new Delta Tables you can also do this in the Create Table command.
%%sql

CREATE TABLE Silver.Cities (Id INT, City STRING, Population INT)
  TBLPROPERTIES (delta.enableChangeDataFeed = true);
And if you used the PySpark code from our previous post, then you can add an option just in front of the save.
sdf.write.format('delta').option("delta.enableChangeDataFeed", "true").save(delta_table_path)
To check whether it is enabled on your Delta Table you can use the following command.
%%sql

SHOW TBLPROPERTIES silver.cities
CDF is enabled

2) Check available data versions
Now that we have the Change Data Feed option available, let's check which data versions we have with the DESCRIBE HISTORY command. In the first example you will see that CDF was enabled after the table creation, in the second version (1). This means you cannot include the first version (0) in the CDF command.

You will get an error if you set the range wrong while getting CDF info:
AnalysisException: Error getting change data for range [0 , 4] as change data was not
recorded for version [0]. If you've enabled change data feed on this table,
use `DESCRIBE HISTORY` to see when it was first enabled.
Otherwise, to start recording change data, use `ALTER TABLE table_name SET TBLPROPERTIES
(delta.enableChangeDataFeed=true)`.
CDF available from version 1

In the second example it was enabled during the Delta table creation and therefore CDF is available from the first version (0).
CDF available from version 0

3) Query CDF data
When you query the CDF data you will get some extra columns:
  • _change_type: showing what action was taken to change the data - insert, update_preimage, update_postimage and delete
  • _commit_version: showing the version number of the data
  • _commit_timestamp: showing the timestamp of the data change
If you want particular version numbers when getting the data, you can use startingVersion and endingVersion as options while reading the data. Using only startingVersion is also permitted.
%%pyspark

df = spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 0) \
  .option("endingVersion", 5) \
  .table("Silver.cities")

display(df.sort("City","_commit_version"))
Filter CDF on version numbers

Probably more useful is querying date ranges; for that you can use startingTimestamp and endingTimestamp as options. Using only startingTimestamp is also permitted.
%%pyspark

df = spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingTimestamp", '2023-03-26 11:07:23.008') \
  .table("Silver.cities")

display(df.sort("City","_commit_version"))
Filter CDF on timestamps

If you want to use the _commit_timestamp column of the next record to create a new end_timestamp column in the current record, then you need to use the lead() window function (just like in T-SQL).
%%pyspark

from pyspark.sql.window import Window
from pyspark.sql.functions import lead 

df = spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingTimestamp", '2023-03-26 11:07:23.008') \
  .table("Silver.cities")

# Create window for lead
windowSpec  = Window.partitionBy("City").orderBy("_commit_version")

# Remove update_preimage records, add new column with Lead() and then sort
display(df.filter("_change_type != 'update_preimage'")
           .withColumn("_end_timestamp",lead("_commit_timestamp",1).over(windowSpec))
           .sort("City","_commit_version"))
Create end_timestamp with lead function

Conclusions
In this post you learned the basics of the Change Data Feed option in Synapse. This feature is available in Delta Lake 2.0.0 and above, but it is still in experimental support mode. For now you have to use PySpark instead of Spark SQL to query the change data in Synapse.

Besides creating nice fact tables that show data changes during a certain period, this feature could also be useful to incrementally load a large fact table with only the changes from the silver layer. Creating audit trails for data changes over time could also be an interesting option. The CDF option is probably most useful when there are not too many changes in a table.
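To illustrate that incremental pattern: from the change feed you would drop the 'update_preimage' rows and keep the latest commit per key, while 'delete' rows mark keys to remove downstream. A plain-Python sketch with hypothetical rows (a real implementation would do this in PySpark on the feed shown earlier):

```python
# (Id, _change_type, _commit_version) rows as returned by the change feed
feed = [
    (1, "insert", 1),
    (1, "update_preimage", 2),
    (1, "update_postimage", 2),
    (2, "insert", 1),
    (2, "delete", 3),
]

latest = {}
for key, change, version in feed:
    if change == "update_preimage":
        continue  # the old image is not needed for loading
    # a later commit version wins over an earlier one
    if key not in latest or version >= latest[key][1]:
        latest[key] = (change, version)

print(latest)  # {1: ('update_postimage', 2), 2: ('delete', 3)}
```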

In a later post, when Delta Lake 2.3.0 is available in Synapse, we will explain the Spark SQL options for CDF. Special thanks to colleagues Roelof Jonkers and Martijn Broeks for helping out.



Wednesday, 22 March 2023

DevOps - Get-SpnAccessToken is obsolete

Case
I'm deploying my SQL Server database via DevOps with the SqlAzureDacpacDeployment@1 task in YAML, but it gives me a warning: ##[warning]The command 'Get-SpnAccessToken' is obsolete. Use Get-AccessTokenMSAL instead. This will be removed. It still works, but the warning message is not very reassuring.
The command 'Get-SpnAccessToken' is obsolete.
Use Get-AccessTokenMSAL instead.
This will be removed

Solution
This warning message first appeared somewhere late 2022, and at the moment there is no new version of the dacpac deployment task available. When searching for this message it appears that other tasks, like AzureFileCopy@5, have the same issue. The word MSAL (Microsoft Authentication Library) in the message points to a newer way of acquiring security tokens.

To get more info you can run the pipeline in debug mode by enabling system diagnostics.
Enable system diagnostics

Then you will see a lot of extra messages, and right above the warning there is a message showing that USE_MSAL is empty and that its default value is false.
USE_MSAL

It is just a warning and Microsoft will probably solve it some day. If you want to get rid of it, you can set an environment variable called USE_MSAL to true within your pipeline. When set to true, the task uses MSAL instead of ADAL to obtain authentication tokens from the Microsoft Identity Platform. The easiest way to do this is writing one line of PowerShell code in a PowerShell task: ##vso[task.setvariable variable=USE_MSAL]true

###################################
# USE_MSAL to avoid warning
###################################
- powershell: |
    Write-Host "Setting USE_MSAL to true to force using MSAL instead of ADAL to obtain the authentication tokens."
    Write-Host "##vso[task.setvariable variable=USE_MSAL]true"
  displayName: '3 Set USE_MSAL to true'

###################################
# Deploy DacPac
###################################             
- task: SqlAzureDacpacDeployment@1
  displayName: '4 Deploy DacPac' 
  inputs:
    azureSubscription: '${{ parameters.ServiceConnection }}'
    AuthenticationType: 'servicePrincipal'
    ServerName: '${{ parameters.SqlServerName }}.database.windows.net'
    DatabaseName: '${{ parameters.SqlDatabaseName }}' 
    deployType: 'DacpacTask'
    DeploymentAction: 'Publish'
    DacpacFile: '$(Pipeline.Workspace)/SQL_Dacpac/SQL/${{ parameters.SqlProjectName }}/bin/debug/${{ parameters.SqlProjectName }}.dacpac'
    PublishProfile: '$(Pipeline.Workspace)/SQL_Dacpac/SQL/${{ parameters.SqlProjectName }}/${{ parameters.SqlProjectName }}.publish.xml'
    IpDetectionMethod: 'AutoDetect'

After this the warning no longer appears and your database still gets deployed. The extra step takes about a second to run.
Extra PowerShell Task


No more obsolete warnings

Conclusion
In this post you learned how to get rid of the annoying 'Get-SpnAccessToken is obsolete' warning by setting one environment variable to true. You should probably check in a few weeks/months whether this workaround is still necessary or whether a SqlAzureDacpacDeployment@2 version has arrived.