Tuesday, 24 October 2023

Review Synapse notebooks with your GPT model

Case
Since the introduction of ChatGPT in late 2022, people have started to discover the various applications of Large Language Models (LLMs) in coding and beyond. LLM’s can be a helpful sparring partner in developing and reviewing code. As an IT consultant, I've started using LLMs in my coding practices. This made me wonder: ‘Can I automate the use of LLM in my development process?’. Which is Azure Synapse in my case.

And the short answer is: ‘yes, you can’


 

 

 





 



Solution
This blog serves as a step-by-step guide to integrating GPT 3.5 into your Azure DevOps pipeline for automated code reviews. The solution that I’m proposing checks which files have changed between the source branch and the target branch in a pull request. If one of these changed files is a Synapse notebook, the code is passed on the GPT model on your tenant with the request to provide feedback. The feedback given by the GPT model is posted as a comment in the pull request.

Before you start, make sure you have the following three resources set up and configured correctly. I’ll include links to other step-by-step guides to create the required resources.

Required resources:

*Access to the Microsoft Azure OpenAI service is limited at the time of writing.

1) Create a GPT model in Azure OpenAI service

Once you have gained access to the Azure OpenAI service you need to create an OpenAI service and deploy a model. Click here to follow the comprehensive guide by Christopher Tearpak on creating and deploying your GPT model in Azure. The result should be an OpenAI service resource:

Expected result after creating your own OpenAI Service










With a GPT model deployment:
Expected result after deploying your own GPT model











2) Setup and configure Azure DevOps

Scripts
Three scripts must be present in your repository. These scripts can be found at the SynapseBuildValidations repository (which also contains other useful build validation scripts).

Download the scripts from the SynapseBuildValidations repository. Here is what each script does:

  • get_GPT_feedback.py: Retrieves and passes your code to GPT and posts the feedback to the pull request. 
  • requirements.txt: Lists the required Python packages for the above script. 
  • GPT_code_review.yml: Contains the pipeline configuration.
Expected folder/file structure for your repository

 


 



 

Create variable group
Create a new library under “Pipelines” and give it the name: “GPT_connection_data”.
Add the following variables:

  • openai_api_base
  • openai_api_key
  • openai_api_type
  • openai_api_version
  • synapse_root_folder

The variables openai_api_base and openai_api_key can be found in the “Keys and Endpoint” tab in your OpenAI resource.

Find your Key and Endpoint in "Keys and Endpoint" of your OpenAI service








Copy the “Endpoint” to “Openai_api_base” and one of the keys to “openai_api_key”.
Openai_api_type needs to be filled with: “azure” and openai_api_version “2023-03-15-preview”.

The variable synapse_root_folder contains the path to the root folder containing the synapse files. In my case it’s just “Synapse”, because all the synapse files can be found in {repository_root}/Synapse
An example of my repository, where the synapse_root_folder is "Synapse"













After you’ve set all your variables, the resulting variable group should look like this:

Expected variable group in Azure DevOps












Create a pipeline

The GPT_code_review.yml contains the pipeline code needed to configure the agent and execute the get_GPT_feedback.py script. You need to create a new pipeline based on the GTP_code_review.yml.

Click here to follow the comprehensive guide by Xeladu on creating a pipeline.

The result should be as follows:

The resulting pipeline after following the guide




Disable “Override YAML continuous integration (CI) trigger”

Now you’ll need to disable the CI trigger of your new pipeline.
Open the pipeline in edit mode, select the three vertical dots and select “Trigger”
Select triggers to change the trigger settings














Then select your project under “Continuous integration”, check “Override the YAML CI trigger”, check disable CI and select “save”.

Steps to disable the CI trigger






Permit pipeline access to variable group resource

After you’ve disabled the CI trigger, you’ll need to start a run. During the run you’ll get a notification that the pipeline needs permission to access a resource. Click on the “View” button and permit the access to the variable group GPT_connection_data by clicking the “Permit” button. The run will continue and eventually fail.

Permit the pipeline to access the variable group







Note that this pipeline is designed to operate only within the context of a pull request. Because it’s dependent on a few system variables that are only present on the build agent during a pull request.

Set rights for build agent
The GPT feedback is posted in comments of the pull request. The build agent needs to have the right to post in the pull request. Go to “Project settings” and select “Repositories”, when you’re in “Repositories” select security and select your “{Projectname} Build Service“.
Steps to select the build service user




















Once the build service user is selected, you must grant this user permission to contribute to pull requests.

Set the "Contribute to pull request" to Allow





















After these actions, your build agent user has access to write comments to a pull request.

Add the pipeline as build validation
The final step is adding your GPT_Feedback_pipeline as build validation on your desired branch. Go to “Project settings”, select “Repositories” and select the repository where you want to add the pipeline (“DemoProject” in my example). With the repository select “Policies”
Steps to get to the branch you want to set the build validation to
















Select the branch where you want to have the build validation.
Select the branch you want to set the build validation to










Within this branch, select the plus icon in the build validation component.
Select the plus to add a pipeline as build validation






In the “Add build policy” pop-up, select the build pipeline: “GPT_Feedback_Pipeline”, set a display name and select “Save”

Pop-up to select a pipeline
as build validation
























Now you’re good to go! When you merge another branch into the branch on which you have enabled the branch validation the GPT_Feedback_pipeline will run.

3) Testing

Now its time to perform a pull request. You will see that the validation will first be queued. So this extra validation will take a little extra time, especially when you have a busy agent. The pipeline should always run without errors. When there is no feedback, there won’t be any comments. This means that your code is perfect :) or there aren’t any changes to any of the notebooks. However, when there is feedback, it will be posted in the comments.

Let’s see de build pipeline in action. First off, we need a new branch, in my case a feature branch
  1. Create a new branch in Synapse called Feature
  2. Create a new notebook
  3. Create some sample code in python or SQL
Sample code in Azure Synapse


















  1. Create a pull request based on your new branch
  2. Fill out the details of your pull request and make sure you’re merging into the branch with the GPT build validation
  3. Confirm the creation of your pull request.
  4. Your pull request is created and the GPT_Feedback pipeline starts running
  5. After the pipeline has run successfully and the GPT model gave feedback for improvement. The feedback is posted in the comments of the merge request
GPT response in the comments of your pull request
























4) Altering the feedback to your situation

The prompt sent to the GPT model is pre-determined and might not suit your specific situation. At the time of writing, the prompt is included in the get_GPT_feedback.py script. This script contains a get_gpt_response function. The first lines of this function are used to set three strings. These strings contain the prompt for the “system”, “user” and “assistant” roles passed to the GPT model. More on the use of these roles can be found here. To alter the prompts passed to the GPT model, you need to alter the strings: content_system_string, content_assistant_string and/or content_user_string.
Subset of the get_GPT_feedback.py where the GPT commands are set
















Conclusion
In this post you learned how to integrate GPT into your deployment pipeline. Your pipeline will provide feedback on your changed of added synapse notebooks. The feedback is posted as comments in the pull request in Azure DevOps. You can customize the prompt to suit your specific needs.

Wednesday, 28 June 2023

Synapse and ADF Pipeline snack - Disable Activity

Case
Something that we already had in SSIS, but was never implemented in Data Factory or Synapse... until now! You can disable Pipeline activities with Activity State (in preview at the moment of writing)
Disabling Pipeline Activities!
























Solution
In each activity (ADF and Synapse) you now have the option to change the state of an Activity from Active to Inactive and after that you can provice the end status that should be used when running this activity in the pipeline: Skipped, Succeeded or Failed.
Mark activity as Skipped, Succeeded or Failed






















This is extremly useful for debugging and testing parts of your pipeline. You can set it in the properties, but even more simpel is just right clicking the activity (just like in SSIS ♥).
Right click to change activity state





















An other great scenario is that you can now disable an activity that is not yet ready and validate the pipeline without getting any errors that required fields are not yet provided. Validating just ignores the disabled activities.

Validating ignores disabled activities














Conclusion
It only took almost 6 years, but this seemingly simple feature could be very useful. It is still in preview but available in Data Factory and Synapse pipelines.

Disable multiple activities at the same time





Friday, 5 May 2023

Synapse - Automatically check naming conventions

Case
We use naming conventions in Synapse, but sometimes it's just a lot of work to check if everything is correct. Is there a way to automatically check those naming conventions?
Automatically check Naming Conventions












Solution
You can use a PowerShell script to loop through the JSON files from Synapse that are stored in the repository. Prefixes for Linked Services, Datasets, Pipelines, Notebooks, Dataflows and scripts are easy to check since that is just a case of checking the filename of the  JSON file. If you also want to check activities or use the Linked Service or Dataset type to check the naming conventions then you also need to check the contents of those JSON files.

There are a lot of different naming conventions, but as long as you are committed to use one, it will make your Synapse workspace more readable and your logs easier to understand. In one glance you will immediately see which part of Synapse is causing the error. Above all it looks more professional and it show you took the extra effort to make it better.

However, having a lot of different naming conventions also makes it is hard to create one script to rule them all. We created a script to check the prefixes of all differents parts of Synapse and you can configure them in a JSON file. You can use this script to either run it once a while to occasionally check your workspace or run it as a Validation step for a pull request in Azure DevOps. Then it acts like a gatekeep that doesn't allow bad named items. You can make it a required step then you first have to solve the issues or make it an optional check and then you will get the result, but you can choose to ignore it. For new projects you should make it required and for big existing projects you should probably first make it optional for a while and then change it to required.

The PowerShell script, the YAML file and the JSON config example are stored in a public GitHub site. This allows us to easily improve the code for you and keep this blog post up-to-date. It also allows you to help out by doing suggestions or even to write some better code.


1) Folder structure repository
Just like the Synapse Deployment scripts we store these validation files in the Synapse respository. We have CICD and a SYN folder in the root. SYN contains the JSON files from the Synapse workspace. The CICD folder had three sub folders: JSON (for the config), PowerShell (for the actual script) and YAML (for the pipeline that is required for validation).

Download the files from the Github Repository and store these in your own Repository according the structure described above. If you use a different structure you have to change the paths in the YAML file.
Repository structure
























2) Create pipeline
Now create a new pipeline with the existing YAML file called NamingValidation4Synapse.yml. Make sure the paths in the YAML are following the folder structure from step 1 or change it to your own structure.

If you are not sure about the folders then there is a treeview step in the YAML that will show you the structure of the agent. Just continue with the next steps and after the first run check the result of the treeview step and change the paths in the YAML and run again. You can remove or comment-out the treeview step when everything works.

This example uses the Azure DevOps respository with the following steps:
  • Go to pipelines and create a new pipeline
  • Select the Azure Repos Git
  • Select the Synapse repository
  • Choose Existing Azure Pipelines YAML file
  • Choose the right branch and select the YAML file under path
  • Save it and rename it because the default name is equals to the repos name
Create new pipeline with existing YAML file












3) Branch validation
Now that we have the new YAML pipeline, we can use it as a Build Validation in the branch policies. The example shows how to add them in an Azure DevOps repository.
  • In DevOps go to Repos in the left menu.
  • Then click branches to get all branches.
  • Now hover you mouse above the first branch and click on the 3 vertical dots.
  • Click Branch policies
  • Click on the + button in the Build Validation section.
  • Select the new pipeline created in step 2 (optionally change the Display name)
  • Choose the Policy requirement (Required or Optional)
  • Click on the Save button
Repeat these steps for all branches where you need the extra check. Don't add them on feature branches because it will also prevent you doing manual changes in these branches.
All branches








Required or Optional
















4) Testing
Now its time to perform a pull request. You will see that the validation will first be queued. So this extra validation will take a little extra time, especially when you have a busy agent. However you can just continue working and wait for the approval or even press Set auto-complete to automatically complete the Pull Request when all approvals and validations are validated. However don't auto-complete if you chose to make it an optional validation because then it will cancel it as soon as all other validations are ready.

As soon as the validation is ready it will show you the first couple if errors.
Required Naming Convention Validation failed











You can click on those couple of errors to see the total number of errors and the first ten errors.
Get error count and first 10 errors

















And you can click on one of those errors to see the entire output including all errors and all correct items.

















And at the bottom you will find a summary with the total number of errors and percentages per Synapse part.
Summary example










Summary example











Or download your Synapse JSON files and run the PowerShell script locally with for example Visual Studio Code.
Running Naming Validations in VCode














Conclusion
In this post you learned how you could automate your naming conventions check. This helps/forces your team to consistantly use the naming conventions without you being some kind of nitpicking police officer. Everybody can just blame DevOps/themselves.

You can combine this validation with for example the branch validation to prevent accidentily choosing the wrong branch in a Pull Request.

Please submit your suggestions under Issues (and then bug report of feature request) in the GitHub site or drop it in the comments below.




Friday, 28 April 2023

DevOps - Build Validation to check branches

Case
We sometimes mess up when creating pull requests in Azure DevOps by accidentally selecting the wrong source or target branch. These mistakes caused unwantend situations where stuff got promoted to the wrong environment to early. Resolving those mistakes often take a lot of time. Is there a way to prevent these easily made mistakes?
Build Validation for Pull Requests
























Solution
The first option is to setup a four-eyes principle where someone else has to approve your work including choosing the right branches for your pull request. This can be done by setting the branch policies. Note that you need to do this for all your branches (excluding feature/personal branches).
Setting minimum number of reviewers for changes




















Onother option that you could use is build validations where you compare the source branch with the target branch in a script. Even better is to combine these two options, but if you (sometimes) work alone then this could be an alternative.

Note that this solution only works if you have at lease two branches and that you are also using feature branches.

1) Create extra YAML file
Add an extra YAML file called ValidatePullRequest.yml in your your current YAML folder.
Pipeline for validating pull request















The code of the new file can be found below, but you have to change two things in this code. The first thing is the names from you branches in order (on line 14). If you only have a main and a development branch then it will be [String[]]$branches = "main", "development". The second item is the name/wildcard of hotfix/bugfix branches that will need to be ignored by this check (on line 17). So if your bugfix branches always contain the word hotfix then it will become [String]$fixBranch = "hotfix". The script uses a like to compare the name.

trigger: none
steps:
- task: PowerShell@2
  displayName: 'Validate branches in Pull Request'
  inputs:
    targetType: 'inline'
    script: | 
        #######################################################################
        # PARAMETERS FOR SCRIPT
        #######################################################################
        # All branches in order. You can only do a pull request one up or down.
        # Feature and / or personal branches can only be pulled to or from the
        # latest branch in the row.
        [String[]]$branches = "main", "acceptance", "test", "development", "sprint"

        # Bugfix or hotfix branches can be pulled to and from all other branches
        [String]$fixBranch = "bugfix"


        #######################################################################
        # DO NOT CHANGE CODE BELOW
        #######################################################################
        $SourceBranchName = "$(System.PullRequest.SourceBranch)".toLower().Replace("refs/heads/", "") # sourceBranchName does not exist
        $TargetBranchName = "$(System.PullRequest.targetBranchName)"

        function getBranchNumber
        {
            <#
                .SYNOPSIS
                Get the order number of the branch by looping through all branches and checking then branchname
                .PARAMETER BranchName
                Name of the branch you want to check
                .EXAMPLE
                getBranchNumber -BranchName "myBranch"
            #>
            param (
                [string]$BranchName
            )
            # Loop through branches array to find a specific branchname
            for ($i = 0; $i -lt $branches.count; $i++)
            {
                # Find specific branchname
                if ($branches[$i] -eq $BranchName)
                {
                    # Return branch order number
                    # (one-based instead if zero-based)
                    return $i + 1
                }
            }
            # Unknown branch = feature branch
            return $branches.count + 1
        }


        # Retrieve branch order
        $SourceBranchId = getBranchNumber($SourceBranchName)
        $TargetBranchId = getBranchNumber($TargetBranchName)

        # Show extra information to check the outcome of the check below
        Write-Host "All branches in order: [$($branches -join "] <-> [")] <-> [feature branches]."
        Write-Host "Checking pull request from $($SourceBranchName) [$($SourceBranchId)] to $($TargetBranchName) [$($TargetBranchId)]."

        if ($SourceBranchName -like "*$($fixBranch)*")
        {
            # Pull request for bugbix branches are unrestricted
            Write-Host "Pull request for Bugfix or hotfix branches are unrestricted."
            exit 0
        }
        elseif ([math]::abs($SourceBranchId-$TargetBranchId) -le 1)
        {
            # Not skipping branches or going from feature branch to feature branch 
            Write-Host "Pull request is valid."
            exit 0
        }
        else
        {
            # Invallid pull request that skips one or more branches
            Write-Host "##vso[task.logissue type=error]Pull request is invalid. Skipping branches is not allowed."
            exit 1
        }

In this GitHub repository you will find the latest version of the code

Note 1: The PowerShell script within the YAML either returns 0 (success) of 1 (failure) and that the Write-Host contains a little ##vso block. This allows you to write errors that will show up in the logging of DevOps.

Note 2: The script is fairly flexible in the number of branches that you want use, but it does require a standard order of branches. All unknown branches are considered being feature branches. If you have a naming conventions for feature branches then you could refine the IF statement to also validate those naming conventions to annoy/educate your co-workers even more.

2) Create new pipeline
Now create a new pipeline based on the newly created YAML file from step 1.
  • Go to Pipelines in Azure DevOps.
  • Click on the New Pipeline button to create the new pipeline.
  • Choose the repos type (Azure DevOps Git in our example)
  • Select the right repositiory if you have mulitple repos
  • Select Existing Azure Pipelines YAML file.
  • Select the branch and then the new YAML file under Path.
  • Save it
  • Optionally rename the Pipeline name to give it a more readable name (often the repository name will be used as a default name)














3) Add Build Validation to branch
Now that we have the new YAML pipeline, we can use it as a Build Validation. The example shows how to add them in an Azure DevOps repository.
  • In DevOps go to Repos in the left menu.
  • Then click branches to get all branches.
  • Now hover you mouse above the first branch and click on the 3 vertical dots.
  • Click Branch policies
  • Click on the + button in the Build Validation section.
  • Select the new pipeline created in step 2 (optionally change the Display name)
  • Click on the Save button
Repeat these steps for all branches where you need the extra check. Don't add them on feature branches because it will also prevent you doing manual changes in these branches.
Adding the Build Validation as a Branch Policy






















4) Testing
Now its time to perform a 'valid' and an 'invalid' pull request. You will see that the validation will first be queued. So this extra validation will take a little extra time, especially when you have a busy agent. However you can just continue working and wait for the approval or even press Set auto-complete to automatically complete the Pull Request when all approvals and validations are validated.

If the validation fails you will see the error message and then you can click on it to also find the two regular Write-Host lines (60 and 61) with the list of branches and the branch names with there branch order number. This should help you to figure out what mistake your made.
Invalid pull request
























For the 'legal' pull request you will first see the validation being queued and then you will see the green circle with the check mark. You can click on it to see why this was a valid Pull request.
Valid pull request


















Conclusions
In this post you learned how to create Build Validations for your Pull Requests in Azure DevOps. In this case a simple script to compare the Source and Target branch from a Pull Request. In a second example, coming online soon, we will validate the naming conventions within a Synapse Workspace (or ADF) during a Pull Request. This will prevent your colleagues (you are of course always following the rules) from not following the naming conventions because the Pull Request will automatically fail.
Sneak preview














Special thanks to colleague Joan Zandijk for creating the initial version of the YAML and Powershell script for the branch check.

Wednesday, 5 April 2023

Synapse snack - Increment counter variable

Case
I want to increment a variable to count some actions. For example to break a until loop after 4 times. What is the Synapse (or ADF) equivalent of counter++ or counter = counter + 1 because you can not self reference in a variable expression.
The expression contains self referencing variable.
A variable cannot reference itself in the expression.



















Solution
Unfortunately you can not self reference a variable in an expression of a Set variable activity. If you want to increment a variable called Counter then you can't use an expresion like this:
@add(int(variables('Counter')), 1)
1) Add extra variable
The trick is to use two variables (and two Set Variable activities). In this example we have besides the Counter variable also a second 'helper' variable called NewCounterValue. The type is String for both since there isn't an integer type available.
Two variables to increment a variable value















2) Add first Set variable
Now add two Set variable activities to the canvas of your pipeline. In this case inside an UNTIL loop. The first activity will set the value of the helper variable called NewCounterValue. It retrieves the value of the original Counter variable, converts it to an Integer and then adds 1. After that it is converted back to a String:
@string(
    add(
        int(variables('Counter')),
        1
    )
)
Increment variable part 1










3) Add second Set variable
The second activity just retrieves the value of the new helper variable and uses it to set the value of the Counter variable. No need to cast the variable to an other type.
Increment variable part 2















4) Until loop
Now lets test it in an until loop where it stops the until loop after 4 times.
The Until loop









The Until loop in action



















Conclusions
In this post we showed you how to overcome a little shortcoming in Synapse and ADF. It's a handy contruction to do something max X times. For example checking the status max 4 times. The Until stops whenever the status changes or when the retry counter is 4 or higher.
@or(
    greaterOrEquals(int(variables('Counter')),4)
,   not(equals(activity('WEB_Check_Status').output.status,'PENDING'))
)
Case Example for retry