Microsoft BI Tools: JOOST

Showing posts with label JOOST. Show all posts

Saturday, 7 December 2024

Synapse snack: mount storage container in notebook

Case
I want to mount a Storage Account container in my notebook to write some files to it. Preferably without storage account keys.

Mount Storage Account in notebook

Solution
You can use the Synapse Spark Utils to mount a storage account container within a notebook. It requires having a Linked Service that points to your Storage Account which is not using a Self Hosted Integration Runtime.

For this we use the default Linked Services which ends with [workspacename]-WorkspaceDefaultStorage. By default this Linked Service is using the System-assinged managed identity to authenicate to the Storage Account connected to Synapse.

Since we might be using this in a DTAP environment we don't want to hardcode the Linked Service name. Instead we will retrieve the Synapse workspace name with code and concat "-WorkspaceDefaultStorage" to it.

Default Workspace Linked Service

Now when you want to debug your code in Synapse Studio it's best to also make sure your sparkpool is also running as managed idenity otherwise it will use your personal account. Also make sure that role Storage Blob Data Contributor is assigned.

Now the python code. There is a lot displaying in the code to show what is happening, but there are no checks and no error handling in the code. You might want to add that when going to production to make your code more robust and get easier error messages.

# Import necessary modules
from notebookutils import mssparkutils  # Utilities for Synapse notebooks
import re  # Module for regular expressions
import os  # Module for interacting with the operating system

# Parameters
container_name = "mycontainer"  # Name of the storage container
display("Container: " + container_name)

# Retrieve Synapse Workspace Name to get the Linked Service
workspace_name = mssparkutils.env.getWorkspaceName()
display("Synapse Workspace: " + workspace_name)

# Determine the default Linked Service name using the Workspace Name
linked_service_name = f"{workspace_name}-WorkspaceDefaultStorage"
display("Default Linked Service: " + linked_service_name)

# Retrieve the full connection string from the Linked Service to retrieve storage account
connection_string = mssparkutils.credentials.getFullConnectionString(linked_service_name)
display("Connection String: " + connection_string)

# Extract storage account name from the connection string using regex
# Expected format: url=https://{storage_account_name}.dfs.core.windows.net
storage_account_pattern = r'//([^\.]+)\.'
storage_account_name = re.search(storage_account_pattern, connection_string).group(1)
display("Storage Account: " + storage_account_name)

# Mount the Storage Account container via the Linked Service
m = mssparkutils.fs.mount(
    f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net",
    "/mymount",
    {"linkedService": linked_service_name}
)

# Get the local path of the mounted container folder
# Example path format: '/synfs/notebook/{job_id}/{mount_name}'
root_folder = mssparkutils.fs.getMountPath("/mymount")
display("Root Folder: " + root_folder)

# Create a file path for a new dummy file within the mounted folder
# Note that the subfolder should exists or add code to create it.
file_name = os.path.join(root_folder, "myfolder/dummy_file.txt")
display("Creating File: " + file_name)

# Content to be written to the new dummy file
file_content = "This is an example file created and saved using Python."

# Open the file in write mode and write the content to it
with open(file_name, 'w') as file:
    file.write(file_content)
display("File created")

# Clean up mount 
um = mssparkutils.fs.unmount("/mymount")

Now you will see the file appear in your storage account.

dummy file created in storage account

Conclusion

In this post we showed you how to use the managed idenity from a Linked Service to create a mount to the storage account container within a notebook. With this you can easily create or edit files in your Storage account when the pipeline doesn't provide the options you need.

Note: this code uses 'regular' Python and not PySpark which in some cases is unavoidable. However this also means that in Synapse this code is only running on the headnode. So all child nodes of your cluster are being totally useless and costing you money. For short processes that doesn't hurt that much, but for longer running jobs you should also consider alternatives like Azure Function Apps

Thanks to Jan Remko Yntema

Sunday, 22 September 2024

Synapse - Run all notebooks in a foreach loop

Case
I have several notebooks to create/fill Silver and Gold tables which I want to execute. Is their a way to execute several notebooks in a foreach loop? That would help me not having to create and maintain a pipeline with just lots of Notebook activities.

Notebook activity in FOREACH

Solution
The properties of the Notebook activity can be overridden with an expression in the Dynamic Content section. This means you can use a collection of notebook names and execute them within a Foreach activity.

The smart part of the solution is creating the collection of notebooks so that you can iterate it, but is also always up to date. This solution uses the Synapse rest API to get all notebooks from Synapse and then use a filter to only get a selection of all those notebook.

The hard part is that the rest API will return mulitple pages with notebooks if you have a lot of notebooks. The Entire solutions looks like this. Only the last part is for looping and executing the notebooks. The first few activities are all for retrieving the collection of notebooks. It perhaps looks a bit complex, but looks can be deceiving.

The solution

1) Parameter

This solution uses one string pipeline parameter to provide a folder path that we can use to filter out a selection of notebooks. Name of the parameter is NotebookFolderPath.

String parameter to get a selection of notebooks

2) Variables

The UNTIL loop to retrieve all notebooks uses three variables. The first is a String variable containing the URL of the rest API: SynapseRestAPI. The second and third are Array variables to store the response of the Rest API: Notebooks and Notebooks_Temp.

The pipeline variables to retrieve and store the notebooks

3) Determine Rest API URL

The Rest API will return multiple pages with details of all notebooks if you exceed the max number of notebooks per page. Therefore we will first determine the initial URL for the first page before the UNTIL loop. Lateron the URL of the successive pages is retrieved within the UNTIL loop by reading the response of the Rest API.

For the initial URL we need the name of the Synapse workspace. Since this can be retrieved via an expression, we can make the URL dynamic:

@concat(
    'https://'
    ,pipeline().DataFactory
    ,'.dev.azuresynapse.net/notebooks?api-version=2021-04-01'
    )

For this we will start with a Set Variable activity in front of the UNTIL loop.

Set Variable task to determine Rest API URL

4) Until loop

Within the UNTIL loop we use a WEB activity for the first REST API call. The UNTIL loop uses its output to to check whether it contains a property nextLink. If that property is present in the output then there is a next page and this property will contain the REST API URL for it.

The name of the WEB activity is web_GetNotebooks and therefor the expression of the UNTIL is:

@not(
    contains(
            activity('web_GetNotebooks').output
            ,'nextLink'
            )
    )

UNTIL loop

5) Web activity

The first activity in the UNTIL is the WEB activity. It's easier to add this activity to the UNTIL before writing the expression of the step above.
As mentioned above its name is web_GetNotebooks. The important settings are:

URL - @variables('SynapseRestAPI')
Method - GET
Authenication - System-assigned managed identity
Resource - https://dev.azuresynapse.net/

And make sure to select your SHIR in the advanced settings if you use one. Also make sure to give the Synapse workspace at least the Reader role to itself otherwise it can't use its own Rest API.

WEB activity for REST API

6) Union output to temp

Next we need two Set Variable activities in the UNTIL. With the first Set Variable activity we union the output of the Rest API with the value of the Notebooks array variable and store it in the Notebooks_Temp array variable. This has to be a two step task, hence the temp variable. The expression looks like:

@union(
    activity('WEB_GetNotebooks').output.value
    ,variables('Notebooks')
    )

The first iteration the Notebooks variable will still be empty, but for all next iterations it will be filled by the next activity.

Union output to temp array variable

7) Use temp to fill variable

In the previous activity we filled the Notebooks_Temp variable. In this next step we store the value of Notebooks_Temp in the main array variable Notebooks. Then we can use this value to union it in the next iteration.

Store temp value in main variable

8) Determine URL of next Rest API call

The last activity in the UNTIL is to check whether there a next page. If there is we will fill the string variable SynapseRestAPI with the URL and if not we will fill it with an empty string that will break the UNTIL loop.

@if(
    contains(
            activity('WEB_GetNotebooks').output
            , 'nextLink'
            )
    ,activity('WEB_GetNotebooks').output.nextLink
    ,''
    )

Retrieve URL of next page

9) Filter notebook array

The notebook array is now filled with all published notebooks. If you only want certain notebooks from a specific folder then we need to add a FILTER Activity. The path for filtering is retrieved from the pipeline parameter. The startswith will also retrieve notebooks from all subfolders. Replace with EQUALS if you don't want that:

@if(
    not(
        empty(item().properties.folder)
        )
    , startswith(
        item().properties.folder.name
        , pipeline().parameters.NotebookFolderPath
        )
    , false
    )

Filter notebooks by folder path

10) Foreach notebook

Now you can use the output of the FILTER Activity in the FOREACH loop. You should also tune the FOREACH settings to the available Spark node. If you use a small node then you probably won't run 20 notebooks at a time.

Foreach Notebook

11) Execute Notebook

Last step of the solution is executing all notebooks via the Notebook Activity. The notebook name property should be filled with an expression: @item().name

Execute all notebooks

The loop will ofcourse only work if the parameters and settings for each notebook are all equal. Otherwise you will have a lot of expression work to do making the solution probably to difficult to maintain.

Tip

In the User property tab of the Notebook activity you can add a few properties which you can show in the output making it very handy for debugging. In this case there is a NotebookName property with the expression @{item().name} and a FolderPath property with the expression @{item().properties.folder.name}.

Adding User Properties

Now when debugging your pipeline you can add those properies as columns. Click on the icon in the User Property column to see the available columns. This is useful for loops like UNTIL and FOREACH. Now you can forexample instantly see which notebook fails and which one succeeds without checking the input of each Notebook activity.

Showing user properties in Output window

Conclusion

This solution will make it easy to iterate through a whole bunch of notebooks without adding them one by one to a pipeline. You will have to use folders for your notebooks (or a certain notebook naming convention) if you don't want them all to be executed. In development this only works if you first publish the notebooks, because the Rest API only returns published/live notebooks. Don't forget to use the User Properties tip to make debugging a lot easier.

Saturday, 21 September 2024

Synapse - ReferenceError: Response is not defined

Case
My Synapse CICD deployment pipeline is failing with some kind of error pointing to a javascript file main.js: ReferenceError: Response is not defined

ReferenceError: Response is not defined

When using the Synapse Workspace Deployment addon for Azure Devops, it gives me an error (and it did work before) when trying to deploy a Synapase Analytics Workspace:

ReferenceError: Response is not defined

at Module../node_modules/@trident/data-access-artifact/src/lib/models/http.js (D:\a\1\s\downloads\main.js:360446:40)