Microsoft BI Tools

Monday, 11 January 2021

Scaling Azure Dedicated SQL Pools from ADF

Case
Is there a solution to upscale and downscale my Azure Dedicated SQL Pool from the Azure Data Factory pipeline without scripting? I know there are PowerShell solutions, but I would rather use a no-code solution. What are my options?

Scaling Azure Dedicated SQL Pools

Solution
Fortunately you can now use the REST APIs of Azure Dedicated SQL Pools (formerly known as Azure SQL Data Warehouse and for a short period as Azure Synapse Analytics) to scale the compute up or down, so no coding is required.

1) Give ADF Access to SQL Pool
To call the REST API we need to give ADF access to the SQL Pool or, more specifically, to the SQL Server hosting that SQL Pool. We need a role that can change the database settings, but nothing security related: Contributor, SQL DB Contributor or SQL Server Contributor.

Go to the Azure SQL Server of the SQL Pool that you want to scale up or down with ADF
In the left menu click on Access control (IAM)
Click on Add, Add role assignment
In the 'Role' drop down select 'SQL DB Contributor'
In the 'Assign access to' drop down select Data Factory
Search for your Data Factory, select it and click on Save

Grant data factory SQL DB Contributor role to SQL Server

If you forget this step then you will receive an authorization error while executing your ADF pipeline.

2108 Authorization Failed

{"error":
{"code":"AuthorizationFailed"
,"message":"The client 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Sql/servers/databases/resume/action' over scope '/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/RG_bitools/providers/Microsoft.Sql/servers/SQL_bitools/databases/bitools' or the scope is invalid. If access was recently granted, please refresh your credentials."}
}

2) Determine URL

Now it is almost time to edit your ADF pipeline. The first step is adding a Web activity to call the REST API, but before we can do that we need to determine the URL of this API, which you can find here.

Scaling

https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}?api-version=2014-04-01-preview

Within this URL you need to replace all parts that start and end with a curly bracket: {subscription-id}, {resource-group-name}, {server-name} and {database-name} (including the brackets themselves). Don't use a URL (bitools.database.windows.net) for the database server name, but use only the name: bitools.

Example URL

https://management.azure.com/subscriptions/aaaa-bbbb-1234-cccc/resourceGroups/RG_Bitools/providers/Microsoft.Sql/servers/bitools2/databases/bitools?api-version=2014-04-01-preview
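The replacement of the placeholders can be sketched in a few lines of Python. This is only an illustration outside ADF, not part of the pipeline itself. The function name build_scale_url and the check on the server name are our own assumptions; the URL layout and the api-version come from the template above.

```python
from urllib.parse import quote

API_VERSION = "2014-04-01-preview"


def build_scale_url(subscription_id, resource_group, server_name, database_name):
    """Fill in the placeholders of the scaling URL (hypothetical helper)."""
    # Guard against using the full host name instead of the short server name.
    if "." in server_name:
        raise ValueError("Use only the server name, for example 'bitools'")
    return (
        "https://management.azure.com"
        f"/subscriptions/{quote(subscription_id)}"
        f"/resourceGroups/{quote(resource_group)}"
        f"/providers/Microsoft.Sql/servers/{quote(server_name)}"
        f"/databases/{quote(database_name)}"
        f"?api-version={API_VERSION}"
    )


url = build_scale_url("aaaa-bbbb-1234-cccc", "RG_Bitools", "bitools2", "bitools")
print(url)
```

Passing bitools2.database.windows.net as server name raises a ValueError in this sketch, which mirrors the warning above about not using the full URL of the server.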


3) JSON message for Rest API

The REST API above expects a JSON message with the pricing tier. A list of all pricing tiers can be found here in the column 'Data warehouse units'. Here are two examples which you need to adjust to your requirements:

{
    "properties": {
        "requestedServiceObjectiveName": "DW200c"
    }
}

{
    "properties": {
        "requestedServiceObjectiveName": "DW1000c"
    }
}

Note: the JSON example in the documentation was incorrect at the moment of writing, so you might run into one of these errors:

The quotation marks around the data warehouse units are missing, which returns the following error:
{"error":{"code":"InvalidRequestContent","message":"The request content was invalid and could not be deserialized: 'Unexpected character encountered while parsing value: D. Path 'properties.requestedServiceObjectiveName', line 3, position 41.'."}}
The c is missing after the data warehouse units (Gen1 vs Gen2), which returns the following error:
{"code":"45122","message":"\u0027Azure SQL Data Warehouse Gen1 has been deprecated in this region. Please use SQL Analytics in Azure Synapse.\u0027","target":null,"details":[{"code":"45122","message":"\u0027Azure SQL Data Warehouse Gen1 has been deprecated in this region. Please use SQL Analytics in Azure Synapse.\u0027","target":null,"severity":"16"}],"innererror":[]}
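Both errors can be avoided by generating the message instead of typing it. Below is a minimal Python sketch that builds the JSON body and rejects a value without the trailing c. The function name and the pattern DW, digits, c are our own assumptions based on the two examples above, not an official list of valid tiers.

```python
import json
import re

# Assumption: Gen2 service objectives look like DW200c or DW1000c.
GEN2_PATTERN = re.compile(r"^DW\d+c$")


def build_scale_body(service_objective):
    """Return the JSON message for the scaling call (hypothetical helper)."""
    if not GEN2_PATTERN.match(service_objective):
        raise ValueError(
            f"'{service_objective}' does not look like a Gen2 tier such as DW200c"
        )
    # json.dumps adds the quotation marks around the value for us.
    message = {"properties": {"requestedServiceObjectiveName": service_objective}}
    return json.dumps(message, indent=4)


print(build_scale_body("DW200c"))
```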

4) Add Web Activity
To call the REST API we will use the Web activity in the ADF pipeline. The actual REST API call is synchronous, which means it waits for the scale action to finish before it returns a message. This also means that you don't have to build any checks to make sure the SQL Pool is online again.

Add a Web activity to your pipeline and give it a suitable name
Go to the Settings tab and use the URL from step 2 in the URL property
Choose PATCH as method
Add a new header with the name 'Content-Type' and the value 'application/json'
Fill in the JSON message from step 3 as body
Choose MSI as authentication method
As the last step enter this URL https://management.azure.com/ as Resource

Use a Web activity to call the Rest API

If you want to scale up before your ETL/ELT process and scale down afterwards, then you need two separate Web activities or one clever child pipeline with parameters that you execute from your main pipeline.
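The idea of one parameterized child pipeline can be illustrated with a small Python sketch: the settings of the Web activity stay the same and only the pricing tier changes. The function and the dictionary keys are hypothetical names for the settings listed above, not the actual ADF pipeline JSON, and the URL is the example URL from step 2.

```python
import json

# Example URL from step 2 (replace with your own values).
SCALE_URL = (
    "https://management.azure.com/subscriptions/aaaa-bbbb-1234-cccc"
    "/resourceGroups/RG_Bitools/providers/Microsoft.Sql/servers/bitools2"
    "/databases/bitools?api-version=2014-04-01-preview"
)


def web_activity_settings(url, service_objective):
    """Collect the Web activity settings for one pricing tier (hypothetical)."""
    body = {"properties": {"requestedServiceObjectiveName": service_objective}}
    return {
        "url": url,
        "method": "PATCH",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
        "authentication": "MSI",
        "resource": "https://management.azure.com/",
    }


# Same settings twice, only the parameter differs.
scale_up = web_activity_settings(SCALE_URL, "DW1000c")      # before the ETL/ELT
scale_down = web_activity_settings(SCALE_URL, "DW200c")     # after the ETL/ELT
```

In ADF itself the service_objective argument would be a pipeline parameter of the child pipeline that the main pipeline fills in twice.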

Summary

In this post you learned how to upscale and downscale your Dedicated SQL Pool to save some money on your Azure bill without writing any code. Note that at the moment of writing live scaling is not yet available and that you will lose the connection to your Dedicated SQL Pool for a couple of minutes.

Also note that ADF pipelines differ slightly from Azure Synapse Analytics pipelines. So if you consider switching to Synapse workspaces because you already use Dedicated SQL Pools, then you have to make some small adjustments to this specific task, which will be described in a future post. In another post we already showed how to pause and resume your Azure SQL Pools from within ADF.

Wednesday, 6 January 2021

Pausing and resuming Dedicated SQL Pools from ADF

Case
Is there a solution to pause and resume an Azure Dedicated SQL Pool from the Azure Data Factory pipeline without scripting? I know there are PowerShell solutions, but I would rather use a no-code solution. What are my options?


Pause and resume Azure Dedicated SQL Pools

Solution
Luckily you can now use the REST APIs of Azure Dedicated SQL Pools (formerly known as Azure SQL Data Warehouse and for a short period as Azure Synapse Analytics) to pause or resume the compute.

1) Give ADF Access to SQL Pool

Go to the Azure SQL Server of the SQL Pool that you want to pause or resume with ADF
In the left menu click on Access control (IAM)
Click on Add, Add role assignment
In the 'Role' drop down select 'SQL DB Contributor'
In the 'Assign access to' drop down select Data Factory
Search for your Data Factory, select it and click on Save

Grant data factory SQL DB Contributor role to SQL Server

If you forget this step then you will receive an authorization error while executing your ADF pipeline.

2108 Authorization Failed

{"error":
{"code":"AuthorizationFailed"
,"message":"The client 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Sql/servers/databases/resume/action' over scope '/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/RG_bitools/providers/Microsoft.Sql/servers/SQL_bitools/databases/bitools' or the scope is invalid. If access was recently granted, please refresh your credentials."}
}

2) Determine URL

Pause compute

Resume compute

Within these URLs you need to replace all parts that start and end with a curly bracket: {subscription-id}, {resource-group-name}, {server-name} and {database-name} (including the brackets themselves). Don't use a URL (bitools.database.windows.net) for the database server name, but use only the name: bitools.

Example URL

https://management.azure.com/subscriptions/aaaa-bbbb-1234-cccc/resourceGroups/RG_Bitools/providers/Microsoft.Sql/servers/bitools2/databases/bitools/resume?api-version=2014-04-01-preview


3) Add Web Activity
To call the REST API we will use the Web activity in the ADF pipeline. The actual REST API call is synchronous, which means it waits for the pause or resume action to finish before it returns a message. This also means that you don't have to build any checks to make sure the SQL Pool is already online.

Add a Web activity to your pipeline and give it a suitable name
Go to the Settings tab and use the URL from the previous step in the URL property
Choose POST as method
Fill in {} as body (we don't need it, but it is required)
Choose MSI as authentication method
As the last step enter this URL https://management.azure.com/ as Resource

Use a Web activity to call the Rest API

You could also first check the current status via the Check database state REST API and then use an expression like @activity('Get Status').output.properties.status to retrieve the current state.
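To show what that expression does, here is a minimal Python sketch that reads the same path from a sample response. The sample output and the status value 'Paused' are assumptions for illustration only; check the real output of your 'Get Status' activity in the debug pane before relying on specific values.

```python
# Hypothetical output of a 'Get Status' Web activity (heavily shortened).
sample_output = {
    "name": "bitools",
    "properties": {"status": "Paused"},
}


def pool_status(activity_output):
    """Same path as @activity('Get Status').output.properties.status."""
    return activity_output["properties"]["status"]


def should_resume(activity_output):
    """Only call the resume API when the pool is reported as paused."""
    return pool_status(activity_output) == "Paused"


print(pool_status(sample_output), should_resume(sample_output))
```

In the pipeline itself the same check would typically sit in an If Condition activity in front of the Web activity that resumes the pool.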

Summary

In this post you learned how to pause and resume your Dedicated SQL Pool to save some money on your Azure bill without writing any code. Note that you only pause the compute and that you still have to pay for the storage of the SQL Pool.

Also note that ADF pipelines differ slightly from Azure Synapse Analytics pipelines. So if you consider switching to Synapse workspaces because you already use Dedicated SQL Pools, then you have to make some small adjustments to this specific task, which will be described in a future post. In another post we will also show how to scale your Azure SQL Pools from within ADF.

Monday, 4 January 2021

Power Apps Snack: Don't repeat yourself (again)

Case
I have some pieces of code in my Power Apps formula that get repeated multiple times. I don't want to repeat that same piece of code, not only because that is dull, but also because it's very maintenance-sensitive: a simple change in the formula needs to be made in all repeated pieces of code. Is there a better solution?

Use the With function to not repeat yourself

Solution
Yes, there is a better solution than repeating the same piece of code over and over. Last year we showed you an option with a custom 'method', but there is also the With function, which you may recognize from the WITH clause in other languages like T-SQL.

For this example we will revisit an old blogpost to get all selected items from a combobox in a label:

Multiselect ComboBox in Power Apps

The basic solution is using a concat and concatenate function which looks like:

Concat(
    ComboBox_multiselect.SelectedItems.ProductKey,
    Concatenate(
        Text(ProductKey),
        ","
    )
)

This will result in a string like "PrdKey1,PrdKey2,PrdKey3,". If you want to get rid of the last comma you can use Left: Left(mystring, Len(mystring) - 1). This is where the code repetition starts:

Left(
    Concat(
        ComboBox_multiselect.SelectedItems.ProductKey,
        Concatenate(
            Text(ProductKey),
            ","
        )
    ),
    Len(
        Concat(
            ComboBox_multiselect.SelectedItems.ProductKey,
            Concatenate(
                Text(ProductKey),
                ","
            )
        )
    ) - 1
)

The trick is to store the result of the code that needs to be repeated in a 'variable' called myProducts with the With function. After that you can use that 'variable' so that you don't need to repeat yourself:

With(
    {
        myProducts: Concat(
            ComboBox_multiselect.SelectedItems.ProductKey,
            Concatenate(
                Text(ProductKey),
                ","
            )
        )
    },
    Left(
        myProducts,
        Len(myProducts) - 1
    )
)
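For readers who are more at home in a general-purpose language, the same idea can be sketched in Python: compute the value once, give it a name and reuse that name. This is only an analogy for the With function, not Power Apps code, and all function names below are made up for the illustration.

```python
def concat_keys(items):
    """Mimic the Concat/Concatenate part: every key followed by a comma."""
    return "".join(f"{item['ProductKey']}," for item in items)


def keys_repeated(items):
    """Repeats the expression, like the Left/Len version."""
    return concat_keys(items)[: len(concat_keys(items)) - 1]


def keys_named(items):
    """Stores the result once, like the With version."""
    my_products = concat_keys(items)
    return my_products[: len(my_products) - 1]


selected_items = [{"ProductKey": 1}, {"ProductKey": 2}, {"ProductKey": 3}]
print(keys_named(selected_items))
```

Both functions return the same string, but a change to the concatenation only has to be made in one place in the second version.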

Conclusion
In this post you learned how NOT to repeat yourself by using the With function in Power Apps. Perhaps the code reduction of this basic example is not that big, but we all know that those small pieces of code can quickly get out of control.

Tuesday, 29 December 2020

ADF Snack: get files of last hour

Case
I have a Data Factory pipeline that should run each hour and collect all new files added to the data lake since the last run. What is the best activity or do we need to write code?

No scripts, no loops

Solution
The Copy Data activity has a wildcard filter which allows you to read multiple files (of the same type/format) at once, so there is no need for a ForEach activity to loop over the files. Combine that with the start and end datetime filter options within that same Copy Data activity and you can limit the files to a certain period.

Date filter

The End datetime property will be populated with the start-datetime of the current ADF pipeline. So files added during the run of the pipeline will be skipped and processed during the next run. This End datetime will also be stored afterwards for the next run.

The Start datetime will be retrieved from a database table. The previous run of the pipeline stored its End datetime as the Start datetime for the next run.
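The effect of these two datetimes can be sketched in Python with a handful of sample files. The file names and timestamps are made up, and the sketch assumes that the filter includes the start datetime and excludes the end datetime, so that a file modified exactly on the boundary is picked up by the next run only.

```python
from datetime import datetime, timezone


def files_in_window(files, start, end):
    """Return the names of files with start <= last modified < end."""
    return [name for name, modified in files.items() if start <= modified < end]


def utc(hour, minute):
    """Helper for the sample data: a time on 29 December 2020 (UTC)."""
    return datetime(2020, 12, 29, hour, minute, tzinfo=timezone.utc)


# Hypothetical files in the data lake with their last modified time.
files = {
    "a.csv": utc(9, 30),
    "b.csv": utc(10, 0),
    "c.csv": utc(10, 45),
    "d.csv": utc(11, 0),
}

default_start = datetime(1900, 1, 1, tzinfo=timezone.utc)
first_run = files_in_window(files, default_start, utc(10, 0))
second_run = files_in_window(files, utc(10, 0), utc(11, 0))
print(first_run, second_run)
```

Because the end of one run is the start of the next, every file lands in exactly one window: d.csv is skipped by the second run and will be processed by the run after that.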

The basic setup

1) Table and stored procedures

To store (and retrieve) the datetime from the pipeline we use a database table and two stored procedures. To keep it easy we kept it very basic. Feel free to extend it to your own needs. In this solution there will be only one record per source. The SetLastRun stored procedure will either insert a new record or update the existing record via a MERGE statement. The GetLastRun stored procedure will retrieve the datetime of the last run and return a default date if there is no record available.

-- Create runs table
CREATE TABLE [dbo].[Runs](
	[SourceName] [nvarchar](50) NOT NULL,
	[LastRun] [datetime2](7) NULL,
	CONSTRAINT [PK_Runs] PRIMARY KEY CLUSTERED 
	(
	[SourceName] ASC
	)
)
GO

-- Save the lastrundate
CREATE PROCEDURE SetLastRun
(
    @SourceName as nvarchar(50)
,	@LastRun as datetime2(7)
)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON

	-- Check if there already is a record
	-- Then insert or update the record
    MERGE dbo.Runs AS [Target]
    USING (SELECT @SourceName, @LastRun) AS [Source] ([SourceName], [LastRun])  
    ON ([Target].[SourceName] = [Source].[SourceName])  
    WHEN MATCHED THEN
        UPDATE SET [LastRun] = [Source].[LastRun]  
    WHEN NOT MATCHED THEN  
        INSERT ([SourceName], [LastRun])  
        VALUES ([Source].[SourceName], [Source].[LastRun]);  
END
GO

-- Retrieve the lastrundate
CREATE PROCEDURE GetLastRun
(
    @SourceName as nvarchar(50)
)
AS
BEGIN
	DECLARE @DefaultDate as datetime2(7) = '1900-01-01'

	-- Retrieve lastrun and provide default date in case null
	SELECT	ISNULL(MAX([LastRun]), @DefaultDate) as [LastRun]
	FROM	[Runs] 
	WHERE	[SourceName] = @SourceName
END
GO
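If you want to try the behaviour of the two stored procedures without a SQL Server at hand, the logic can be mimicked in Python with an in-memory SQLite database. Note that this is a stand-in and not the T-SQL above: INSERT OR REPLACE takes the place of the MERGE statement and COALESCE takes the place of ISNULL. The function names and the source name MyPipeline are made up for the sketch.

```python
import sqlite3

DEFAULT_DATE = "1900-01-01T00:00:00"


def create_runs_table(conn):
    """One record per source, just like the Runs table."""
    conn.execute("CREATE TABLE Runs (SourceName TEXT PRIMARY KEY, LastRun TEXT)")


def set_last_run(conn, source_name, last_run):
    """Insert a new record or overwrite the existing one (role of the MERGE)."""
    conn.execute(
        "INSERT OR REPLACE INTO Runs (SourceName, LastRun) VALUES (?, ?)",
        (source_name, last_run),
    )


def get_last_run(conn, source_name):
    """Return the stored datetime, or the default date when there is no record."""
    row = conn.execute(
        "SELECT COALESCE(MAX(LastRun), ?) FROM Runs WHERE SourceName = ?",
        (DEFAULT_DATE, source_name),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
create_runs_table(conn)
before = get_last_run(conn, "MyPipeline")            # no record yet
set_last_run(conn, "MyPipeline", "2020-12-29T10:00:00")
set_last_run(conn, "MyPipeline", "2020-12-29T11:00:00")
after = get_last_run(conn, "MyPipeline")             # latest value wins
print(before, after)
```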

2) Retrieve datetime last run

So the first step in the pipeline is to execute a stored procedure that retrieves the End datetime of the previous run. As mentioned, a default datetime will be returned if there is no previous run available. Because the Stored Procedure activity cannot return output, we will use the Lookup activity instead.

Add a Lookup activity to the pipeline and give it a descriptive name
On the Settings tab add or reuse a Source dataset (and Linked service) that points to the database containing the table and stored procedures of the previous step (don't point to a specific table).
Choose Stored Procedure under the Use query property
Select 'GetLastRun' as Stored procedure name and hit the Import button to get the parameters from the stored procedure
Now either use a hardcoded source name or use an expression like @pipeline().Pipeline to, for example, use the pipeline name as source.

Execute Stored Procedure via Lookup to retrieve last rundate

3) Copy Data Activity

The second step is to retrieve the actual data from the data lake with a Copy Data activity. Two expressions set the filter: the first retrieves the datetime from the previous step and uses it as the Start time filter, and the second retrieves the start datetime of the pipeline itself and uses it as the End time filter.

Add the Copy Data Activity and set it up to load a specific file from the data lake to a SQL Server table (or your own destination)
Now on the Source tab change the File path type to Wildcard file path
Then set the Wildcard file name to, for example, *.csv to read all CSV files instead of a specific file.
Next set the Start time (UTC) property under Filter by last modified to the following expression:
@activity('Get Last Run').output.firstRow.LastRun
Here 'Get Last Run' is the name of the previous activity and LastRun is the column returned by the stored procedure (more details here).
Also set the End time (UTC) property to the following expression:
@pipeline().TriggerTime (this will get the actual start time of the pipeline)
You also might want to add an extra metadata column with the Filename via the Additional columns option (more details here).

Set up wildcard and datetime filters

4) Store datetime for next run

The last step is to save the Start datetime of the pipeline itself as run datetime so that it can be retrieved in the next run. Since this Stored Procedure doesn't have any output parameters we can use the standard Stored Procedure Activity.

Add the Stored Procedure activity and connect it to the previous activity
On the Settings tab reuse the same Linked service as in step 2
Select SetLastRun as the Stored procedure name
Hit the import button and set the parameters

LastRun should be filled with the startdatetime of the pipeline: @pipeline().TriggerTime
SourceName should be filled with the same expression as in step 2

Add Stored Procedure to save rundate

5) Schedule

Now just schedule your pipeline every x minutes or x hours with a trigger to keep your database table up-to-date with files from the data lake. Then keep adding files to the data lake and watch your runs table (step 1) and the actual staging table to see the result. The optional metadata column of step 3 should make debugging and testing a lot easier.

Summary

In this post you learned how to use the wildcard and filter option of the Copy Data activity to create a mechanism to keep your data up-to-date. A downside of this solution is that it will sometimes run the pipeline unnecessarily because no new files were added to the data lake. Another downside is that the process is not realtime.

If you need a more (near-)realtime solution instead of running every x minutes or hours, then you can use the trigger solution, which processes files as soon as they arrive. However, that solution has two downsides. First of all, you are running the pipeline for each file, which means you are paying for each file. Secondly, there is a limit on the number of files that can be triggered per hour, especially when you don't want to (or can't) process files in parallel. The execution queue has a limit of 100 executions per pipeline. After that you will receive an error and miss that file.

Sunday, 27 December 2020

ADF Snack: use stored procedure output

Case
I want to execute a Stored Procedure in Azure Data Factory and use its output further on in the pipeline. The stored procedure activity does not return the output. How do you accomplish that?

Stored Procedure Activity not working for output

Solution
At the moment the Stored Procedure Activity in ADF doesn't handle output parameters, but there is a workaround.

1) Alter Stored Procedure

ADF can't handle output parameters, but you can add a SELECT statement at the end to return the value. Make sure to add an alias, but also make sure to only return one row in your SELECT query.

-- Alter Stored Procedure
ALTER PROCEDURE AddNumbers
(
    -- Parameters
    @number1 as int
,	@number2 as int
,	@Result as int output
)
AS
BEGIN
	-- Do the math
    set @Result = @number1 + @number2
	-- For ADF
	select @Result as Result
END

2) Use Lookup Activity instead

Now instead of using the Stored Procedure Activity we will be using the Lookup Activity.

Add a Lookup activity to the pipeline and give it a descriptive name
On the Settings tab, add a new Source dataset pointing to the database with the Stored Procedure. Leave the table property of the dataset empty (we will use a Stored Procedure instead).
After adding the Source dataset, choose Stored Procedure under the Use query property
Next select your Stored Procedure
Import the parameters
Set the parameter values (output parameters that are not used by ADF should be set to Treat as null)
Make sure the option First row only is enabled
Now debug your pipeline and compare the output of the Lookup activity with the Stored Procedure activity. In the next step we will show you the expression to retrieve this value.

Using the Lookup activity to execute a Stored Procedure

3) Getting the output of the Stored Procedure

Now with an expression in a next activity you can retrieve the output value of the Stored Procedure executed in the Lookup activity: @activity('Do some more math').output.firstRow.Result

In this expression 'Do some more math' is the name of the Lookup activity and Result is the column alias from the SELECT statement:

Compare the red part with the expression

If you want to store the result in a pipeline variable then you might need to add a type conversion: @string(activity('Do some more math').output.firstRow.Result)

Convert int to string
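To make the path in the expression more tangible, here is a small Python sketch that walks through a sample Lookup output in the same way. The sample output is shortened and assumes that First row only is enabled and that the stored procedure returned 3. The helper name first_row_value is made up.

```python
# Hypothetical (shortened) output of the Lookup activity 'Do some more math'.
lookup_output = {"firstRow": {"Result": 3}}


def first_row_value(activity_output, column):
    """Same path as @activity('...').output.firstRow.<column>."""
    return activity_output["firstRow"][column]


result = first_row_value(lookup_output, "Result")
# Counterpart of @string(...): convert the integer before storing it as text.
result_as_text = str(result)
print(result, result_as_text)
```

The column name in the path is the alias from the SELECT statement, so a missing alias in the stored procedure means there is no name to refer to.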

Summary

The Stored Procedure activity is pretty useless when trying to get some output from a Stored Procedure, but there is a simple workaround with the Lookup activity. In the next blog post we will show another example of using the output of a stored procedure within ADF.