Tuesday, 29 December 2020

ADF Snack: get files of last hour

Case
I have a Data Factory pipeline that should run each hour and collect all new files added to the data lake since the last run. What is the best activity or do we need to write code?
No scripts, no loops





Solution
The Copy Data Activity has a wildcard filter which allows you to read multi files (of the same type/format) at once. So no need for a ForEach Activity to process multiple files at once. Combine that with the start- and enddate filter option within that same Copy Data Activity and you can limit the files to a certain period.

Date filter
The End datetime property will be populated with the start-datetime of the current ADF pipeline. So files added during the run of the pipeline will be skipped and processed during the next run. This End datetime will also be stored afterwards for the next run.
The Start datetime will be retrieved from a database table. The previous run of the pipeline stored its End datetime as the Start datetime for the next run.
The basic setup





1) Table and stored procedures
To store (and retrieve) the datetime from the pipeline we use a database table and some Stored Procedures. To keep it easy we kept it very very basic. Feel free to extend it to your own needs. In this solution there will be only one record per source. The SetLastRun stored procedure will either insert a new record or update the existing record via a MERGE statement. The GetLastRun stored procedure will retrieve the datetime of the last run and return a default date if there is no record available.
-- Create runs table
CREATE TABLE [dbo].[Runs](
	[SourceName] [nvarchar](50) NOT NULL,
	[LastRun] [datetime2](7) NULL,
	CONSTRAINT [PK_Runs] PRIMARY KEY CLUSTERED 
	(
	[SourceName] ASC
	)
)

-- Save the lastrundate
CREATE PROCEDURE SetLastRun
(
    @SourceName as nvarchar(50)
,	@LastRun as datetime2(7)
)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON

	-- Check if there already is a record
	-- Then Insert of Update a record
    MERGE dbo.Runs AS [Target]
    USING (SELECT @SourceName, @LastRun) AS [Source] ([SourceName], [LastRun])  
    ON ([Target].[SourceName] = [Source].[SourceName])  
    WHEN MATCHED THEN
        UPDATE SET [LastRun] = [Source].[LastRun]  
    WHEN NOT MATCHED THEN  
        INSERT ([SourceName], [LastRun])  
        VALUES ([Source].[SourceName], [Source].[LastRun]);  
END
GO

-- Retrieve the lastrundate
CREATE PROCEDURE GetLastRun
(
    @SourceName as nvarchar(50)
)
AS
BEGIN
	DECLARE @DefaultDate as datetime2(7) = '1-1-1900'

	-- Retrieve lastrun and provide default date in case null
	SELECT	ISNULL(MAX([LastRun]), @DefaultDate) as [LastRun]
	FROM	[Runs] 
	WHERE	[SourceName] = @SourceName
END
GO

2) Retrieve datetime last run
So the first step in the pipeline is to execute a stored procedure that retrieves the End datetime of the previous run. As mentioned a default datetime will be returned if there is no previous run available. Due the lack of getting output parameters we will use the Lookup activity instead of the Stored Procedure activity.
  • Add a Lookup activity to the pipeline and give it a descriptive name
  • On the Settings tab add or reuse a Source dataset (and Linked service) that points to the database containing the table and store procedures of the previous step (don't point to a specific table).
  • Choose Stored Procedure under the Use query property
  • Select 'GetLastRun' as Stored procedure name and hit the Import button to get the paramaters from the stored procedure
  • Now either use a hardcoded source name or use an expression like @pipeline().Pipeline to for example use the pipeline name as source.
Execute Stored Procedure via Lookup to retrieve last rundate



















3) Copy Data Activity
The second step is to retrieve the actual data from the data lake with a Copy Data activity. With two expressions it will first retrieve the datetime of the previous step and use it as the starttime filter and secondly retrieve the Start datetime of the pipeline itself and use that as Endtime filter.
  • Add the Copy Data Activity and set it up to load a specific file from the data lake to a SQL Server table (or your own destination)
  • Now on the Source tab change the File path type to Wildcard file path
  • Then set the Wildcard file name to for example read all CSV files with *.csv instead of a specific file.
  • Next set the Start time (UTC) property under Filter by last modified to the following expression:
    @activity('Get Last Run').output.firstRow.LastRun. Where the yellow marked text is the name of the previous task and the green marked text is the output of the Stored Procedure (more details here).
  • Also set the End time (UTC) property with the following expression:
    @pipeline().TriggerTime (this will get the actual starttime of the pipeline)
  • You also might want to add an extra metadata column with the Filename via the Additional columns option (more details here).
Set up wildcard and datetime filters





















4) Store datetime for next run
The last step is to save the Start datetime of the pipeline itself as run datetime so that it can be retrieved in the next run. Since this Stored Procedure doesn't have any output parameters we can use the standard Stored Procedure Activity.
  • Add the Stored Procedure activity and connect it to the previous activity
  • On the Settings tab reuse the same Linked service as in step 2
  • Select SetLastRun as the Stored procedure name
  • Hit the import button and set the parameters
    • LastRun should be filled with the startdatetime of the pipeline: @pipeline().TriggerTime
    • SourceName should be filled with the same expression as in step 2
Add Stored Procedure to save rundate
















5) Schedule
Now just schedule your pipeline every x minutes or x hours with a trigger to keep your database table up-to-date with files from the data lake. Then keep adding files to the data lake and watch your runs table (step 1) and the actual staging table to see the result. The optional metadata column of step 3 should make debugging and testing a lot easier.


Summary
In this post you learned how to use the wildcard and filter option of the Copy Data activity to create a mechanism to keep your data up-to-date. A downside of this solution is that it will sometimes run the pipeline unnecessarily because no new files where added to the data lake. An other downside is that the process is not realtime.

If you need a more (near-)realtime solution instead of running every x minutes or hours then you can use the trigger solution. Then you process files as soon as they arrive. However that solution has two downsides. First of all you are running the pipeline for each file. Which means you are paying for each file. Secondly there is a limit for the number of files that can be triggered per hour as specially when you don't want (or can't) process files in parallel. The execution queue has a limit of 100 executions per pipeline. After that you will receive an error and miss that file.

Sunday, 27 December 2020

ADF Snack: use stored procedure output

Case
I want to execute a Stored Procedure in Azure Data Factory and use its output further on in the pipeline. The stored procedure activity does not return the output. How do you accomplish that?
Stored Procedure Activity not working for output
























Solution
At the moment the Stored Procedure Activity in ADF doesn't handle output parameters, but there is a workaround.

1) Alter Stored Procedure
ADF can't handle output parameters, but you can add a SELECT statement at the end to return the value. Make sure to add an alias, but also make sure to only return one row in your SELECT query.
-- Alter Stored Procedure
ALTER PROCEDURE AddNumbers
(
    -- Parameters
    @number1 as int
,	@number2 as int
,	@Result as int output
)
AS
BEGIN
	-- Do the math
    set @Result = @number1 + @number2
	-- For ADF
	select @Result as Result
END

2) Use Lookup Activity instead
Now instead of using the Stored Procedure Activity we will be using the Lookup Activity.
  • Add a Lookup activity to the pipeline and give it a descriptive name
  • On the Settings tab, add a new Source Data pointing to the database with the Stored Procedure. Leave the table property of the dataset empty (we will use a Stored Procedure instead).
  • After adding the Source dataset, choose Stored Procedure under the Use query property
  • Next select your Stored Procedure
  • Import the parameters
  • Set the parameter values (output parameters that are not used by ADF should be Treat as null
  • Make sure the option First row only is enabled
  • Now debug your pipeline and compare the output of the Lookup activity with the Stored Procedure activity. In the next step we will show you the expression to retrieve this value.
Using the Lookup activity to execute a Stored Procedure














3) Getting the output of the Stored Procedure
Now with an expression in a next activity you can retrieve the output value of the Stored Procedure executed in the Lookup activity: @activity('Do some more math').output.firstRow.Result
The yellow part is the name of the Lookup activity and the green part is the result:
Compare the red part with the expression
















If you want to store the result in a pipeline variable then you might need to add a type conversion @string(activity('Do some more math').output.firstRow.Result)

Convert int to string





















Summary
The Stored Procedure activity is pretty useless when trying to get some output from a Stored Procedure, but there is a simple workaround with the Lookup activity. In the next blog post we will show an other example of using the output of a stored procedure within ADF.

Saturday, 14 November 2020

Process Analysis Service with Firewall in ADF only

Case
How do I process my Azure Analysis Services (AAS) within Azure Data Factory (ADF) when the firewall of AAS is turned on? I 'm getting the following error:

Cannot connect to server 'bitools'. Client with IP Address '40.74.24.250' is not allowed to access the server. To enable access, use the Firewall settings in Azure Management Portal. It may take up to 5 minutes for this change to take effect. More information on configuring firewall and setting IP address ranges can be found here: https://go.microsoft.com/fwlink/?linkid=2099963

Azure Anaylis Services with firewall turned on




















Solution
Within ADF you have two options. The first option is to add the IP addresses of ALL data factories within your region to the firewall of AAS. You can download a list and add those IP addresses to the firewall. You could even schedule a powershell script to do that for you to keep your firewall up-to-date in case the list changes. A downside could be that now a lot of data factories can access your AAS.

The second option is to retrieve the public IP address of your current ADF during runtime and then add it temporarily to the firewall of AAS. Note that this is a first attempt which still needs some improvement and testing to make it future proof. First a picture of the solution and then detailed steps of each activity.
The solution













1) Add ADF as contributer to AAS
Since we will chance the service settings of AAS from within ADF, we will need to use Access control (IAM) to make our ADF a contributor for the AAS where we want to change the firewall. Note that this step will not give ADF access to the models, but only to the service.
  • Go to your AAS in the Azure portal
  • In the left menu click on Access control (IAM)
  • Click on + Add and choose Add role assignment
  • In the new Add role assignment pane select Contributor as Role
  • In the Assign access to dropdown select Data Factory
  • Select the right Subscription
  • And then Select your Data Factory and click on the Save button
Add ADF as Contributor to AAS














2) Get Current Public IP Address of ADF
Now back to ADF. You can execute a webrequest to one of the many public services to retrieve the public IP address of the ADF server that runs your pipeline:
  • https://icanhazip.com
  • https://ident.me
  • https://ifconfig.me/ip
  • https://ipinfo.io/ip
  • http://smart-ip.net/myip
Now the steps to take:
  • Create a new pipeline in ADF and add a Web activity
  • Give the activity a descriptive name because we need it in a subsequent activity
  • Add one of the above web addresses as URL on the Settings tab
  • The last step is to select GET as method
In a subsequent activity that is connected to this Web activity we use the follow expression to retrieve the IP Address: @activity('Get Current Public IP Address').output.Response
Get Public IP address of ADF
























3) Get Current AAS settings
The Rest API that sets the firewall settings of AAS can unfortunately not add a single firewall rule. Instead it will delete all rules and then add news ones. To avoid other services or persons losing access to the AAS server we first need to retrieve the current settings. This is done via a Rest API. Therefore the first step is to determine the Rest API URL. Replace in the string below, the <xxx> values with the subscription id, resourcegroup name and servername of your Analysis Services. The Rest API method we will be using is 'Get Details':
https://management.azure.com/subscriptions/<xxx>/resourceGroups/<xxx>/providers/Microsoft.AnalysisServices/servers/<xxx>?api-version=2017-08-01
Example:
https://management.azure.com/subscriptions/aaaa-bbbb-123456-cccc/resourceGroups/RG_production/providers/Microsoft.AnalysisServices/servers/bitoolsserver?api-version=2017-08-01

Add a second Web activity to the pipeline
  • Give the activity a descriptive name because we need it in a subsequent activity
  • Add the above web address as URL on the Settings tab
  • Next select GET as method
  • Under Advanced set Authentication to MSI
  • Set Resource to https://management.azure.com/
The Rest API will return a JSON message with ALL the settings of the service, including the firewall rules. In a next step we can retrieve these settings with the following expression:
@STRING(activity('Get Current AAS settings').output)

Retrieve AAS Service settings
























4) Merge current firewall settings with new ip address
In this step we will merge the current firewall settings (from step 3) with the runtime ip address of ADF (from step 2). Unfortunately the expression language of ADF is not that extensive (yet). For example the indexof expression does not have a search starting position. Therefore for now we have chosen to merge those two with a little Stored Procedure. But please share your suggestion for alternatives in the comments.

First add the following Stored Procedure to one of your Azure SQL databases. It will return both the current firewall setting part of the entire JSON and the new adjusted firewall part. That first one can be used to restore the settings as final step.
T-SQL Stored Procedure
CREATE PROCEDURE [dbo].[UpdateIpV4FirewallSettings]
(
	@jsonmessage nvarchar(max)
,	@firewallRuleName nvarchar(50)
,	@ipAddress nvarchar(20) 
)
AS
BEGIN
	-- =============================================
	-- Recieves the current AAS settings as a JSON
	-- message and extracts the firewall settings
	-- from it. Then the firerullname and the ip
	-- address are added.
	--
	-- The select query returns the new and old
	-- firewall settings to update and restore
	-- afterwards.
	-- =============================================

	SELECT	CAST(
				SUBSTRING(@jsonmessage, 1, 15) + -- properties
				SUBSTRING(@jsonmessage, CHARINDEX('"ipV4FirewallSettings":',@jsonmessage , 1), CHARINDEX(']',@jsonmessage , CHARINDEX('"ipV4FirewallSettings":',@jsonmessage , 1)) - CHARINDEX('"ipV4FirewallSettings":',@jsonmessage , 1)) + -- oldfirewallrules
				iif(CHARINDEX('"firewallRuleName":"' + @firewallRuleName, @jsonmessage , 1) > 0 OR CHARINDEX(@ipAddress, @jsonmessage , 1) > 0,'',',{"firewallRuleName":"' + @firewallRuleName +'","rangeStart":"' + @ipAddress + '","rangeEnd":"' + @ipAddress + '"}') + -- newfirewallrules
				SUBSTRING(@jsonmessage,CHARINDEX('],"enablePowerBIService":',@jsonmessage , 1), CHARINDEX('}',@jsonmessage , CHARINDEX('],"enablePowerBIService":',@jsonmessage , 1)) - CHARINDEX('],"enablePowerBIService":',@jsonmessage , 1) + 1) + -- powerbirules
				'}}' -- end
			as nvarchar(4000))
			as NewFirewallSettings
	,		CAST(
				SUBSTRING(@jsonmessage, 1, 15) + -- properties
				SUBSTRING(@jsonmessage, CHARINDEX('"ipV4FirewallSettings":',@jsonmessage , 1), CHARINDEX(']',@jsonmessage , CHARINDEX('"ipV4FirewallSettings":',@jsonmessage , 1)) - CHARINDEX('"ipV4FirewallSettings":',@jsonmessage , 1)) + -- oldfirewallrules
				SUBSTRING(@jsonmessage,CHARINDEX('],"enablePowerBIService":',@jsonmessage , 1), CHARINDEX('}',@jsonmessage , CHARINDEX('],"enablePowerBIService":',@jsonmessage , 1)) - CHARINDEX('],"enablePowerBIService":',@jsonmessage , 1) + 1) + -- powerbirules
				'}}' -- end
			as nvarchar(4000))
			as oldFirewallSettings
END

Now back to the pipeline to merge the json message and the ip address. We will solve this by calling the above Stored Procedure within a Lookup activity.
  • Add the Look up activity and give it a descriptive name (we need it in the next step)
  • Connect both previous activities to this new activity (in parallel, see screenshot)
  • Go to the settings tab and select the (/ create a) Source dataset that connects to the Azure SQL Database that contains the above Stored Procedure
  • Select Stored Procedure under Use query
  • Select the new Stored Procedure under Name
  • Hit the Import parameter button to add three string parameters
  • Enter the following expressions
    • firewallRuleName - @concat('ADF-',pipeline().DataFactory)
    • ipAddress - @activity('Get Current Public IP Address').output.Response
    • jsonmessage - @STRING(activity('Get Current AAS settings').output)
The output of this activity will be use in the next activity:
@activity('IpV4FirewallSettings Message').output.firstRow.NewFirewallSettings
@activity('IpV4FirewallSettings Message').output.firstRow.OldFirewallSettings
Lookup Activity calling Stored Procedure























5) Update AAS Firewall
In this step we will update the firewall settings of AAS to allow ADF to process the tabular model. This is done via a Rest API call of the Update method within a Web activity. This is exactly the same URL as step 3. Copy it

Add an other Web activity to the pipeline and connect it to the last activity (see screenshot).
  • Give the activity a descriptive name
  • Add the web address from step 3 as URL on the Settings tab
  • Next select PATCH as method (instead of GET like in step 3)
  • Click on + New after Headers to add a new header
    • Use Content-type as Name
    • Use application/json as Value
  • As body we will use the NewFirewallSettings from the previous step:
    @activity('IpV4FirewallSettings Message').output.firstRow.NewFirewallSettings
  • Under Advanced set Authentication to MSI
  • Set Resource to https://management.azure.com/
Update AAS Firewall
























6) Process AAS model
Finally we can do the actual processing of the tabular model. Since we all ready described this in detail in a previous post you have to take a little trip to that post. Make sure to connect the new Web activity to the last activity in the pipeline (see screenshot).
Process AAS tabular model




















7) Update AAS Firewall
The last step is to restore the firewall settings so this ADF server will no longer have access to your AAS. The steps are almost identical to step 5 with two differences. Copy and paste that Web activity and connect it to the process activity, but then change the green Success line to a blue completion line. Then it will always restore the firewall settings even if the process fails.
Change the dependency condition between the activities















Next change  is the Body property on the Settings pane. Change it to retrieve the OLD firewall settings:
@activity('IpV4FirewallSettings Message').output.firstRow.OldFirewallSettings
Restore AAS firewall


















Summary
In this post you saw a (experimental) solution to handle the firewall problem: adding the IP address of a shared service/server to your AAS firewall. This is especially necessary when your service doesn't support VNET (like AAS).

There is still a lot of room for improvements. In particular the merge part in step 4. An other concern is that the activities of one pipeline could be executed by multiple ADF servers (each having there own ip address). 

Feel free to experiment with this solution and leave your suggestions for improvements in the comments below. However don't use it in a production environment... or do it at your own risk.

I a feature post we will download the JSON file with Azure IP Address and at those to the firewall.



Sunday, 1 November 2020

ADF Snack: Adding metadata columns

Case
How do I add (and populate) metadata columns to my stage tables from within Azure Data Factory without executing a Stored Procedure with an update query afterwards?
Adding metadata columns in ADF












Solution
The Copy Data activity has on the Source pane a section called Additional columns that allows you to create extra columns and populate them with values from Data Factory like the run id, run date, pipeline name and file path. It also allows you to copy columns (f.e. to populate two destination columns with the same source value), but It does not allow you to add fancy stuff like a hash of all columns. For the fancy ETL stuff you should take a look at Mapping Data Flows or solve it afterwards with that Stored Procedure activity and some TSQL code.

1) Copy data acticity
For this example we will use a CSV file as source and add a couple of new metadata columns that are not available in the CSV file itself. We will also copy one of the existing columns to a new column. This is very useful if you want to populate two destinations columns with the value from one source column.
  1. Go to your existing Copy Data Activity and then to the Source pane.
  2. At the bottom you will find the Additional columns section. Click on the + New icon to add new columns.
  3. The first column is using an out of the box value by selecting $$FILEPATH in the value column. It retrieves the filename of the source file. Note that this is of course only available when your source is a file.
  4. The second column is using the column copy option. Select $$COLUMN in the value column and then in the third column the column name from the file that you want to copy. If no existing columns are available then hit the Refresh icon in the Additional columns section (or check the dataset).
  5. The third column is using a hardcoded text describing the source application where the file came from.
  6. The next three columns are filled via the dynamic content option to add values from Data Factory. In this example @pipeline().RunId, @pipeline().TriggerTime and @pipeline().Pipeline. You could also add columns with values from parameters or variables.
Adding six additional (metadata) column





















2) The result
To see the result of adding the additional columns click on the Preview data icon. Next you can optionally change the values of the expressions. To see the final preview click on the OK button.
Preview of the Additional columns
















Summary
In this post you learned how to add extra columns with metadata information without using UPDATE queries in an Stored Procedure Activity. The options are a bit basic, but with for example the Get Metadata activity you could retrieve additional information from the file like the last modified date or file size and add that as an Additional column as well.
Adding additional info from Get Metadata activity




Thursday, 15 October 2020

Synapse snack: No new Azure SQL Database allowed

Case
I'm creating a new Synapse Analytics Workspace but it throws an error when deploying it: Location '' is not accepting creation of new Windows Azure SQL Database servers at this time.
Location '' is not accepting creation of new
Windows Azure SQL Database servers at this time.
(Code: SqlServerRegionDoesNotAllowProvisioning)










Solution
This doesn't mean that all Azure SQL Server are sold-out in your Azure region and that you have to wait until Microsoft buys more servers. It means SQL Server is not registered as an allowed resource provider on your Azure Subscription.

This will probably not happen very often because people who want to use Synapse have almost certainly used an Azure SQL Server before, but on a new subscription it is not registered by default.

When creating a new Synapse Workspace it does ask to first register the Microsoft.Synapse resource provider, but it doesn't mention that it also requires Microsoft.Sql.
The Synapse resource provider needs to be registered
to this subscription
















Register Resource Providers
The solution is very simple, but it does require owner permissions on the Azure Subscription.
  • In the Azure Portal go to Subscriptions (search for Subscriptions if it's not on the dashboard or in the menu)
  • Click on your Subscription if you have multiple subscriptions to go to the overview page of your subscription
  • In the left menu (at the bottom) click on Resource Providers
  • Search for SQL
  • Select Microsoft.SQL and click on the register button
Register a Resource Provider




















Registering will take a couple of minutes and after that you can add a Synapse Workspace without the above error. One last observation... it doesn't create any (visible) Azure SQL Servers in any of the resource groups.
Deployment of Synapse Workspace succeeded


Tuesday, 13 October 2020

Execute pipelines from an other Data Factory Sync

Case
Can I use the Execute Pipeline Activity to execute pipelines from an other Data Factory? Or do we need an other activity for this?
Can you use the Execute Pipeline Activity for this?
























Solution
First of all why would you like to have multiple Data Factories other then for a DTAP-street (Development, Testing, Acceptance and Production). There are several reasons, for example:
  • Different departments/divisions each having their own Data Factory
    For example to prevent changes to your pipelines by users of a different department or to make it easier to split the Azure consumption between two divisions.
  • Different regions for international companies
    Either due legal reasons where for example the data may not leave the European Union
    Or to prevent paying unnecessary outbound data costs when your data is spread over different regions
  • Security reasons
    To prevent others to use the access provided via the Managed Identity of ADF. If you give your ADF access via MSI to an Azure Key Vault or an Azure Storage Account then everybody using that ADF can access that service via ADF.
If you have any other good reasons to use multiple Data Factories please let us know in the comments below.

And although you may have multiple Data Factories you could still use one Data Factory to execute pipelines from a different Data Factory. However you cannot use the Execute Pipeline Activity because it can only execute pipelines within the same Data Factory. 

You can either use the Web Activity or the Web Hook Activity for this. The Web Activity always executes the pipeline asynchronous. This means it does not wait for the result. The Web Hook Activity executes the child pipeline synchronous. Which means it waits until the child pipeline is ready and you could also retrieve the execution result via a call back. In this blog post we will show the synchronous Webhook Activity and in a previous blog post we already showed the asynchronous Web Activity.

1) Give parent access to child via MSI
We will not use a user to execute the pipeline in the child(/worker) Data Factory, but instead we will give the managed identity (MSI) of the parent(/master) Data Factory access to the child(/worker) Data Factory. The minimum role needed is Data Factory Contributor, but you could also use a regular Contributor or Owner (but less is more).
  • Go to the child(/worker) Data Factory (DivisionX in this example)
  • In the left menu click on Access control (IAM)
  • Click on the +Add button and choose Add role assignment
  • Select Data Factory Contributor as Role
  • Use Data Factory as Assign access to
  • Optionally change the subscription
  • Optionally enter a (partial) name of your parent ADF (if you have a lot of data factories)
  • Select your parent ADF and click on the Save button
Give one ADF access to other ADF















2) Add Call Back to Child Pipeline
The parent(/master) pipeline will call the child(/worker) and wait until it receives a call back or until the timeout exceeds. Therefor we need to add a call back activity (in the form of a Web/Webhook activity) to the child pipeline. This activity should be the last activity in your pipeline. There can be more (for example one for success and one for failure), but only the first call back will be handled by the parent.
  • Go to the child(/worker) Data Factory (DivisionX in this example)
  • Open ADF via the Author &Monitor link
  • Open the pipeline that you want to execute from the parent(/master) pipeline
  • Add a String parameter called callBackUri. The value will be automatically provided by the parent
  • Add a String parameter called myInputParam1. The value will be manually provided by the parent
  • Add a Web Activity with the following settings
    • URL: @pipeline().parameters.callBackUri   (to retrieve the pipeline parameter)
    • Method: POST
    • Body: {"Output":{"myOutputParam1":"failed"},"StatusCode":"401"}
      Every status code above 399 will tell the parent that the child failed. This can be a random number, but you could also use the official list to give a little more meaning. An other option is to use the output tag which can contain one or more 'output' parameters that can be read by the parent pipeline. You could for example pass through the error message with the Add dynamic content option.

      If you also want to pass an error description to the parent then you must extend the json message with an error tag {"Output":{"myOutputParam1":"failed"},"StatusCode":"401","error":"ErrorCode":"ParameterError","Message":"Required parameters where not provided"}}
Add two pipeline parameters










Web Activity for callback 













3) Determine URL
This solution will call the Create Run RestAPI of ADF to execute the pipeline. For this you need to replace the marked parts of the URL below by the Subscription ID, Resource Group name, Data Factory name and the Pipeline name of the child(/worker) pipeline. We will use this URL in the next step.

Example URL:
https://management.azure.com/subscriptions/aaaaaa-bbbb-1234-cccc-123456789/resourceGroups/DivisionX/providers/Microsoft.DataFactory/factories/DevisionX-ADF/pipelines/MyChildPipeline/createRun?api-version=2018-06-01


4) Webhook Activity
So for this second example we will use the Webhook Activity. This will execute the pipeline of the child(/worker) pipeline, but now it will wait until it receives a call back or until the timeout exceeds. Besides a status it can also retrieve messages from the child(/worker) ADF. See the json message in step two.
  • Go to your master ADF and click on Author & Monitor
  • Create a new pipeline and add a Webhook Activity to the canvas of the new pipeline
  • Give it a suitable descriptive name on the General Tab
  • Go to the Settings tab and enter the URL of the previous step
  • Choose POST as Method
  • Add a new header called Content-Type and with value application/json
  • As Body enter a JSON message. This could either be a dummy message or you could supply parameters in this message. The child parameter is called myParam1: {"myInputParam1":"bla bla"}
  • Use MSI as Authentication method
  • Enter this URL as Resource: https://management.core.windows.net/
Webhook Activity calling a child pipeline in an other ADF























5) Testing
Now trigger the new parent pipeline and check the monitor of the child ADF. You will see it receives two parameters. One provided in the JSON message in de BODY property and one provided by the Webhook activity itself.
Child pipeline - two parameters










Note that the child pipeline did not fail, because we handled the error. However the parent pipeline did fail because we sent a status code higher than 399. Next check the monitor of the child ADF and see the output parameters of the Webhook activity. It received a value from the child pipeline that could be used in a next activity
Parent pipeline - the result of the callback















Summary
In this blog post you learned how to give one ADF access to an other ADF and how to execute pipelines in that other ADF. The Webhook activity solution gives you a little more control compared to the Web activity. And with the correct json messages you can pass through messages and parameters from the parent to the child and back to the parent.

.

Saturday, 10 October 2020

Snack: Create Azure Automation Runbook

Case
How do you write and execute PowerShell code in Azure?

Azure Automation PowerShell Runbook



















Solution
There are several options to execute PowerShell code in Azure. For example in an Azure Function or in an Azure DevOps pipeline. This post shows you how to write and execute PowerShell code in an Azure Automation Runbook. Other blog posts with PowerShell solutions for DWH projects will link to this blogpost (to don't repeat ourselves). You can find them here.

Note: the screenshots and animated gifs could be slightly outdated on the next layout change of the Azure portal, but we will try to occasionally update those.

1) Create Azure Automation Account
To create a PowerShell runbook we first we need to create an Automation Account. If you already have one with the Run As Account enabled then you can skip this step. The Run As Account allows you to login and easily interact with other Azure Services.
  • Go to the Azure portal and create a new resource
  • Search for automation
  • Select Automation Account
  • Choose a useful name for the Automation Account (it will probably host multiple runbooks)
  • Select your Subscription, Resource Group and the Region
  • Most examples will use the Azure Run As account. So make sure to enable it and then click on the Create button.
Create Azure Automation Account


















2) Add Modules
Before you start writing code you often first need to add some PowerShell modules to your Azure Automation Account. For this example we will add a PowerShell module called Az.Sql which for example is used to up- and downscale Azure SQL databases. Note that modules often depend on other modules like this one depends on Az.Accounts. Make sure to add those modules first, but you will be notified if an other module is required. Try to avoid the outdated AzureRm modules and use the Az modules instead (you cannot mix them).

If you forget this step you will get error messages while running your code that state that some of your commands are not recognized:
Get-AzSqlDatabase : The term 'Get-AzSqlDatabase' is not recognized as the name of a cmdlet, function, script 
file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct 
and try again.
  • Go to the newly created Azure Automation Account
  • Go to Modules in the left menu
  • Click on the Browse Gallery button
  • Search for Az.Accounts
  • Click on Az.Accounts in the result and import this module
  • Also search for Az.Sql (but wait until Az.Accounts is actually imported)
  • Click on Az.Sql in the result and import this module
Adding a new module to your Automation Account


















Note: if you are using an existing Automation Account then you probably already added Az.Accounts, but you might need to update it to a newer version.

2b)  Update Modules
Sometimes you need to update one or more modules within your automation account. Not sure why, but unfortunately the removed this option from the Azure Automation gui! More information here. The new way is to import PowerShell code to a new runbook and then run that runbook to update all your modules (or manually delete all the modules one by one and then add them again). 
  • Go to github and click on the little triangle on the Code button to download the zip file
  • Unzip the downloaded file
  • Go to runbooks and click on Import a runbook
  • Import the local file Update-AutomationAzureModulesForAccount.ps1
  • Give the runbook a description (Name and Runbook type are automatically populated)
  • Edit the runbook and hit the testpanel
  • Enter values for the parameters, besides ResourceGroupName and AutomationAccountName fill in the AzureModuleClass with Az (default value is AzureRm)
  • Hit the start button and check
Update all modules













3) Create Runbook
Now we are ready to create a PowerShell runbook in the newly created Azure Automation Account and start writing some PowerShell code.
  • Go back to the overview page of your newly created Azure Automation Account
  • Click on Runbooks in the left menu
  • Click on the + Create a runbook button to create a new Runbook
  • Enter a descriptive name for the Runbook like UpDownScaleDb
  • Select PowerShell as Runbook type
  • Add a short description to explain the purpose your code and click on the Create button
Create PowerShell Runbook



















4) Edit Runbook code
Next edit the new Runbook if it wasn't already opened by the previous step and start writing (or pasting) code in the editor. It often exists of three parts: Parameters, Login and the actual code. Make sure not to store secrets in you code, but use Azure Key Vault instead. Want to send a notification from within your runbook, then use SendGrid.
# PowerShell example code for testing
Param
(
    # Get your name
    [Parameter(Mandatory=$true,Position=1)]
    [ValidateLength(1,50)]
    [string]$Name
)

Write-Output "Hello $($Name)"

5) Testing
It is often easier to write and test the main code first in PowerShell ISE or Visual Studio Code on your local machine which is much faster and easier to debug. But to test the complete code including parameters and the login you can test your code within the runbook editor. Note that it can take over a minute before the code executes since it will first enter an execution queue.
  • Click on the Test pane button above your script.
  • Then optionally fill in the parameter values on the left side
  • And then hit the Start button to execute the script.
Testing your runbook


















6) Scheduling Runbook
If you want to schedule the execution of your runbook in Azure Automation you first need to publish it via the Runbook editor. After it has been published you can add a schedule to this runbook.
  • Edit the script in the runbook editor
  • Click on publish (the editor will close and you will be redirected to the overview page)
  • In the overview page click on Link to schedule
  • In the Schedule menu you can select an existing schedule or create a new one
  • In the Parameter menu you can provide the value for the parameters
Add schedule to runbook


















7) Add Webhook
If you don't want to schedule your Runbook, but call it from an other service like Azure Data Factory, you have to create a Webhook for ADF

Make sure to choose a correct expire date. By default it is only valid for one year. Also make sure to copy the URL, because you can only see and copy it once. You need this URL in ADF for the Web(hook) activity. Don't share the URL because it is a URL and Password in one.
Adding a Webhook to a Runbook


















Summary
In this post a general explanation on how to create and use an Azure Automation PowerShell Runbook. The real solutions are available here and from now on those posts about Azure Automation will link to this post for the general steps.






Friday, 2 October 2020

Pause all/an Azure Synapse SQL Pool(s) (Az)

Case
I want to schedule a pause of my Azure Synapse SQL Pools to save some money on my Azure bills. Back in 2017 when Synapse was still called SQL Data Warehouse we wrote a post with the AzureRM modules which are now outdated. How does it work with AZ?
Pause and resume Synapse SQL Pools
























Solution
For this example we will have two scripts. The first to stop all Azure Synapse SQL Pools within an Azure subscription. This is especially handy for Development, Test or Acceptance environments where you only turn on the Synapse SQL Pools when you need to develop or test something. When you forget to pause them afterwards this script will pause them all on a scheduled moment. The second script is for pausing (or resuming) a specific Synapse SQL Pool. Probably more suitable for a production environment where you don't want to ruthlessly pause all Synapse SQL Pools.

Note: All basics to create your first Azure Automation Runbook can be found here. Combine that with the code below.

1) Modules
If you want to use this script in an Azure Automation Runbook you first need to add the module Az.Sql and that module first wants the module Az.Accounts to be installed. If you already added AzureRM modules to your Azure Automation account then that is no problem. As long as you only don't mix them up in your runbooks. On your own PC you cannot have both installed. You first need to remove all AzureRm modules before you can add Az modules. 
  • Go to your Azure Automation account in the Azure Portal
  • Click on Modules in the left menu
  • Click on Browse Gallery (not on Add a module)
  • Search for Az.Sql and click on it
  • Next click on Import and then on the Ok button
Note 1: If you haven't installed Az.Account then it will ask you to do that first.
Note 2: It takes a few minutes to import a module
Adding a module

















2) Login
If you want to run the script in PowerShell ISE then you first need to login and select your subscription if you have multiple subscriptions.
# PowerShell code
# Login to Azure (browser popup will appear)
Connect-AzAccount -Confirm

# Optional: select your subscription
Set-AzContext -SubscriptionName "mysubscription"
Login to Azure with PowerShell ISE





















In your Runbook you need to add the following code instead which uses your Run as Account to login. Please read this blogpost for all details.
# PowerShell code
########################################################
# Log in to Azure with AZ (standard code)
########################################################
Write-Verbose -Message 'Connecting to Azure'

# Name of the Azure Run As connection
$ConnectionName = 'AzureRunAsConnection'
try
{
    # Get the connection properties
    $ServicePrincipalConnection = Get-AutomationConnection -Name $ConnectionName        
 
    'Log in to Azure...'
    $null = Connect-AzAccount `
        -ServicePrincipal `
        -TenantId $ServicePrincipalConnection.TenantId `
        -ApplicationId $ServicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $ServicePrincipalConnection.CertificateThumbprint 
}
catch 
{
    if (!$ServicePrincipalConnection)
    {
        # You forgot to turn on 'Create Azure Run As account' 
        $ErrorMessage = "Connection $ConnectionName not found."
        throw $ErrorMessage
    }
    else
    {
        # Something else went wrong
        Write-Error -Message $_.Exception.Message
        throw $_.Exception
    }
}
########################################################

3) The pause all script
Below the above script we need to add the following script which first gets all Azure SQL Servers and then checks whether they also host a Synapse SQL Pool. When one is found it checks the status. If it is still Online (active) then it will pause it. This will take several minutes. Afterwards it will recheck the status and show how long it took to pause. Feel free to add more checks and let us know in the comments what you added.
# PowerShell code
# Get all SQL Servers to check whether they host a Synapse SQL Pool
$allSqlServers = Get-AzSqlServer

# Loop through all SQL Servers
foreach ($sqlServer in $allSqlServers)
{
    # Log which SQL Servers are checked and in which resource group
    Write-Output "Checking SQL Server [$($sqlServer.ServerName)] in Resource Group [$($sqlServer.ResourceGroupName)] for Synapse SQL Pools"
    
    # Get all databases from a SQL Server, but filter on Edition = "DataWarehouse"
    $allSynapseSqlPools = Get-AzSqlDatabase -ResourceGroupName $sqlServer.ResourceGroupName `
                                           -ServerName $sqlServer.ServerName `
                                           | Where-Object {$_.Edition -eq "DataWarehouse"}
    # Loop through each found Synapse SQL Pool
    foreach ($synapseSqlPool in $allSynapseSqlPools)
    {
        # Show status of found Synapse SQL Pool
        # Available statuses: Online Paused Pausing Resuming
        Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] found with status [$($synapseSqlPool.Status)]"
        
        # If status is online then pause Synapse SQL Pool
        if ($synapseSqlPool.Status -eq "Online")
        {
            # Pause Synapse SQL Pool
            $startTimePause = Get-Date
            Write-Output "Pausing Synapse SQL Pool [$($synapseSqlPool.DatabaseName)]"
            $resultsynapseSqlPool = $synapseSqlPool | Suspend-AzSqlDatabase

            # Show that the Synapse SQL Pool has been pause and how long it took
            $endTimePause = Get-Date
            $durationPause = NEW-TIMESPAN –Start $startTimePause –End $endTimePause
            $synapseSqlPool = Get-AzSqlDatabase -ResourceGroupName $sqlServer.ResourceGroupName `
                                                -ServerName $sqlServer.ServerName `
                                                -DatabaseName $synapseSqlPool.DatabaseName
            Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] paused in $($durationPause.Hours) hours, $($durationPause.Minutes) minutes and  $($durationPause.Seconds) seconds. Current status [$($synapseSqlPool.Status)]"
        }
    }
}
In PowerShell ISE the result will look this. To see the result of your runbook you need to check the jobs and the the output.
Running the script in PowerShell ISE







4) The pause one script - parameters
To only select one Synapse SQL Pool we need to provide three parameters: Resource Group name, SQL Server name and the name of the SQL Pool. For this we will add parameter code at the beginning of the complete script (above login). You can add parameter validations to make it more monkey proof.
# PowerShell code
<#
    .SYNOPSIS
        Pause an Azure Synapse SQL Pool
    .DESCRIPTION
        By providing the following parameters you can pause one
        specific Azure Synapse SQL Pool. It will only pause when
        the status is 'online'

    .PARAMETER resourceGroupName
        This is the Resource group where Azure Synapse Analytics
        SQL Pool is located

    .PARAMETER sqlServerName
        This is the name of the Azure SQL Server hosting the Azure
        Synapse Analytics SQL Pool
    
    .PARAMETER SynapseSqlPoolName
        This is the name of the Azure Synapse Analytics SQL Pool

#>
Param(
    # This is the Resource group where Azure Synapse Analytics SQL Pool is located   
    [Parameter(Mandatory=$True)]  
    [String] $resourceGroupName
    ,
    # This is the name of the Azure SQL Server hosting the Azure Synapse Analytics SQL Pool
    [Parameter(Mandatory=$True)]  
    [String] $sqlServerName
    ,
    # This is the name of the Azure Synapse Analytics SQL Pool
    [Parameter(Mandatory=$True)]  
    [String] $SynapseSqlPoolName
) 

5) The pause one script
Now the actual pause one script to pause only one SQL Pool. This replaces the script of step 3. If you also want a resume script you just have to replace Pause-AzSqlDatabase by Resume-AzSqlDatabase. and of course chancing some of the texts and if statement (Online => Paused). You could also merge the pause and resume script into one script by adding an extra parameter to indicate what you want to do.
# PowerShell code
# Get one specific Synapse SQL Pool
$synapseSqlPool = Get-AzSqlDatabase -ResourceGroupName $resourceGroupName `
                                    -ServerName $sqlServerName `
                                    -DatabaseName $SynapseSqlPoolName `
                                    | Where-Object {$_.Edition -eq "DataWarehouse"}

# Check if the Synapse SQL Pool can be found with the provided parameters
if ($synapseSqlPool)
{
    # Show status of found Synapse SQL Pool
    # Available statuses: Online Paused Pausing Resuming
    Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] found with status [$($synapseSqlPool.Status)]"

    # If status is online then pause Synapse SQL Pool
    if ($synapseSqlPool.Status -eq "Online")
    {
        # Pause Synapse SQL Pool
        $startTimePause = Get-Date
        Write-Output "Pausing Synapse SQL Pool [$($synapseSqlPool.DatabaseName)]"
        $resultsynapseSqlPool = $synapseSqlPool | Suspend-AzSqlDatabase

        # Show that the Synapse SQL Pool has been pause and how long it took
        $endTimePause = Get-Date
        $durationPause = NEW-TIMESPAN –Start $startTimePause –End $endTimePause
        $synapseSqlPool = Get-AzSqlDatabase -ResourceGroupName $resourceGroupName `
                                            -ServerName $sqlServerName `
                                            -DatabaseName $SynapseSqlPoolName
        Write-Output "Synapse SQL Pool [$($synapseSqlPool.DatabaseName)] paused in $($durationPause.Hours) hours, $($durationPause.Minutes) minutes and  $($durationPause.Seconds) seconds. Current status [$($synapseSqlPool.Status)]"
    }
}
else
{
    Throw "Synapse SQL Pool [$($SynapseSqlPoolName)] not found. Check parameter values."
}

Running the script in PowerShell ISE





Summary
In this post you saw how to pause one or more Synapse SQL Pools to save some money on your Azure bill, but note that the storage costs will continue when you pause Synapse. Next step is to either schedule it within your Azure Automation account or to add a Webhook and execute it from an other service like Azure Data Factory

Besides scripting you can also use the Rest API of Synapse which is particularly easy if you want to pause or resume from within an ADF pipeline, but that will be explored in an other blogpost.