Saturday, 18 September 2021

ADF Build - missing arm-template-parameters-definition.json

Case
I'm using the new and improved ARM export via npm to generate an ARM template for my Data Factory so I can deploy it to the next environment, but the Validate step and the Validate and Generate ARM template step both throw an error saying that the arm-template-parameters-definition.json file can't be found. This file isn't mentioned in the steps from the documentation. How do I add this file and what content should be in it?
Unable to read file: arm-template-parameters-definition.json

ERROR === LocalFileClientService: Unable to read file: D:\a\1\arm-template-parameters-definition.json, error: {"stack":"Error: ENOENT: no such file or directory, open 'D:\\a\\1\\arm-template-parameters-definition.json'","message":"ENOENT: no such file or directory, open 'D:\\a\\1\\arm-template-parameters-definition.json'","errno":-4058,"code":"ENOENT","syscall":"open","path":"D:\\a\\1\\arm-template-parameters-definition.json"}
  
WARNING === ArmTemplateUtils: _getUserParameterDefinitionJson - Unable to load custom param file from repo, will use default file. Error: {"stack":"Error: ENOENT: no such file or directory, open 'D:\\a\\1\\arm-template-parameters-definition.json'","message":"ENOENT: no such file or directory, open 'D:\\a\\1\\arm-template-parameters-definition.json'","errno":-4058,"code":"ENOENT","syscall":"open","path":"D:\\a\\1\\arm-template-parameters-definition.json"}
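
Note: this message comes from the validate and export commands of the @microsoft/azure-data-factory-utilities npm package, which look for arm-template-parameters-definition.json in the root folder you pass to them (here D:\a\1). A minimal sketch of those calls, assuming the package is set up as described in the documentation and with placeholder values for the root folder and factory id:

# Validate all Data Factory resources in the given root folder
npm run build validate <rootFolder> /subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.DataFactory/factories/<dataFactoryName>

# Validate and generate the ARM template into the ArmTemplate output folder
npm run build export <rootFolder> /subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.DataFactory/factories/<dataFactoryName> "ArmTemplate"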

Solution
This step is indeed not mentioned in that new documentation, but it can be found if you know where to look. The messages state that it is an error, but the build will simply continue with a default file. Very annoying, but not blocking for your pipeline. To solve it we need to follow these steps:

1) Edit Parameter configuration
Go to your development ADF and open Azure Data Factory Studio. In the left menu click on the Manage icon (a toolbox) and then click on ARM template under Source Control. Now you will see the option 'Edit parameter configuration'. Click on it.
Edit parameter configuration

2) Save Parameter configuration
Now a new JSON file will be opened (that you can adjust to your needs, but more on that in a later post) and in the Name box above you will see 'arm-template-parameters-definition.json'. Click on the OK button and go to the Azure DevOps repository.
arm-template-parameters-definition.json

3) The result
In the Azure DevOps Repository you will now find a new file in the root of the ADF folder where the subfolders like pipeline and dataset are also located. Run the DevOps pipeline again and you will notice that the error and warning are gone.
The new file has been added to the repository by ADF

Note that you only have to do this for the development Data Factory (not for test, acceptance or production) and that the ARM template parameter configuration is only available for git-enabled data factories.

Conclusion
In this post you learned how to solve the arm-template-parameters-definition.json not found error/warning. The next step is to learn more about this feature and its use cases. Most often it is used to add extra parameters for options that aren't parameterized by default. This will be explained in a next post.

In another follow-up post we will describe the entire Data Factory ARM deployment where you don't need to hit that annoying Publish button within the Data Factory GUI. Everything (CI and CD) will be a YAML pipeline.

Thanks to colleague Roelof Jonkers for helping.

Friday, 17 September 2021

ADF Release - ResourceGroupNotFound

Case
I'm using the Pre- and Post-deployment PowerShell script from Microsoft within my ADF DevOps deployment, but it gives an error that it cannot find my resource group, although I'm sure I gave the Service Principal enough access to this resource group (and I checked for typos). What is wrong and how can I solve this?
ResourceGroupNotFound

##[error]HTTP Status Code: NotFound
Error Code: ResourceGroupNotFound
Error Message: Resource group 'RG_ADF_PRD' could not be found.

Solution
If the Service Principal indeed has enough permissions on the Azure Resource Group, then your Service Principal probably has access to more than one Azure Subscription. The PowerShell cmdlet Get-AzDataFactoryV2Pipeline within this script can only get Data Factories from the active subscription, and only one subscription can be active at a time. If your resource group is not in that active subscription, it will not be found.
Get-AzDataFactoryV2Pipeline -ResourceGroupName "RG_ADF_PRD" -DataFactoryName "DataFactory2-PRD"
WARNING: TenantId '4e8e12ea-d5b6-40f1-9988-4b09705c2595' contains more than one active subscription. 
First one will be selected for further use. To select another subscription, use Set-AzContext.
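
To check which subscription is actually active for the Service Principal, a quick sketch with two standard Az cmdlets (run in the same session/task as the deployment script):

# Show the currently active context (the subscription the Az cmdlets will run against)
Get-AzContext

# List all subscriptions this Service Principal has access to
Get-AzSubscription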

The best solution is to create a Service Principal for each subscription where you want to deploy ADF and then give each Service Principal only access to one Azure subscription.
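
A sketch of what that could look like with Azure PowerShell; the display name, role and scope below are placeholders and the property holding the application id may differ per Az version:

# Create a dedicated Service Principal for the production subscription
$sp = New-AzADServicePrincipal -DisplayName "sp-adf-deploy-prd"

# Grant it access to only the resource group it needs within that subscription
# (use $sp.ApplicationId instead of $sp.AppId on older Az versions)
New-AzRoleAssignment -ApplicationId $sp.AppId `
                     -RoleDefinitionName "Data Factory Contributor" `
                     -Scope "/subscriptions/<subscriptionId>/resourceGroups/RG_ADF_PRD"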

The alternative is to add one extra parameter to the PowerShell script for the subscription ID (or name) and then use the PowerShell cmdlet Set-AzContext to activate the correct Azure subscription.
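A minimal sketch of that extra parameter in the param block of Microsoft's example script (the existing parameters are omitted and the name $Subscription is just a suggestion), followed by the Set-AzContext call near the top of the script:

param (
    # ... existing parameters of the pre/post-deployment script ...
    [parameter(Mandatory = $false)] [String] $Subscription   # subscription ID or name
)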
Set-AzContext -Subscription $Subscription

In your YAML script you need to add the extra parameter and then either add the subscription ID (or name) hardcoded or, much better, as a variable from the DevOps Variable Group (Library). If you are using the old Release pipelines then hit the three dots behind the Script Arguments textbox to add the extra parameter.
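
The script arguments could then look something like this (a sketch based on Microsoft's documented arguments; the $(...) variables are hypothetical variable group variables, the template path is a placeholder and the backticks are only line continuations for readability):

-armTemplate "$(System.DefaultWorkingDirectory)/ArmTemplate/ARMTemplateForFactory.json" `
-ResourceGroupName "$(ResourceGroupName)" `
-DataFactoryName "$(DataFactoryName)" `
-predeployment $true `
-deleteDeployment $false `
-Subscription "$(SubscriptionId)"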
add parameter to YAML

Conclusion
In this post you learned how to fix the ResourceGroupNotFound error during the Pre- and Post-deployment script execution. The best/safest solution is to minimize access for each Service Principal; the workaround is to add two lines of code to Microsoft's example script.

In a next post we will describe the entire Data Factory ARM deployment where you don't need to hit that annoying Publish button within the Data Factory GUI. Everything (CI and CD) will be a YAML pipeline.

Thanks to colleague Roelof Jonkers for helping.

Friday, 3 September 2021

Virtual Network Data Gateway for Power BI

Case
My company doesn't allow public end points on my Azure resources like the Azure SQL Database and the Azure Data Lake. Now Power BI cannot use these sources directly and therefore we have to install an On-premises Data Gateway on an Azure Virtual Machine within the same VNET as my sources (or in a peered VNET). Is there an alternative for this VM solution?
Data Gateways and one with a different icon

Solution
Yes, there is a promising new alternative to the silly VM solution, but there are some caveats, which will be shared in the conclusion. A few months ago Microsoft announced the VNET Data Gateway for cloud sources and now the service is available for testing. The two main benefits are:
  • No need for a Virtual Machine which you need to maintain (but probably forget to)
  • No need for the On-premises Data Gateway which you need to update nearly every month
You will still need a Virtual Network (VNET) to connect this service to other Azure services and to allow the connection from Power BI or other services of the Power Platform. Within this VNET we need a subnet that is delegated to the Microsoft Power Platform.


1) Resource Provider Microsoft.PowerPlatform
The first step for adding the VNET gateway is to check within your Azure Subscription whether the Resource Provider Microsoft.PowerPlatform is already registered (probably not). By registering this Provider you will be able to connect the Subnet of step 2 to the Gateway of step 3.
  • Go to the Azure portal and log in as an owner of the Subscription.
  • Go to the Subscription overview page (the same subscription where your VNET is also located)
  • In the left menu you will find the option Resource Providers.
  • Search for Microsoft.PowerPlatform and check the Status column. If it says NotRegistered, select it and hit the Register button at the top.
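
As an alternative to the portal steps above, you can also check and register the provider with Azure PowerShell; a minimal sketch, assuming you are logged in to the subscription that contains the VNET:

# Check the current registration state of the provider
Get-AzResourceProvider -ProviderNamespace Microsoft.PowerPlatform | Select-Object ProviderNamespace, RegistrationState

# Register it when the state is NotRegistered (registration can take a few minutes)
Register-AzResourceProvider -ProviderNamespace Microsoft.PowerPlatform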
Register Resource Provider

2) Add Subnet to VNet
Now that we have the new Resource Provider we can add a Subnet to an existing VNET. The VNET is a simple default 'installation' with an address space of 10.240.134.0/23 (512 addresses 10.240.134.0 to 10.240.135.255). 
  • Go to your existing VNET
  • In the left menu click on Subnets to see the available subnets
  • Click on + Subnet (not + Gateway Subnet) to add a new Subnet
  • Give it a suitable name (gatewaysubnet is reserved/not allowed)
  • Choose a small Subnet address range; a /28 will give you 16 IP addresses of which 11 can be used to add gateways. IPv6 is not allowed at the moment.
  • The most important property for the VNET gateways is Subnet delegation. Make sure to set it to Microsoft.PowerPlatform/vnetaccesslinks.
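
For reference, the same subnet could also be created with Azure PowerShell; a sketch with placeholder names and the address range mentioned above, assuming the Az.Network module:

# Get the existing VNET
$vnet = Get-AzVirtualNetwork -Name "vnet-bi" -ResourceGroupName "RG_Network"

# Create the Power Platform delegation and add the new subnet to the VNET
$delegation = New-AzDelegation -Name "PowerPlatformDelegation" -ServiceName "Microsoft.PowerPlatform/vnetaccesslinks"
Add-AzVirtualNetworkSubnetConfig -Name "snet-powerplatform" -AddressPrefix "10.240.134.0/28" -VirtualNetwork $vnet -Delegation $delegation

# Save the changes
$vnet | Set-AzVirtualNetwork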
Add Subnet to VNET

3) Add VNET Gateway
For the final step, creating the gateway, we need to go to the Power Platform admin center, so this means it is not a service located in Azure. For this step you need to be at least an Azure Network Contributor (but Subscription Owner will work as well).
  • Log in to the Power Platform admin center
  • In the left menu click on Data (preview)
  • Then go to the tab Virtual Network data gateways to see existing gateways or to add a new one
  • Click on + New to add a new VNET Gateway
  • Select the Azure Subscription of your VNET / subnet. Subscriptions where step 1 was not performed will give an error when you select them: Please register "Microsoft.PowerPlatform" as a resource provider for this subscription to continue creating a virtual network data gateway.
  • Select the Resource Group of your VNET
  • Select the VNET
  • Select the Subnet of step 2
  • Give the new Gateway a suitable name
Add Virtual Network data gateway

You can also add other users to use this new gateway, but at this moment you can only add administrators and you won't see the Can use or the Can use + Share option. This is probably because it is still in Public Preview at the moment of writing.
Can use or the Can use + Share are missing

Conclusion
In this post you learned about the new Virtual Network Data Gateway for Power BI. The lack of maintenance is the big benefit of this new service, but there are also a few caveats besides it being a Preview service. Some of them will probably be solved when the service reaches General Availability.
  1. This new service only works for Power BI Premium workspaces. This is a really big deal since I don't want to make all my workspaces Premium.
  2. The price is still unknown. Maybe it will be a Premium feature, but then they have to solve the first issue.
  3. The performance to Azure Data Lake is very slow compared to an On-premises Data Gateway on an Azure VM (up to 6 times slower!). This is a bug they are working on. However, this brings up another issue: how can you tweak the performance of this new gateway? You cannot create a cluster with multiple gateways and you cannot change the cores/memory.
  4. The number of sources is still a bit low. Only Microsoft cloud services are supported (since you cannot install third-party drivers).
  5. On-prem sources are also not supported, which would be very useful to reduce the number of servers (and maintenance) in my on-prem network, since the VNET can also be connected via VPN to the on-prem network. However, Microsoft products like SQL Server should work since they use the same drivers as Azure SQL Database (not tested).
  6. Can use and Can use + Share are still missing, which means only gateway admins can use this service.
So altogether a very promising new service from Microsoft. Still a couple of issues, but hopefully they will all be fixed when it becomes Generally Available. The real deal breaker is issue 1, and solving issue 5 would really boost the success of this new service. For more detailed installation steps read the Docs.

Tuesday, 31 August 2021

Analyze Azure Data Factory logs - part 3: Power BI

Case
In a previous blog post we showed how to query the logs of Azure Data Factory within Azure Log Analytics, but how can you show that same data in a Power BI report?
Using KQL as source for Power BI

Solution

1) Export KQL query as M query
The previous step of writing that query is the most important step. Make sure you have a query, or better a couple of queries, that represent a dimensional model instead of just a simple table. This will make creating Power BI reports so much easier and faster.
  • Go to the query editor of the Log Analytics workspace
  • Then open an existing query (via Query explorer) or write a new one.
  • Once the query is successful, hit the Export option and choose "Export to Power BI (M query)".
  • This will result in downloading a text file with an M query in it. Analyze the generated M query and optionally adjust it to your own needs.
Export to Power BI (M query)

Note that you can also see your KQL query as a parameter of a REST API call and that, besides the M query, the file also contains some explanation for Power BI.

2) Blank Query Power BI
Now the last step is fairly easy, especially when your queries are already modeled as a dimensional model. If not, you will have to do some extra steps.
  • Open Power BI desktop and choose Get Data
  • In the Other section at the bottom you will see the Blank Query option
  • Select it and click Connect. A new Power Query Editor window will open
  • Now hit the Advanced Editor option in the Home ribbon.
  • A third Advanced Editor window will open. This is where you copy and paste the content of the downloaded file of step 1 and click on Done.
  • Give the query a name and click on Close & Apply in the Home ribbon.

Repeat this step for all queries that you created for your report and then you can start creating that report! Well, you might need to add some relationships between the tables for the best result.
Add the M query to Power BI

Conclusion
In this post we showed you how to use a KQL query in Power BI. Make sure most of the easy transformations are done within KQL; this saves you a lot of time in Power BI. And as mentioned before: model your queries like fact and dimension tables. This makes it so much easier to create the report.

Monday, 5 July 2021

Analyze Azure Data Factory logs - part 2: queries

Case
In a previous blog post about analyzing ADF logs in Azure Log Analytics we mentioned Kusto Query Language (KQL) as the language used for the standard dashboards, but also for the export to Power BI. How does KQL work and how does it compare to TSQL?
T-SQL vs KQL

Solution
In this episode of the ADF - Azure Log Analytics series we will show you the very basics of KQL and try to compare it to T-SQL for all of us DWH and BI developers. After you understand the basics you can, for example, edit the Microsoft ADF dashboard (template) in Log Analytics to see the more advanced queries that Microsoft used.

Before you start reading, make sure you installed Azure Data Factory Analytics as explained in the first blog post. Without that you cannot query the ADF log data. For these query examples we are using the following three ADF log tables: ADFActivityRun, ADFPipelineRun and ADFTriggerRun. Note that the T-SQL queries will not actually run; they are only used to explain how the KQL queries work.


1) Go to the KQL query editor
To start writing your first KQL query we need to go to the editor in Log Analytics.
  • Go to your Log Analytics Workspace via the Azure portal
  • Click on logs in the left menu
  • Close the query 'welcome window'
Query editor

On the left side of the query editor you see the available tables which you can query. On the bottom right you see the queries that you have executed before. Above the Query history you see the actual query editor. After you have written your first query you can hit the Save button to store it. With the Query explorer you can find all your saved queries.

2) Select
Selecting all columns from a table in KQL is done by only mentioning the table name. The query result will show most columns, but not all; which columns are shown in the result varies per table.
// Get all Pipeline runs
ADFPipelineRun 

// TSQL Equivalent
SELECT * FROM ADFPipelineRun  

However, you can use the Columns button in the result to (un)select certain (un)wanted columns; note that this is for the current execution only!
Select 'all' columns in KQL

If you want specific columns in a specific order then we need to use the | project option. Start the next line in the query editor with a pipe and the word project, followed by a comma-separated column list:
// Get specific Pipeline runs
ADFPipelineRun 
| project PipelineName, Category, Status, Start, End

// TSQL Equivalent
SELECT PipelineName, Category, Status, Start, End FROM ADFPipelineRun
SELECT specific columns in KQL

3) Where
The WHERE clause in KQL is done by using the | where option, quite similar to T-SQL. Note that by default a where is case-sensitive.
// WHERE case-sensitive
ADFPipelineRun 
| where PipelineName == "000_Master"

// TSQL Equivalent
SELECT * FROM ADFPipelineRun WHERE PipelineName = '000_Master'
To make the query case-insensitive you can replace the second = (equals sign) with a ~ (tilde sign).
// WHERE case-insensitive
ADFPipelineRun 
| where PipelineName =~ "000_MASTER"
A like in the where clause can be done with contains; no need for adding wildcards via %. Furthermore there are a lot more useful string operators available in the documentation.
// WHERE like
ADFPipelineRun 
| where PipelineName contains "Master"

// TSQL Equivalent
SELECT * FROM ADFPipelineRun WHERE PipelineName like '%Master%'
WHERE in KQL

4) Order by
The ORDER BY clause in KQL is done by using the | order by option, again quite similar to T-SQL. You can use the same asc and desc options as in T-SQL.
// ORDER BY
ADFPipelineRun 
| order by Start desc 

// TSQL Equivalent
SELECT * FROM ADFPipelineRun ORDER BY Start desc
KQL seems to be a bit easier with the placement of the order by in the query. You can also first order and then filter, but let's try to keep the same order as T-SQL. Another neat addition of order by is the use of 'nulls first' or 'nulls last' to handle null values.
// ORDER BY
ADFPipelineRun 
| order by Start desc nulls last
ORDER BY in KQL

5) Calculated Columns 
To create calculations or just simple aliases we need to use the | extend option. In this first example we are using a case statement to add a status order based on the status column. If you are using project then don't forget to add this new column to the project line.
// Create extra column
ADFPipelineRun 
| extend StatusOrder = case (
    Status == "Queued", "1",
    Status == "InProgress", "2",
    Status == "Failed", "3",
    Status == "Succeeded", "4",
    "-1")
| project PipelineName, Category, Status, StatusOrder, Start, End

// TSQL Equivalent (you can also use an alias with AS behind the END instead of in front of the CASE)
SELECT PipelineName
,      Category
,      Status
,      StatusOrder =  CASE Status  
         WHEN 'Queued' THEN 1
         WHEN 'InProgress' THEN 2
         WHEN 'Failed' THEN 3  
         WHEN 'Succeeded' THEN 4
         ELSE -1
       END
,      Start
,      End
FROM   ADFPipelineRun

Create new column with CASE statement

You can also use this to create an alias (= a copy) of a column, and if you have multiple extends then put a comma between each extend.
// Multiple extends and adding 'alias'
ADFPipelineRun 
| extend Count = 1, Region = Location

6) Group by
The GROUP BY clause in KQL is done by using the | summarize by option, also quite similar to T-SQL, but everything is done on a single line. The summarize operator has a lot of options compared to T-SQL.
// GROUP BY
ADFActivityRun 
| summarize sum(ActivityIterationCount) by PipelineName

// TSQL Equivalent
SELECT PipelineName, sum(ActivityIterationCount)
FROM   ADFActivityRun 
GROUP BY PipelineName
And if you don't want a boring table then you can always add the | render option to, for example, add a beautiful pie chart.
// GROUP BY
ADFActivityRun 
| summarize sum(ActivityIterationCount) by PipelineName
| render piechart
Rendering visuals with KQL

7) Expand JSON fields
You will notice that there are a few JSON columns in the ADF tables. In T-SQL you can use OPENJSON to extract data from those JSON columns, but KQL has an easier solution. 
JSON columns

It uses the extend and mv-expand options. It's a 'three-stage rocket', but very easy for simple JSON structures. The best feature is that it doesn't throw errors when the JSON of a particular record doesn't contain an expected JSON field; it just shows a null value.
// Extract JSON data
ADFPipelineRun
| extend    PredecessorsObject = parse_json(Predecessors)
| mv-expand PredecessorsObject 
| extend    InvokedByType = PredecessorsObject.InvokedByType
,           CallingPipelineRunId = PredecessorsObject.PipelineRunId
,           CallingPipelineName = PredecessorsObject.PipelineName
| project  RunId, CorrelationId, PipelineName, Status, Start, InvokedByType, CallingPipelineRunId,CallingPipelineName
| order by Start
  1. The first extend row in the query parses the JSON column to a JSON object with the parse_json(json) function (or its alias todynamic()).
  2. The row below that with the mv-expand operator expands the JSON object/array into multiple rows.
  3. The next row with another extend allows you to create new columns based on the fields in the expanded JSON object.
Extracting JSON message in KQL

Conclusion
In this post you, as an experienced T-SQL writer, learned the very basics of the Kusto Query Language (KQL) to extract information from the Azure Data Factory log data. It will probably take a little while to get used to the format of KQL queries, but after that you will notice that it has some very useful options to extract information from the log data.

In a follow up post we will use this knowledge to write a couple of KQL queries and export them to Power BI to create a report on your ADF log data. Of course making a dimensional model with those queries makes it much easier and faster to create some fancy reports.