Microsoft BI Tools

Saturday, 30 December 2017

Pause everything on your Azure playground

Case
A lot of IT companies have Azure subscriptions that are not primarily for customers, but are more playgrounds to create or try out stuff for clients. A lot of times you try out stuff like Azure Data Warehouse or Azure Analysis Services, services where you are not paying for usage but you are paying because the services are turned on.

When you are playing around with those services and forget to turn them off afterwards, it could get little costly especially when you have dozens of colleagues also trying out all the cool Azure stuff. How do you prevent those unnecessary high bills because of forgotten services?

Pause everything

Solution
You should of course make some agreements about being careful with pricey services, but you can support that with a 'simple' technical solution: run a PowerShell script in Azure Automation Runbook that pauses all often used services each night. An exception list takes care of specific machines or servers that should not be paused. For this example we will pause the following Azure parts:

Azure Virtual Machines (not classics)
Azure SQL Data Warehouses
Azure Analysis Services

This is not a complete list, but once you understand these three, it should be relatively easy to add more services. Still to hard? Add a comment to suggest more services.

1) Automation Account
First we need an Azure Automation Account to run the Runbook with PowerShell code. If you don't have one or want to create a new one, then search for Automation under Monitoring + Management and give it a suitable name like 'maintenance', then select your subscription, resource group and location. For this example I will choose West Europe since I'm from the Netherlands. Keep 'Create Azure Run As account' on Yes. We need it in the code. See step 3 for more details.

Azure Automation Account

2) Credentials
Next step is to create Credentials to run this runbook with. This works very similar to the Credentials in SQL Server Management Studio. Go to the Azure Automation Account and click on Credentials in the menu. Then click on Add New Credentials. You could just use your own Azure credentials, but the best options is to use a service account with a non-expiring password. Otherwise you need to change this regularly.

Create new credentials

3) Connections
This step is for your information only and to understand the code. Under Connections you will find a default connection named 'AzureRunAsConnection' that contains information about the Azure environment, like the tendant id and the subscription id. To prevent hardcoded connection details we will retrieve these fields in the PowerShell code.

Azure Connections

4) Variables
For the exception lists we will be using string variables to prevent hardcode machine and service names in de code. Go to Variables and add three new string variables: one for each type of machine or service we need to pause:

ExceptionListVM
ExceptionListDWH
ExceptionListAAS

Fill them with a semicolon separated list of names or, if you do not have any exception, with a semicolon (;) only.

Add string variables

5) Modules
The Azure Analysis Services methods (cmdlets) are in a separate PowerShell module which is not included by default. If you do not add this module you will get errors telling you that the method is not recognized. See below for more details.

The term 'Get-AzureRmAnalysisServicesServer' is not recognized
as the name of a cmdlet, function, script file, or operable program.

Go to the Modules page and check whether you see AzureRM.AnalysisServices in the list. If not then use the 'Browse gallery' button to add it, but first add AzureRM.Profile because the Analysis module will ask for it. Adding the modules could take a few minutes!

Add modules

6) Runbooks
Now it is time to add a new Azure Runbook for the PowerShell code. Click on Runbooks and then add a new runbook (There are also several example runbooks of which AzureAutomationTutorialScript could be useful as an example). Give your new Runbook a suitable name like 'PauseEverything' and choose PowerShell as type.

Add Azure Runbook

7) Edit Script
After clicking Create in the previous step the editor will be opened. When editing an existing Runbook you need to click on the Edit button to edit the code. You can copy and paste the code below to your editor. Study the green comments to understand the code. Also make sure to compare the variable names in the code to the once created in step 4 and change them if necessary.

Edit the PowerShell code

# PowerShell code
# Connect to a connection to get TenantId and SubscriptionId
$Connection = Get-AutomationConnection -Name "AzureRunAsConnection"
$TenantId = $Connection.TenantId
$SubscriptionId = $Connection.SubscriptionId
  
# Get the service principal credentials connected to the automation account. 
$null = $SPCredential = Get-AutomationPSCredential -Name "Administrator"

# Login to Azure ($null is to prevent output, since Out-Null doesn't work in Azure)
Write-Output "Login to Azure using automation account 'Administrator'."
$null = Login-AzureRmAccount -TenantId $TenantId -SubscriptionId $SubscriptionId -Credential $SPCredential
  
# Select the correct subscription
Write-Output "Selecting subscription '$($SubscriptionId)'."
$null = Select-AzureRmSubscription -SubscriptionID $SubscriptionId
  
# Get variable values and split them into arrays
$ExceptionListAAS = (Get-AutomationVariable -Name 'ExceptionListAAS') -split ";"
$ExceptionListVM = (Get-AutomationVariable -Name 'ExceptionListVM') -split ";"
$ExceptionListDWH = (Get-AutomationVariable -Name 'ExceptionListDWH') -split ";"


################################
# Pause AnalysisServicesServers
################################
Write-Output "Checking Analysis Services Servers"
# Get list of all AnalysisServicesServers that are turned on (ProvisioningState = Succeeded)
$AnalysisServicesServers = Get-AzureRmAnalysisServicesServer |
Where-Object {$_.ProvisioningState -eq "Succeeded" -and $ExceptionListAAS -notcontains $_.Name}
# Loop through all AnalysisServicesServers to pause them
foreach ($AnalysisServicesServer in $AnalysisServicesServers)
{
    Write-Output "- Pausing Analysis Services Server $($AnalysisServicesServer.Name)"
    $null = Suspend-AzureRmAnalysisServicesServer -Name $AnalysisServicesServer.Name
}


################################
# Pause Virtual Machines
################################
Write-Output "Checking Virtual Machines"
# Get list of all Azure Virtual Machines that are not deallocated (PowerState <> VM deallocated)
$VirtualMachines = Get-AzureRmVM -Status |
Where-Object {$_.PowerState -ne "VM deallocated" -and $ExceptionListVM -notcontains $_.Name}
# Loop through all Virtual Machines to pause them
foreach ($VirtualMachine in $VirtualMachines)
{
    Write-Output "- Deallocating Virtual Machine $($VirtualMachine.Name) "
    $null = Stop-AzureRmVM -ResourceGroupName $VirtualMachine.ResourceGroupName -Name $VirtualMachine.Name -Force 
}
# Note: Classic Virtual machines are excluded with this script (use Get-AzureVM and Stop-AzureVM)


################################
# Pause SQL Data Warehouses
################################
Write-Output "Checking SQL Data Warehouses"
# Get list of all Azure SQL Servers
$SqlServers = Get-AzureRmSqlServer
# Loop through all SQL Servers to check if they host a DWH
foreach ($SqlServer in $SqlServers)
{
    # Get list of all SQL Data Warehouses (Edition=DataWarehouse) that are turned on (Status = Online)
    $SqlDatabases = Get-AzureRmSqlDatabase -ServerName $SqlServer.ServerName -ResourceGroupName $SqlServer.ResourceGroupName |
    Where-Object {$_.Edition -eq 'DataWarehouse' -and $_.Status -eq 'Online' -and $ExceptionListDWH -notcontains $_.DatabaseName} 
    # Loop through all SQL Data Warehouses to pause them
    foreach ($SqlDatabase in $SqlDatabases)
    {
        Write-Output "- Pausing SQL Data Warehouse $($SqlDatabase.DatabaseName)"
        $null = Suspend-AzureRmSqlDatabase -DatabaseName $SqlDatabase.DatabaseName -ServerName $SqlServer.ServerName -ResourceGroupName $SqlDatabase.ResourceGroupName
    }
}

Write-Output "Done"

Note 1: This is a very basic script. No error handling has been added. Check the AzureAutomationTutorialScript for an example. Finetune it for you own needs.
Note 2: There are often two versions of an method like Get-AzureRmSqlDatabase and Get-AzureSqlDatabase. Always use the one with "Rm" in it (Resource Managed), because that one is for the new Azure portal. Without Rm is for the old/classic Azure portal.
Note 3: Because Azure Automation doesn't support Out-Null I used an other trick with the $null =. However the Write-Outputs are for testing purposes only. Nobody sees them when they are scheduled.
Note 4: The code for Data Warehouses first loops through the SQL Servers and then through all databases on that server filtering on edition 'DataWarehouse'.

7) Testing
You can use the Test Pane menu option in the editor to test your PowerShell scripts. When clicking on Run it will first Queue the script before Starting it. If nothing needs to be paused the script runs in about a minute, but pausing or deallocating items takes several minutes.

Testing the script in the Test Pane

8) Publish
When your script is ready, it is time to publish it. Above the editor click on the Publish button. Confirm overriding any previously published versions.

Publish the Runbook

9) Schedule
And now that we have a working and published Azure Runbook, we need to schedule it. Click on Schedule to create a new schedule for your runbook. For this pause everything script I created a schedule that runs every day on 2:00AM (02:00). This gives late working colleagues more than enough time to play with all the Azure stuff before there service will be paused.

Add Schedule

Summary
In this post you saw how you can pause all expensive services in an playground environment. If a colleague don't wants to pause his/her service then we can use the variables to skip the particular service. As mentioned before: this is not a complete list. Feel free to suggest more services, that can be paused, in the comments.

This script requires someone to maintain the exception list variables. I have created an alternative script that uses tags instead of centralized Runbook variables. This allows you to add a certain tag to your service to avoid the pause which gives colleagues more control.

Update: code for classic Virtual Machines (for separate Runbook)

# PowerShell code
# Connect to a connection to get TenantId and SubscriptionId
$Connection = Get-AutomationConnection -Name "AzureRunAsConnection"
$TenantId = $Connection.TenantId
$SubscriptionId = $Connection.SubscriptionId
  
# Get the service principal credentials connected to the automation account. 
$null = $SPCredential = Get-AutomationPSCredential -Name "Administrator"

# Login to Azure ($null is to prevent output, since Out-Null doesn't work in Azure)
Write-Output "Login to Azure using automation account 'Administrator'."
$null = Add-AzureAccount -Credential $SPCredential -TenantId $TenantId

# Select the correct subscription (method without Rm)
Write-Output "Selecting subscription '$($SubscriptionId)'."
$null = Select-AzureSubscription -SubscriptionID $SubscriptionId

# Get variable values and split them into arrays
$ExceptionListVM = (Get-AutomationVariable -Name 'ExceptionListVM') -split ";"


#################################
# Pause Classic Virtual Machines
#################################
Write-Output "Checking Classic Virtual Machines"

# Get list of all Azure Virtual Machines that are not deallocated (Status <> StoppedDeallocated)
$VirtualMachines = Get-AzureVM  |
Where-Object {$_.Status -ne "StoppedDeallocated" -and $ExceptionListVM -notcontains $_.Name}

# Loop through all Virtual Machines to pause them
foreach ($VirtualMachine in $VirtualMachines)
{
    Write-Output "- Deallocating Classic Virtual Machine $($VirtualMachine.Name) ($($VirtualMachine.ServiceName))"
    $null = Stop-AzureVM -ServiceName $VirtualMachine.ServiceName -Name $VirtualMachine.Name -Force 
}


Write-Output "Done"

Note 1: Login method name is slightly different
Note 2: Other methods use the version without Rm in the name: Stop-AzureRmVM => Stop-AzureVM

Monday, 20 November 2017

CTAS - The fastest way to load data on a MPP system

Case

What is the fastest way to using Extract Load Transform (ELT) on a Massively Parallel Processing (MPP) system like on a Azure SQL Datawarehouse.

CTAS: The way to go!

Solution

When Extracting, Loading and Transforming data on a MPP system like Azure SQL Data Warehouse there are several ways to do just that. Currently Create Table As Select (CTAS) is by far the fastest.

Whats is a MPP system?
MPP stands for Massively Parallel Processing and is a database system that uses a control node to distribute the data across several seperate Compute Nodes. This makes it possible to load very large amounts of data in a fast way. All this is done automatically and for the end-user it appears to be one database. When you use traditional ETL software like SSIS to extract, load and transform data you make use of the memory that is allocated to the SSIS system and thus take the data out of the database, which is far less efficient.

How a MPP looks like under the hood

1) What is CTAS?
CTAS stands for Create Table As Select. As the name suggests the operation creates a new table using a select statement and is super fast. CTAS is fast because the data stays on the MPP and thus makes use of all the capabilities of a MPP system.

2) How to use CTAS
When you create a CTAS statement, you can choose to set two options, namely;

Distribution options - Setting this option is mandatory
Table options - Setting this option is optional. When not supplied a Clustered ColumnStore Index is used.

Distribution options

When you create the CTAS command you can choose between HASH, ROUND ROBIN or Replicate as distribution option.

HASH is used to divide the data in equal sized sections and distribute them to the nodes using a distribution column. When doing this you try to evenly distribute the data on the available nodes. Choosing the correct distribution key here is paramount otherwise you can get skew on the distribution between the nodes. To put it simply; when you have 4 nodes and the data is not distributed evenly then it hurts the data retrieval speed. For example you want to use this option when creating Fact tables that are large (or very large Dimension tables). You can check for skew using the command DBCC PDW_SHOWSPACEUSED

--Creating a table using CTAS and Hash
CREATE TABLE dbo.CTASHash
WITH
(
DISTRIBUTION = HASH(FactCallCenterID)
)
AS
SELECT 
   FactCallCenterID
 , DateKey
 , WageType
 , Calls
 , AutomaticResponses
 , Orders
 , IssuesRaised
 , AverageTimePerIssue
 , ServiceGrade
FROM dbo.FactCallCenter

ROUND ROBIN is used when you dont want to choose a distribution column but are fine when the data is distributed randomly across the nodes. This is also the default option when you do not define a distribution option. This option is used for example for Staging tables. It is advised to always explicitly define the ROUND ROBIN in the CTAS statement.

--Creating a table using CTAS and Round Robin
CREATE TABLE dbo.CTASRobin
WITH
(
 DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT 
   AccountKey
 , ParentAccountKey
 , AccountCodeAlternateKey
 , ParentAccountCodeAlternateKey
 , ValueType
 , CustomMemberOptions
FROM dbo.DimAccount

REPLICATE is used to put the data on every node available so it can be used for quick access. This is useful when creating regular sized Dimensions. When the data is available on each and every node then it safes on moving the data between nodes when using joins. The full table will be available on all nodes.

--Creating a table using CTAS and Replicate
CREATE TABLE dbo.CTASReplicate
WITH
(
 DISTRIBUTION = REPLICATE
)
AS
SELECT 
   ScenarioKey
 , ScenarioName
FROM dbo.DimScenario

Table options

Besides choosing the distribution option, you can optionally also use the following table options;

CLUSTERED COLUMNSTORE INDEX is a table option that is one of the most efficient ways to store data in Azure DWH. It improves data compression and query performance for data warehousing workloads and outperform Clustered Index and Heap tables. That makes them the best choice for large tables. Using a Clustered Columnstore Index is considered to be the best choice when you are unsure which table option you should best use. It is also the default table option when you only use one of the distribution options.

--Creating a table using CTAS and Clustered Columnstore Index
CREATE TABLE dbo.CTASRobinCluster
WITH
(
   DISTRIBUTION = ROUND_ROBIN
 , CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT 
   AccountKey
 , ParentAccountKey
 , AccountCodeAlternateKey
 , ParentAccountCodeAlternateKey
 , ValueType
 , CustomMemberOptions
FROM dbo.DimAccount

HEAP is a table option that is usefull when temporarily loading data on Azure DWH and is the fastest way to load your data into a table. It is not advisable to use a heap table when the data in the table is frequently grouped together. That is because the data must be sorted before it can be grouped.

--Creating a table using CTAS and Heap
CREATE TABLE dbo.CTASRobinHEAP
WITH
(
   DISTRIBUTION = ROUND_ROBIN
 , HEAP)
AS
SELECT 
   AccountKey
 , ParentAccountKey
 , AccountCodeAlternateKey
 , ParentAccountCodeAlternateKey
 , ValueType
 , CustomMemberOptions
FROM dbo.DimAccount

CLUSTERED INDEX is a table option that you use when you want to sort and store the data rows in the table based on a specific column. The disadvantage of using a Clusted Index table is that only queries that use the defined Clustered Index column benefit from the index. This can be somewhat fixed by using additional Nonclustered indices, but that would increase use of space and processing time.

--Creating a table using CTAS and Clustered Index
CREATE TABLE dbo.CTASRobinClusteredIndex
WITH
(
 DISTRIBUTION = ROUND_ROBIN,
 CLUSTERED INDEX (AccountKey)
)
AS
SELECT 
   AccountKey
 , ParentAccountKey
 , AccountCodeAlternateKey
 , ParentAccountCodeAlternateKey
 , ValueType
 , CustomMemberOptions
FROM dbo.DimAccount

PARTITION is the table option that you use when you want to determine how the rows are grouped within each distribution. To use it you need to choose a partition column name. This column can be of any data type. You use partitioning to improve query performance and data maintenance and it avoids transaction logging. Using partitioning during the load proces can also substantially improve performance.

--Creating a table using CTAS and Partition
CREATE TABLE dbo.CTASRobinPartition
WITH
(
 DISTRIBUTION = HASH(ProductKey),
 CLUSTERED COLUMNSTORE INDEX,
 PARTITION
    (
        OrderDateKey RANGE RIGHT FOR VALUES
        (
        20000101,20010101,20020101,20030101,20040101,20050101,20060101,20070101,20080101,20090101,
        20100101,20110101,20120101,20130101,20140101,20150101,20160101,20170101,20180101,20190101,
        20200101,20210101,20220101,20230101,20240101,20250101,20260101,20270101,20280101,20290101
        )
    )
)
AS
SELECT 
 ProductKey
 ,OrderDateKey
 ,DueDateKey
 ,ShipDateKey
 ,SalesAmount
 ,TaxAmt
FROM dbo.FactInternetSales

3) Advantages of CTAS
With CTAS you are able to create and recreate tables using a specific distribution type and its very fast. If you have created HEAP tables and want to see if a different distribution type is a better option, then you can simply recreate the table using CTAS with the desired Distribution option. Simply create a copy of the table with a different name, drop the old table and rename the copy table to the original table name.

--Creating a table using CTAS and Partition
CREATE TABLE dbo.CTASRenameTemp
WITH
(
 DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT 
 *
FROM dbo.DimAccount;

DROP TABLE dbo.CTASRename;

RENAME OBJECT dbo.CTASRenameTempTO CTASRename;

3) CTAS tips
When you create a copy a table using CTAS and do not manipulate any of the columns, all the settings of the columns are left intact. For example the datatype and nullability. When you (re)create a column in the table then you explicitly have to cast the table and optionally use the ISNULL() function to set the nullability of the column. For the latter if you do not use the ISNULL() function then the column automatically is created allowing NULL. In below example the first column allows NULL's and the second doesnt.

--CTAS Tips
CREATE TABLE dbo.CTASCasting
WITH
(
 DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT 
   CAST(ValueType AS VARCHAR(100)) AS ValueTypeNull
 , ISNULL(CAST(ValueType AS VARCHAR(100)),'Do not allow NULL') AS ValueTypeNotNull
FROM dbo.DimAccount;

Summary

With CTAS you can quickly create and recreate tables without having to build complex ETL processes. It is fast, flexible and easy to use and it gives you the option to build different styles of tables that are best used in different scenario's like fast loading or fast data retrieving. And if you want to test the table with other options then you simply recreate it with the data.

Thursday, 16 November 2017

Cognitive functions U-SQL: emotion, age & gender

Case
U-SQL has cognitive capabilities to analyse pictures of persons to detect age, gender and emotions. How do they work and do I need Azure Cognitive Service?

U-SQL Cognitive Capabilities

Solution
Good news is that you only need Azure Data Lake (Analytics and Store) with a U-SQL job. Downside is that U-SQL does not yet have the full functionality of Azure Cognitive Services, but all the basics are available. In a previous blog post we showed the basics of the cognitive capabilities in U-SQL and an example of tagging images to add descriptive labels to it. If you never used U-SQL before then first read that post. This follow-up post continues with two new examples. Detecting emotions and detecting age & gender .

Starting point
The starting point of this blog post is an Azure Data Lake Store (ADLS) with a collection of 'random' pictures of humans. We have a folder called 'faces' that contains random images which we wil use for these next two examples.

Test faces

1) Emotions Script
The emotion script scans the pictures for faces and then tries to determine the emotion of each face (anger, contempt, disgust, fear, happiness, neutral, sadness, surprise). For each face it shows where it was located in the picture and then shows its emotion and the confidence rate for that emotion.

Me a few weeks ago at a party

Referencing assemblies
For emotion scanning we need one extra reference called "ImageEmotion".

// Needed for image extraction and emotions
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageEmotion;

Extract image files
This code, to extract image files from an ADLS container, is exactly the same as in the previous examples .

// Get the image data from ADLS container
@images =
    EXTRACT     FileName string, 
                ImgData byte[]
    FROM        @"/faces/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

Transform data
Scanning the images for faces and their emotion is done by cross joining the images rowset to the EmotionApplier method. The column names, datatypes and column order are fixed, but you can add aliases for different column names or change the order in the SELECT part of the query.

The query returns one record per face on the image. Besides the emotion you also get a confidence rate, the number of faces, the face number and the position on the image.

// Query detects emotion and the confidence
// If there are multiple faces it creates
// one record for each face. It also show
// the position of the face on the picture.
@emotions =
    SELECT FileName.ToLower() AS FileName,
        Details.NumFaces,
        Details.FaceIndex,
        Details.RectX,
        Details.RectY,
        Details.Width,
        Details.Height,
        Details.Emotion,
        Details.Confidence
    FROM @images 
    CROSS APPLY
        USING new Cognition.Vision.EmotionApplier() AS Details(
            NumFaces int, 
            FaceIndex int, 
            RectX float,
            RectY float,
            Width float,
            Height float, 
            Emotion string, 
            Confidence float);

Output data
This is the same code as in the previous examples to output the detected emotions to a file in an ADLS container.

// Output the emotions rowset to a CSV file
// located in the Azure Data Lake Store
OUTPUT @emotions
    TO "/faces/emotions.csv"
    ORDER BY    FileName
    USING Outputters.Csv(outputHeader: true);

Download the complete script here.

The result

Now the emotion script is ready to run. Click on the submit button and wait for the job to finish. This could take a few moments! Then browse to the ADLS folder and preview the file to see the result.

The result with in red the happy man from above

2) Age/gender Script
The age/gender script scans the pictures for faces and then tries to determine the age en gender of each face. It is very similar to the emotion script.

Me at 43

Referencing assemblies
For age and gender scanning we need one extra reference called "FaceSdk".

// Needed for image extraction and age/gender
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;

Extract image files
Again the same code as in the previous examples to extract image files from an ADLS container.

// Get the image data from ADLS container
@images =
    EXTRACT 
        FileName string, 
        ImgData byte[]
    FROM @"/faces/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

Transform data
Scanning the images for age and gender and their emotion is done by cross joining the images rowset to the EmotionApplier method. The columnnames, datatypes and order are fixed, but you can add aliases for different columnnames.

The query returns one record per face on the image. Besides the age and gender you also get the number of faces, the face number and the position on the image.

// Query detects age and gender
// If there are multiple faces it creates
// one record for each face. It also show
// the position of the face on the picture.
@faces_analyzed =
    SELECT FileName.ToLower() AS FileName,
        Details.NumFaces,
        Details.FaceIndex,
        Details.RectX, Details.RectY, Details.Width, Details.Height,
        Details.FaceAge,
        Details.FaceGender
    FROM @images
    CROSS APPLY
        USING new Cognition.Vision.FaceDetectionApplier() AS Details(
            NumFaces int, 
            FaceIndex int, 
            RectX float, RectY float, Width float, Height float, 
            FaceAge int, 
            FaceGender string);

Output data
Outputting the data to ADLS uses the same code as in the previous examples.

// Output the gender and age rowset to a CSV file
// located in the Azure Data Lake Store
OUTPUT @faces_analyzed
    TO "/faces/agegender.csv"
    USING Outputters.Csv(outputHeader: true);

Download the complete script here.

The result
Now the age and gender script is ready to run. Click on the submit button and wait for the job to finish. This could take a few moments! Then browse to the ADLS folder and preview the file to see the result.

The result with my photo in red

Summary
This post showed you how to use U-SQL to detect emotion, age and gender from pictures. The next step could be to join these examples in one big script. When you want to try that, keep in mind that the ON clause uses two = instead of one (C# instead of TSQL): ON a.FileName == e.FileName. If you want to try these scripts your self, then you can only do that in the Azure portal. The U-SQL projects for Visual Studio do not yet support these extensions.

As said before the functionality in U-SQL is not yet the same as in Azure Cognitive Services which has much more options (and there my age was estimated at 39 with the same picture). Hopefully this will change, but for now the basics are working. Keep an eye on the Data Lake topic page where we will post new examples when more functionality is available.

Cognitive functions U-SQL: image tagging

Case
U-SQL has cognitive capabilities to analyse images. How do they work? Do I need Azure Cognitive-services?

U-SQL Cognitive Capabilities

Solution
Good news is that you only need Azure Data Lake (Analytics and Store) with a U-SQL job. Downside is that U-SQL does not yet have the full functionality of Azure Cognitive Services, but all the basics are available. This blog post starts with a very simple image extraction script to explain the basics of the U-SQL cognitive functions. In the second example we will tag images to add descriptive labels to them.

In a second post we will also show how to detect faces on images and extract emotion, gender and age from them. The base of these scripts are all very similar.

Starting point
The starting point of this blog post is an Azure Data Lake Store (ADLS) with a collection of 'random' images. We have a folder called 'objects' that contains random object images which we wil use for these first two scripts.

The content of ADLS container with random google image pictures

Create ADLA environment
To start we need to create a new Azure Data Lake Analytics (ADLA) environment and connect it to the existing ADLS with the image collection. Go to the Azure portal and click on New in the top left corner of the dashboard and locate ADLA under "Data + Analytics". Supply the basic stuff like name, subscription, resource group and location. One of the last steps is selecting the ADLS (or create a new one). Unless you have a good reason to deviate, it is wise to use the same location for ADLS and ADLA to prevent unnecessary data trafic around the world which could make your queries slower and therefore costing you extra money.

Creating new ADLA and connect it to ADLS

Install U-SQL Extensions
To make use of the cognitive functions in U-SQL, we first need to install the extensions. Go to Sample Scripts in the menu of ADLA and then click on Install U-SQL Extensions in the header. This assembly installation takes a few minutes, but you only have to do this once per ADLA.

Install U-SQL extensions

You can check the internal database in the Data Explorer to see which assemblies are installed. The Data Explorer button can be found on the ADLA overview page in the header.

Check which assemblies are installed

A) Basic script
Let's start with a very basic example: Extracting image files from an ADLS container and create a CSV file with all filenames in it.

1) Create new job
On the ADLA overview page click on +New Job and then give it a suitable name before we start coding.

Create new U-SQL job

2) Referencing assemblies
The cognitive image scripts in U-SQL always start with adding references. For image extraction we need to add a reference to "ImageCommon".

// Needed for image extraction
REFERENCE ASSEMBLY ImageCommon;

3) Extract image files
Next step is to extract the actual files from the ADLS container and store them in a rowset variable called @images. The ImageExtractor method can only get the filename and the actual bytes of the file. The order and datatype of these columns are fixed, but you can use different column names.

It looks a bit like a T-SQL SELECT statement, but because we are getting unstructured data it starts with EXTRACT instead of SELECT and we need to specify the data type. The FROM does not get the data from a table, but from the ADLS container called "objects" and the construction with {FileName}.jpg is a wildcard to only get JPG images from that container.

// Get the image data from ADLS container
@images =
    EXTRACT     FileName string,
                ImageData byte[]
    FROM        @"/objects/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

4) Transform data
For our CSV with filenames we only want to extract the filename column from the rowset variable called @images. This is done with a very simple SELECT query on the rowset variable from the previous step to extract the required data.

// Create a list of filenames
@result = 
    SELECT      FileName
    FROM        @images;

You can add an ORDER BY clause, but it requires to add FETCH to specify the number of rows that you want to select and sort. By default the ORDER BY is case sensitive (just like C#). You can overcome this by adding .ToLower() after the column name.

// Create a orderd list of filenames
// Note 1: ORDER BY requires the FETCH option to supply the nummer of rows
// Note 2: ORDER BY is case sensitive. Workaround: add .ToLower() 
// Note 3: ORDER BY can be moved to OUTPUT section (below TO)
@result = 
    SELECT      FileName
    FROM        @images
    ORDER BY    FileName.ToLower() 
    FETCH       10 ROWS;

5) Output data
Last step is to save the data in a CSV file in an ADLS container. In this example we are outputting the rowset variable @result that was created in the previous step. The outputter.csv has many options to format your CSV file, but they are all optional.

// Output the rowset to a CSV file located in the Azure Data Lake Store
OUTPUT @result
    TO "/objects/filenamelist.csv"
    USING Outputters.Csv(outputHeader: true, quoting: false);

Instead of a hardcoded path in the OUTPUT section you could also use a variable to move the hardcoded part to the top of your script.

// Declare where the result should be stored
DECLARE @outputpath string = "/objects/filenamelist.csv";

// Output the rowset to a CSV file located in the Azure Data Lake Store with variable
OUTPUT @result
    TO @outputpath
    USING Outputters.Csv();

There is an alternative place for the ORDER BY. You can also add it in the OUTPUT section right below the TO clause. It does not allow the FETCH option, which is a good thing, but it also does not allow the .ToLower() workaround (causing a case sensitive ordering). You could solve that by lowering it in the @result rowset instead.

// Create a list of filenames (lowercase)
@result = 
    SELECT      FileName.ToLower() AS FileName
    FROM        @images;

// Output the rowset to a CSV file located in the Azure Data Lake Store
// ORDERED BY filename descending.
OUTPUT @result
    TO "/objects/filenamelist.csv"
    ORDER BY    FileName DESC
    USING Outputters.Csv(outputHeader: true);

Download the complete script here.

6) Run Job
Now the script is ready to run. To improve the performance we increase the AUs a little bit, but this increases the costs. In a later post the optimal settings will be explained. Then click on the submit button and wait for the job to finish. This could take a few moments!

Running the job (not the actual speed)

7) The result
When the job has finished you can preview the result file in ADLS. Use the Data Explorer to browse to the folder and then preview the generated CSV file.

Preview result in Data Explorer

B) Tagging script
Image tagging means that it will scan the images and add descriptive words to it including a probability rate to show you how certain it is about that particular word. If you have a picture of someone cycling in the mountains then it will add words like bicycle, mountain, outdoor, person, sky.

Example

Referencing assemblies
For image tagging we need one extra reference called "ImageTagging".

// Needed for image extraction and tagging
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

Extract image files
This is the same code as in the previous example to extract image files from an ADLS container.

// Get the image data from ADLS container
@images =
    EXTRACT     FileName string, 
                ImgData byte[]
    FROM        @"/objects/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

Transform data
Tagging the images is a two step action where it first adds (zero, one or) multiple tags and the probability in value pairs. The second step is to convert all those value pairs to a string which we can export. It also shows the number of tags added.

// Process the images and add multiple tag pairs (tag and probability rate)
// NumObjects contains the number of tag pairs added to the image
@tags =
    PROCESS  @images 
    PRODUCE  FileName,
             NumObjects int,
             Tags SQL.MAP<string, float?>
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

// We need to convert the tagpairs to a string which we can export
// The string will look like: bicycle:0.9998484;outdoor:0.9164549;transport:0.7914466
@tags_serialized =
    SELECT  FileName.ToLower() AS FileName,
    NumObjects AS TagsCount,
    String.Join(",", Tags.Select(x => String.Format("{0}:{1}", x.Key, x.Value))) AS TagsString
    FROM  @tags;

Output data
This is the same code as in the previous example to output the filename and tags to a file in an ADLS container. Only the variablename and filename did change.

// Output the rowset to a CSV file located in the Azure Data Lake Store
OUTPUT @tags_serialized
    TO "/objects/tagging.csv"
    ORDER BY    FileName
    USING Outputters.Csv(outputHeader: true);

Download the complete script here.

The result
Now the script is ready to run. Click on the submit button and wait for the job to finish. Again, this could take a few moments! Then browse to the ADLS folder and preview the file to see the result.

The result with in red the cyclist from above

Note:
The tagging in Azure Cognitive Services - Computer vision API has some additional options, but we will show that in a future post.

Summary
In this post you saw how to extract images from ADLS and process them with U-SQL in ADLA. We also showed how tagging of images works and in the next post we will handle the scanning of faces for emotions, gender and age. If you want to try these scripts your self, then you can only do that in the Azure portal. The U-SQL projects for Visual Studio do not yet support these extensions.

Monday, 23 October 2017

Use PolyBase to read Data Lake in Azure SQL DW

Case
I have a file in an Azure Data Lake Store (ADLS) folder which I want to use in my Azure SQL Data Warehouse. In a previous blog post you used PolyBase to get the data from an Azure Blob Storage container via its access keys. How can I use PolyBase to get the data from ADLS and push the content of that file to Azure SQL DW?

Azure SQL Data Warehouse - PolyBase on ADLS

Solution
In the previous blog post we showed how to read that file from an Azure Blob Storage container via its access keys using PolyBase. However ADLS does not work with those keys, but uses the Azure Active Directory to provide access. To get authorization via Azure Active Directory we need to register a 'Web app / API' application in Azure Active Directory that does the authorization for us. That sounds very difficult and the documentation on MSDN is not very helpful, but in the end it was quite easy.

a) Starting point
The starting point of this blog post is a file in an ADLS folder called 'mySubFolder'. The file was created in a previous blog post about U-SQL that can quickly process large amounts of data files in Azure Data Lake Analytics (ADLA). The name of our ADLS is 'bitools'.

Starting point: CSV file in ADLS

The content of the CSV file

a1) App registrations
Go to the Azure portal and search for Azure Active Directory in the search box located in the header. This will bring you to the Azure Active Directory from your subscription. Then click on App registrations in the menu. It will show a list of all existing registrations. Next step is to click on New application registration to create a new registration for our data lake.

New application registration

a2) New application registration
Enter a new descriptive name like 'Data Lake bitools' so you will know where it is used for. Choose 'Web app / API' as Application type and then you need to enter a URL. Since we are not using the Sign-on URL property (we use the Azure sign-on), you can just enter any url like 'https://microsoft-bitools.blogspot.com'. When complete click on the Create button.

New application registration

a3) Edit application registration - Application ID
Now search your newly created Application registration to get its Application ID. Copy that to a notepad (we need it later on). You can also edit additional properties like giving it a custom logo to make it more recognizable if you have an extensive list of app registrations.

Copy Application ID

a4) Edit application registration - Keys
Continue editing and click on Keys in the menu to create a new access key. Give it a suitable name and expirationperiod. After clicking Save make sure to copy the generated key to the same notepad as before since you can only get it once! If you lose it you have to delete and recreate it.

Create new key

a5) Active Directory ID
Now go back to your Azure Active Directory to copy the Directory ID. You can find it when you click on Properties in the menu. Copy this ID to the same notepad that now should contain three values (ApplicationID, generated key and DirectoryID).

Get Directory ID

a6) Setting access root folder
Go to your ADLS and click on Data Explorer. You are now in the root of your ADLS. Click on Access and then on Add to assign new permissions. Search for your Registered Application called 'bitools'. Then select it and click on the Select button. In the root folder we only need Execute permissions. 'Add to' should stay on 'This folder' and 'Add as' should stay on 'An access permission entry'. Click on the Ok button to confirm.

Setting permissions on root folder

If you forget to give Execute permissions to the root folder you will get an error when adding an external table later on:
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
[dbd91a77-1b0a-4f11-9710-d7c1b6b05268][2017-10-21T12:26:47.7748687-07:00]: Error [GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
[dbd91a77-1b0a-4f11-9710-d7c1b6b05268][2017-10-21T12:26:47.7748687-07:00]] occurred while accessing external file.'

a7) Setting access sub folder

Now we have to repeat this for our subfolder called 'mySubFolder'. Click on the folder and you should see the source file. Click on Access and then on Add to assign new permissions. Search for your Registered Application called 'bitools'. Then select it and click on the Select button. In this sub folder we need Read and Execute permissions. 'Add to' should be changed to on 'This folder and all children' and 'Add as' should stay on 'An access permission entry'. Click on the Ok button to confirm.

Setting permissions on sub folder

An alternative could be to give the bitools app read and execute rights on the root including all children. That saves you one step, but is less secure if you use your Data Lake for multiple purposes.

b) SSMS
Now its time to start with the actual PolyBase code, but before we start make sure your Azure SQL Data Warehouse is started and use SQL Server Management Studio (SSMS) to connect to your Data Warehouse. Notice that the icon of a SQL DW is different than SQL DB.

Icon SQL DW vs SQL DB

b1) Master key

In the next step we will use a credential that points to the registered application. To encrypt that credential, we first need to create a master key in our Azure SQL Data Warehouse, but only if you do not already have one. You can check that in the table sys.symmetric_keys. If a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Else we need to create one. For Azure SQL Data Warehouse a masterkey password is optional. For this example we will not use the password.

--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
BEGIN
    PRINT 'Master Key already exists'
END

b2) Credentials
Next step is to create a credential which will be used to access the subfolder in ADLS. For this you need the ID's and key from the notepad. The IDENTITY has the following format:
[ApplicationID]@https://login.windows.net/[DirectoryID]/oauth2/token
Replace ApplicationID (including the square brackets) with the ID from step a3 and DirectoryID (including the square brackets) with the ID from a5. The SECRET should be filled with the key from step a4. After setting the correct ID's and key, execute the following code:

--Credential
CREATE DATABASE SCOPED CREDENTIAL bitools_user
WITH
    IDENTITY = 'aaf0ab52-560e-40b1-b4df-caac1f0e5376@https://login.windows.net/3xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxb/oauth2/token',
    SECRET = '6LUnE4shZ4p1jUhj7/fkLH03yfbSxi2WRWre9c0yVTs=';
;

Tip:
Give the credential a descriptive name so that you know where it is used for. You can find all credentials in the table sys.database_credentials:

--Find all credential
SELECT * FROM sys.database_credentials

b3) External data source
With the credential from the previous step we will create an External data source that points to the ADLS folder where your file is located. Execute the code below where:

TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
LOCATION = the connection string to ADLS (replace 'bitools' with the name of your own ADLS name).
CREDENTIAL = the name of the credentials created in the previous step.

--Create External Data Source
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
TYPE = HADOOP,
LOCATION = 'adl://bitools.azuredatalakestore.net',
CREDENTIAL = bitools_user
);

Tip:
Give the external source a descriptive name so that you know where it is used for. You can find all external data sources in the table sys.external_data_sources:

--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.

b4) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.

--Create External Data Source
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Tip:
Give the format a descriptive name so that you know where it is used for. You can find all external file formats in the table sys.external_file_formats:

--Find all external file formats
SELECT * FROM sys.external_file_formats

b5) External Table
The last step before we can start quering, is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the datasource from step 3 and the file format from step 4.

--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/mySubFolder/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureDataLakeStore, -- from step 3
    FILE_FORMAT=TextFile            -- from step 4
);

Note:
PolyBase does not like columnname headers. It will handle it like a regular data row an throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However this only works when the datatype of the header is different than the datatypes of the actual rows. Else you have to filter the header row in a subsequent step.

--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata5 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/mySubFolder/bitools_sample_data_AveragePerDayPerBuilding.csv',
    DATA_SOURCE=AzureDataLakeStore,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);

You can find all external tables in the table sys.external_tables.

--Find all external tables
SELECT * FROM sys.external_tables

However you can also find the External Table (/the External Data Source/the External File Format) in the Object Explorer of SSMS.

SSMS Object Explorer

b6) Query external table
Now you can query the external table like any other regular table. However the table is read-only so you can not delete, update or insert records. If you update the source file then the data in this external table also changes instantly because the file is used to get the data.

--Testing
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;

Quering an external table

b7) What is next?
Most likely you will be using a CTAS query (Create Table As Select) to copy and transform the data to an other table since this is the fasted/preferred way in SQL DW. In a subsequent post we will explain more about CTAS, but here is how a CTAS query looks like.

--CTAS
CREATE TABLE [dbo].[Buildings]
WITH
(
    DISTRIBUTION = ROUND_ROBIN
,   CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
FROM    [dbo].[sensordata]
;

In some cases you could also use an SELECT INTO query as an alternative for CTAS.

Summary
In this post you saw how easy it was to read a file from the Azure Data Lake Store and use it as a table in Azure SQL Data Warehouse. Although it did required some extra steps compared to PolyBase on an Azure Blob Storage. Jhon Masschelein (B|L|T) has a very helpful post about this matter.

In an other post we will explain the basic usage of the CTAS query which is the preferred way to handle large sets of data in Azure SQL DW and in its on-premises precursor APS (a.k.a. PDW).