Friday 29 March 2019

Azure - Reuse Power BI dataflows

Microsoft recently introduced Power BI dataflows for self-service ETL or data preparation. By default, dataflows store their metadata and results in an Azure Data Lake Storage (ADLS) account that is only visible to Power BI. However, you can also bring your own storage account. How does that work and why would you want that?

Bring Your Own Storage in preview

As we mentioned in our previous post, Power BI dataflows can be seen as a self-service data preparation tool. It is easily accessible, so that other people in your organization (besides IT) can get started with transforming and maintaining data. Nowadays there are often several business analysts in a department who each maintain Excel files. With Power BI dataflows you can centralize this process and contribute to the ideal of "one version of the truth".

When you bring your own ADLS Gen2 storage account (StorageV2) for Power BI dataflows, other services like Azure Data Factory or Azure Databricks can use that same data. This makes it possible to do lightweight data preparation with a user-friendly tool for corporate data warehousing or data science. Those services could even write data back to the data lake to create 'external' dataflows that are not maintained in Power BI. However, Power BI dataflows do not (yet?) replace SSIS, Azure Data Factory or any other ETL tool.

Configure and connect to your own storage account
First you have to create an ADLS Gen2 storage account. Make sure the storage account is in the same region as the Power BI tenant. The Microsoft documentation explains step by step how to create such an account.

Next, execute the following steps to make a connection with your own storage account. Note that the last step is a one-time action that cannot be changed afterwards.

You need the following permissions to make the connection:
  • Role "Owner" in the Azure Subscription (service administrator / classic administrator) to add an ADLS Gen2 storage account. This is also required to give the Power BI Service access to the storage account and Blob Container. 
  • Global Administrator in O365 or Azure Active Directory to connect your ADLS Gen2 storage account with dataflows. This has to be done in the Power BI admin portal. The role Power BI Service Administrator is not sufficient to perform those actions.

Power BI Service - Successfully connected to your own ADLS Gen2 storage account

To actually store the dataflow definition and the related data files in your ADLS Gen2 storage account, you must create a new Power BI app workspace or update an existing one. In case of an update, make sure you do not already have dataflows stored in the workspace. Otherwise you cannot change this setting.

In this case we created a new workspace. Turn on Dataflow storage (preview) under "Advanced" settings. We then built the same (simple) dataflow as in our previous post. Save your dataflow and click "Refresh".

Azure Storage Explorer
When the dataflow is refreshed, go to the Blob container you created earlier for your ADLS Gen2 storage account. Here you will find the definition (source code) of the dataflow and the output. Note that the content of the ADLS Gen2 storage account is only visible in Azure Storage Explorer, which you can download from Microsoft.

Azure Data Explorer - Result in own ADLS Gen2 storage account
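Once the definition and output files are in your own storage account, other tools can read them. As a rough sketch of how a script could discover the output files, the Python below parses a small, made-up definition. The layout (a model.json with entities, attributes and partitions) follows the Common Data Model folder convention as we understand it. The sample content, entity name and URL are invented for illustration, so check an actual file from your own container before relying on these field names.

```python
import json

# Hypothetical, heavily shortened dataflow definition (assumed CDM folder layout).
SAMPLE_MODEL_JSON = """
{
  "name": "SalesDataflow",
  "entities": [
    {
      "name": "Customers",
      "attributes": [
        {"name": "CustomerId", "dataType": "int64"},
        {"name": "CustomerName", "dataType": "string"}
      ],
      "partitions": [
        {"name": "Part001",
         "location": "https://example.dfs.core.windows.net/powerbi/ws/SalesDataflow/Customers/Part001.csv"}
      ]
    }
  ]
}
"""


def list_entities(model_json_text):
    """Return one dict per entity with its column names and data file locations."""
    model = json.loads(model_json_text)
    result = []
    for entity in model.get("entities", []):
        result.append({
            "entity": entity["name"],
            "columns": [a["name"] for a in entity.get("attributes", [])],
            "files": [p["location"] for p in entity.get("partitions", [])],
        })
    return result


if __name__ == "__main__":
    for item in list_entities(SAMPLE_MODEL_JSON):
        print(item["entity"], item["columns"], len(item["files"]), "file(s)")
```

Reading the definition first, instead of guessing file names, keeps such a script working when a dataflow refresh writes new partition files.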

You can now choose per workspace whether you want to use this new data lake. This only works for the new workspaces, which are at the moment also still in preview.
Power BI Workspace settings

Common errors
When you try to connect to your own storage account in the admin portal of the Power BI Service, you can get several errors.

You must have global administrator permissions
Only Global Administrators in Office 365 or Azure Active Directory are administrators in Power BI and therefore able to connect to the storage account. Click here for more information about administrator roles in Power BI.

There was a problem accessing your dataflow storage account
After creating your storage account, it can take up to 30 minutes before you can make a connection. Also check the values you entered for spelling mistakes.

Your storage account must be in the same Azure Active Directory tenant
This occurs, for example, when you try to connect to the ADLS storage account in the Power BI Service with an account from outside the organization. In this case, the organization (subscription) is where the storage account was created.

In this post we showed you how to use your own Azure Data Lake Storage account instead of the default provided by Power BI using dataflows. This new feature has several possible use cases. For example:
  1. Data preparation for 'corporate' data warehousing by a business user with a user-friendly tool
  2. Data preparation for 'corporate' data science by a business user with a user-friendly tool
  3. Creating 'external' dataflows for Power BI with Azure Services like Data Factory or Databricks
  4. Using data from other CDM-compliant applications like Dynamics 365 and Office 365

We hope Microsoft will make it a bit easier to bring your own storage, because at the moment there are a lot of steps to take and you need a lot of rights to do it. This discourages people from setting up this powerful option.

We also expect some more admin capabilities in Power BI, because at this moment you cannot change your dataflow storage once Power BI is connected to your own ADLS account (so be very careful). And for larger corporations one ADLS account is probably not enough. It is also expected that the integration and collaboration with other Azure services will be improved so that you are even more flexible in choosing the services in your BI landscape.

Wednesday 14 February 2018

Cognitive functions U-SQL: text sentiment

U-SQL has cognitive capabilities to analyse the sentiment of a text. How does that work? Do I need Azure Cognitive Services?
U-SQL Cognitive Capabilities

Good news is that you only need Azure Data Lake (Analytics and Store) with a U-SQL job. Downside is that U-SQL does not yet have the full functionality of Azure Cognitive Services, but all the basics are available. This blog post describes the text sentiment analysis, but there is a second text analysis capability for key phrase extraction, which will be handled in a future post.

Please see our blog post about Image Tagging with U-SQL in Data Lake if you have not yet installed the Cognitive Functions for U-SQL that we will be using for this post.

Starting point
The starting point of this blog post is an Azure Data Lake Analytics (ADLA) account that is connected to an Azure Data Lake Store (ADLS) with some texts to analyse. For this example we used a transcript of Obama's Victory speech from 2008, but you could for example also use a transcript of Trump's Davos speech from a few weeks ago. The text file will be stored in an ADLS folder called TextSentiment.

Text sentiment
The text sentiment analysis returns two columns for each row of text. The first column is the sentiment classification: Positive, Negative or Neutral. The second column is a score between -1 and 1: a number close to -1 means the analyzer is very sure that the text is negative, and a number close to 1 means it is very sure that the text is positive. This also allows you to take an average over the entire text file to get an overall score.

1) Create new job
On the ADLA overview page click on +New Job and then give it a suitable name before we start coding. A descriptive name allows you to find your script/job in the overview of all jobs.
Create new U-SQL job

2) Referencing assemblies
The cognitive scripts in U-SQL always start with adding references. For text sentiment we need to add a reference to the assembly "TextSentiment".
// Needed for text sentiment
REFERENCE ASSEMBLY [TextSentiment];

3) Extract text file
Next step is to extract the text file with the transcript from the ADLS container and store it in a rowset variable called @speech. Each row in the transcript text file contains one paragraph of text. Therefore we will use Extractors.Text() with only one string column. We replaced the default delimiter with a character that doesn't occur in the text (the pipe character |); if it does occur, the silent option will skip the affected row and continue without throwing errors.

The extraction script looks a bit like a T-SQL SELECT statement, but because we are getting unstructured data it starts with EXTRACT instead of SELECT and we need to specify the data type for each column we extract (Schema on Read). The FROM does not get the data from a table, but from a file in the ADLS container called "TextSentiment".
// Get the transcript file from the ADLS container
@speech =
    EXTRACT Text string
    FROM @"/TextSentiment/ObamasVictorySpeechTranscript.txt"
    USING Extractors.Text(silent: true, delimiter: '|');

4) Transform data
The method that analyses the text for sentiment takes one read-only input column and returns three output columns: the original text, the classification and the confidence. The confidence column can be turned on or off with the Boolean parameter (see code). The names and data types of the output columns cannot be changed.
//Analyse the text and return classification and confidence
@sentiment =
    PROCESS @speech
    PRODUCE Text,
            Sentiment string,
            Conf double
    READONLY Text // Pass the original text through unchanged
    USING new Cognition.Text.SentimentAnalyzer(true); // True adds the confidence column

5) Output data
Now we can output the data to a file in an ADLS container. In the first output we will see a score per line and in a second output we will aggregate the confidence column to get an overall score.
// Output sentiment per line
OUTPUT @sentiment
TO "/TextSentiment/SentimentAnalyzerObama.txt"
USING Outputters.Csv(outputHeader: true);

// Aggregate the Confidence to get an overall score
// Note: it doesn't take into account the length of
// each row. You can find the length with Text.Length
@average =
    SELECT AVG(Conf) AS OverallSentimentScore
    FROM @sentiment;

// Output overall score sentiment 
OUTPUT @average
TO "/TextSentiment/SentimentAnalyzerObamaOverall.txt"
USING Outputters.Csv(outputHeader: true);
Download the complete script here.

The result
Now the script is ready to run. Click on the submit button and wait for the job to finish. This could take a few moments! Then browse to the ADLS folder and preview the file to see the result.
The result with the sentiment and score per paragraph

Another option to view the result is the Azure Storage Explorer. This new Microsoft tool allows you to browse your storage accounts and data lake stores and download the result of your U-SQL query.
Azure Storage Explorer

In this post you saw how to analyse texts for sentiment. Analyzing media like Twitter or Facebook, or emails to and from your helpdesk, is probably more interesting than speeches from presidents of the United States. Some might say that the overall score with the AVG is a bit arbitrary because it averages the confidence, but combined with the text length it gives some good insight into the entire text.
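To illustrate what combining the score with the text length could look like, here is a minimal Python sketch that reads the per-line output and compares the plain average with a length-weighted average. The sample rows and their scores are made up, and the column names are assumed to match the PRODUCE clause of the script (Text, Sentiment, Conf).

```python
import csv
import io

# Made-up sample in the shape of the per-line output: Text, Sentiment, Conf.
SAMPLE_OUTPUT = '''"Text","Sentiment","Conf"
"Thank you.","Positive",0.9
"It was a long and difficult road to get here tonight.","Negative",-0.4
'''


def overall_scores(csv_text):
    """Return (plain average, length-weighted average) of the Conf column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no rows to average")
    scores = [float(row["Conf"]) for row in rows]
    lengths = [len(row["Text"]) for row in rows]
    plain = sum(scores) / len(scores)
    total_length = sum(lengths)
    if total_length == 0:
        raise ValueError("all rows are empty")
    # Each score counts in proportion to the number of characters in its row.
    weighted = sum(s * n for s, n in zip(scores, lengths)) / total_length
    return plain, weighted


if __name__ == "__main__":
    plain, weighted = overall_scores(SAMPLE_OUTPUT)
    print("plain average:", round(plain, 3))
    print("length-weighted average:", round(weighted, 3))
```

In this sample the short positive line dominates the plain average, while the weighted average gives the longer negative paragraph more influence.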

As said before the text sentiment in Azure Cognitive Services - Text Analytics API has some additional options like support of multiple languages and language detection, but we will show that in a future post.

Thursday 16 November 2017

Cognitive functions U-SQL: emotion, age & gender

U-SQL has cognitive capabilities to analyse pictures of persons to detect age, gender and emotions. How do they work and do I need Azure Cognitive Services?
U-SQL Cognitive Capabilities

Good news is that you only need Azure Data Lake (Analytics and Store) with a U-SQL job. Downside is that U-SQL does not yet have the full functionality of Azure Cognitive Services, but all the basics are available. In a previous blog post we showed the basics of the cognitive capabilities in U-SQL and an example of tagging images to add descriptive labels to them. If you have never used U-SQL before, then first read that post. This follow-up post continues with two new examples: detecting emotions and detecting age & gender.

Starting point
The starting point of this blog post is an Azure Data Lake Store (ADLS) with a collection of 'random' pictures of humans. We have a folder called 'faces' that contains random images which we will use for these next two examples.
Test faces

1) Emotions Script
The emotion script scans the pictures for faces and then tries to determine the emotion of each face (anger, contempt, disgust, fear, happiness, neutral, sadness, surprise). For each face it shows where it was located in the picture and then shows its emotion and the confidence rate for that emotion.
Me a few weeks ago at a party

Referencing assemblies
For emotion scanning we need one extra reference called "ImageEmotion".
// Needed for image extraction and emotions
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageEmotion;

Extract image files
This code, to extract image files from an ADLS container, is exactly the same as in the previous examples.
// Get the image data from ADLS container
@images =
    EXTRACT     FileName string, 
                ImgData byte[]
    FROM        @"/faces/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

Transform data
Scanning the images for faces and their emotion is done by cross joining the images rowset to the EmotionApplier method. The column names, datatypes and column order are fixed, but you can add aliases for different column names or change the order in the SELECT part of the query.

The query returns one record per face on the image. Besides the emotion you also get a confidence rate, the number of faces, the face number and the position on the image.
// Query detects emotion and the confidence
// If there are multiple faces it creates
// one record for each face. It also shows
// the position of the face on the picture.
@emotions =
    SELECT FileName.ToLower() AS FileName,
        Details.NumFaces,
        Details.FaceIndex,
        Details.RectX, Details.RectY, Details.Width, Details.Height,
        Details.Emotion,
        Details.Confidence
    FROM @images
    CROSS APPLY
        USING new Cognition.Vision.EmotionApplier() AS Details(
            NumFaces int, 
            FaceIndex int, 
            RectX float,
            RectY float,
            Width float,
            Height float, 
            Emotion string, 
            Confidence float);

Output data
This is the same code as in the previous examples to output the detected emotions to a file in an ADLS container.
// Output the emotions rowset to a CSV file
// located in the Azure Data Lake Store
OUTPUT @emotions
    TO "/faces/emotions.csv"
    ORDER BY    FileName
    USING Outputters.Csv(outputHeader: true);
Download the complete script here.

The result
Now the emotion script is ready to run. Click on the submit button and wait for the job to finish. This could take a few moments! Then browse to the ADLS folder and preview the file to see the result.
The result with in red the happy man from above
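Because the query returns one record per face, a picture with several people shows up several times in the CSV. A small Python sketch like the one below could group those records per picture again and flag uncertain detections. The rows are made up, the header is assumed to match the column names of the emotions rowset, and the 0.75 threshold is an arbitrary choice for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; the header is assumed to match the columns of the emotions rowset.
SAMPLE_EMOTIONS = """FileName,NumFaces,FaceIndex,RectX,RectY,Width,Height,Emotion,Confidence
party,2,0,0.10,0.20,0.30,0.30,happiness,0.98
party,2,1,0.60,0.20,0.30,0.30,neutral,0.71
portrait,1,0,0.30,0.10,0.40,0.50,surprise,0.64
"""


def faces_per_file(csv_text):
    """Group the detected faces per picture, ordered by face number."""
    faces = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        faces[row["FileName"]].append(
            (int(row["FaceIndex"]), row["Emotion"], float(row["Confidence"]))
        )
    return {name: sorted(items) for name, items in faces.items()}


def low_confidence(csv_text, threshold=0.75):
    """Return (file, face number) pairs whose emotion scored below the threshold."""
    flagged = []
    for name, items in faces_per_file(csv_text).items():
        for face_index, _emotion, confidence in items:
            if confidence < threshold:
                flagged.append((name, face_index))
    return flagged


if __name__ == "__main__":
    print(faces_per_file(SAMPLE_EMOTIONS))
    print(low_confidence(SAMPLE_EMOTIONS))
```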

2) Age/gender Script
The age/gender script scans the pictures for faces and then tries to determine the age and gender of each face. It is very similar to the emotion script.
Me at 43

Referencing assemblies
For age and gender scanning we need one extra reference called "FaceSdk".
// Needed for image extraction and age/gender
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;

Extract image files
Again the same code as in the previous examples to extract image files from an ADLS container.
// Get the image data from ADLS container
@images =
    EXTRACT FileName string, 
            ImgData byte[]
    FROM @"/faces/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

Transform data
Scanning the images for age and gender is done by cross joining the images rowset to the FaceDetectionApplier method. The column names, datatypes and order are fixed, but you can add aliases for different column names.

The query returns one record per face on the image. Besides the age and gender you also get the number of faces, the face number and the position on the image.
// Query detects age and gender
// If there are multiple faces it creates
// one record for each face. It also shows
// the position of the face on the picture.
@faces_analyzed =
    SELECT FileName.ToLower() AS FileName,
        Details.NumFaces,
        Details.FaceIndex,
        Details.RectX, Details.RectY, Details.Width, Details.Height,
        Details.FaceAge,
        Details.FaceGender
    FROM @images
    CROSS APPLY
        USING new Cognition.Vision.FaceDetectionApplier() AS Details(
            NumFaces int, 
            FaceIndex int, 
            RectX float, RectY float, Width float, Height float, 
            FaceAge int, 
            FaceGender string);

Output data
Outputting the data to ADLS uses the same code as in the previous examples.
// Output the gender and age rowset to a CSV file
// located in the Azure Data Lake Store
OUTPUT @faces_analyzed
    TO "/faces/agegender.csv"
    USING Outputters.Csv(outputHeader: true);
Download the complete script here.

The result
Now the age and gender script is ready to run. Click on the submit button and wait for the job to finish. This could take a few moments! Then browse to the ADLS folder and preview the file to see the result.
The result with my photo in red

This post showed you how to use U-SQL to detect emotion, age and gender from pictures. The next step could be to join these examples in one big script. When you want to try that, keep in mind that the ON clause uses two equals signs instead of one (C# instead of T-SQL): ON a.FileName == e.FileName. If you want to try these scripts yourself, then you can only do that in the Azure portal. The U-SQL projects for Visual Studio do not yet support these extensions.
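One thing to watch when joining the two results: both rowsets contain one record per face, so a join on FileName alone pairs every face with every other face in the same picture. The Python sketch below mimics that join on made-up rows. Adding FaceIndex as a second key is our assumption that both methods number the faces in the same order, which we have not verified.

```python
from collections import defaultdict


def join_rows(left, right, keys):
    """Inner join two lists of dicts on the given key columns."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged = dict(match)
            merged.update(row)  # left values win for shared column names
            joined.append(merged)
    return joined


# Made-up results for one picture with two faces.
EMOTIONS = [
    {"FileName": "party", "FaceIndex": 0, "Emotion": "happiness"},
    {"FileName": "party", "FaceIndex": 1, "Emotion": "neutral"},
]
AGE_GENDER = [
    {"FileName": "party", "FaceIndex": 0, "FaceAge": 43, "FaceGender": "Male"},
    {"FileName": "party", "FaceIndex": 1, "FaceAge": 35, "FaceGender": "Female"},
]

if __name__ == "__main__":
    # Joining on FileName only gives every combination of faces.
    print(len(join_rows(EMOTIONS, AGE_GENDER, ["FileName"])))
    # Joining on FileName and FaceIndex gives one row per face.
    print(len(join_rows(EMOTIONS, AGE_GENDER, ["FileName", "FaceIndex"])))
```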

As said before, the functionality in U-SQL is not yet the same as in Azure Cognitive Services, which has many more options (and there my age was estimated at 39 with the same picture). Hopefully this will change, but for now the basics are working. Keep an eye on the Data Lake topic page where we will post new examples when more functionality is available.

Cognitive functions U-SQL: image tagging

U-SQL has cognitive capabilities to analyse images. How do they work? Do I need Azure Cognitive Services?
U-SQL Cognitive Capabilities

Good news is that you only need Azure Data Lake (Analytics and Store) with a U-SQL job. Downside is that U-SQL does not yet have the full functionality of Azure Cognitive Services, but all the basics are available. This blog post starts with a very simple image extraction script to explain the basics of the U-SQL cognitive functions. In the second example we will tag images to add descriptive labels to them.

In a second post we will also show how to detect faces on images and extract emotion, gender and age from them. The basis of all these scripts is very similar.

Starting point
The starting point of this blog post is an Azure Data Lake Store (ADLS) with a collection of 'random' images. We have a folder called 'objects' that contains random object images which we will use for these first two scripts.
The content of ADLS container with random google image pictures

Create ADLA environment
To start we need to create a new Azure Data Lake Analytics (ADLA) environment and connect it to the existing ADLS with the image collection. Go to the Azure portal and click on New in the top left corner of the dashboard and locate ADLA under "Data + Analytics". Supply the basics like name, subscription, resource group and location. One of the last steps is selecting the ADLS (or creating a new one). Unless you have a good reason to deviate, it is wise to use the same location for ADLS and ADLA to prevent unnecessary data traffic around the world, which could make your queries slower and therefore cost you extra money.
Creating new ADLA and connect it to ADLS

Install U-SQL Extensions
To make use of the cognitive functions in U-SQL, we first need to install the extensions. Go to Sample Scripts in the menu of ADLA and then click on Install U-SQL Extensions in the header. This assembly installation takes a few minutes, but you only have to do this once per ADLA.
Install U-SQL extensions

You can check the internal database in the Data Explorer to see which assemblies are installed. The Data Explorer button can be found on the ADLA overview page in the header.
Check which assemblies are installed

A) Basic script
Let's start with a very basic example: extracting image files from an ADLS container and creating a CSV file with all filenames in it.

1) Create new job
On the ADLA overview page click on +New Job and then give it a suitable name before we start coding.
Create new U-SQL job

2) Referencing assemblies
The cognitive image scripts in U-SQL always start with adding references. For image extraction we need to add a reference to "ImageCommon".
// Needed for image extraction
REFERENCE ASSEMBLY ImageCommon;

3) Extract image files
Next step is to extract the actual files from the ADLS container and store them in a rowset variable called @images. The ImageExtractor method can only get the filename and the actual bytes of the file. The order and datatype of these columns are fixed, but you can use different column names.

It looks a bit like a T-SQL SELECT statement, but because we are getting unstructured data it starts with EXTRACT instead of SELECT and we need to specify the data type. The FROM does not get the data from a table, but from the ADLS container called "objects" and the construction with {FileName}.jpg is a wildcard to only get JPG images from that container.
// Get the image data from ADLS container
@images =
    EXTRACT     FileName string,
                ImageData byte[]
    FROM        @"/objects/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();
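To get a feeling for what the {FileName}.jpg construction does, the Python sketch below imitates it: the placeholder becomes a named capture group, so the pattern both filters the files and fills the FileName column. This is only an illustration of the idea; the exact matching rules of U-SQL (for example which characters the placeholder may match) are not covered in this post and may differ. The paths are made up.

```python
import re


def pattern_to_regex(file_set_pattern):
    """Turn '/objects/{FileName}.jpg' into a regex with a named group per placeholder."""
    # re.split with a capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", file_set_pattern)
    regex = ""
    for position, part in enumerate(parts):
        if position % 2 == 0:
            regex += re.escape(part)
        else:
            # Assumption: a placeholder matches anything except a folder separator.
            regex += "(?P<%s>[^/]+)" % part
    return re.compile("^" + regex + "$")


def match_files(file_set_pattern, paths):
    """Return (path, captured values) for every path that fits the pattern."""
    compiled = pattern_to_regex(file_set_pattern)
    matches = []
    for path in paths:
        found = compiled.match(path)
        if found:
            matches.append((path, found.groupdict()))
    return matches


if __name__ == "__main__":
    sample_paths = ["/objects/bike.jpg", "/objects/cat.png", "/faces/me.jpg"]
    for path, values in match_files("/objects/{FileName}.jpg", sample_paths):
        print(path, values)
```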

4) Transform data
For our CSV with filenames we only want to extract the filename column from the rowset variable called @images. This is done with a very simple SELECT query on the rowset variable from the previous step to extract the required data.
// Create a list of filenames
@result = 
    SELECT      FileName
    FROM        @images;

You can add an ORDER BY clause, but it requires adding FETCH to specify the number of rows that you want to select and sort. By default the ORDER BY is case sensitive (just like C#). You can overcome this by adding .ToLower() after the column name.
// Create an ordered list of filenames
// Note 1: ORDER BY requires the FETCH option to supply the number of rows
// Note 2: ORDER BY is case sensitive. Workaround: add .ToLower() 
// Note 3: ORDER BY can be moved to OUTPUT section (below TO)
@result = 
    SELECT      FileName
    FROM        @images
    ORDER BY    FileName.ToLower() 
    FETCH       10 ROWS;
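The effect of a case-sensitive sort is easiest to see with a few made-up filenames. The Python sketch below uses Python's default string comparison as a stand-in for the case-sensitive ORDER BY and a lowercased sort key as a stand-in for the .ToLower() workaround. Whether U-SQL orders mixed-case names in exactly the same way is an assumption.

```python
# Made-up filenames with mixed casing.
FILENAMES = ["bike", "Apple", "apple", "Car"]


def case_sensitive_order(names):
    """Sort on raw character codes: all uppercase letters come before lowercase."""
    return sorted(names)


def case_insensitive_order(names):
    """Sort on a lowercased key, comparable to ORDER BY FileName.ToLower()."""
    return sorted(names, key=str.lower)


if __name__ == "__main__":
    print(case_sensitive_order(FILENAMES))    # ['Apple', 'Car', 'apple', 'bike']
    print(case_insensitive_order(FILENAMES))  # ['Apple', 'apple', 'bike', 'Car']
```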

5) Output data
Last step is to save the data in a CSV file in an ADLS container. In this example we are outputting the rowset variable @result that was created in the previous step. Outputters.Csv has many options to format your CSV file, but they are all optional.
// Output the rowset to a CSV file located in the Azure Data Lake Store
OUTPUT @result
    TO "/objects/filenamelist.csv"
    USING Outputters.Csv(outputHeader: true, quoting: false);

Instead of a hardcoded path in the OUTPUT section you could also use a variable to move the hardcoded part to the top of your script.
// Declare where the result should be stored
DECLARE @outputpath string = "/objects/filenamelist.csv";

// Output the rowset to a CSV file located in the Azure Data Lake Store with variable
OUTPUT @result
    TO @outputpath
    USING Outputters.Csv();

There is an alternative place for the ORDER BY. You can also add it in the OUTPUT section right below the TO clause. It does not allow the FETCH option, which is a good thing, but it also does not allow the .ToLower() workaround (causing a case sensitive ordering). You could solve that by lowering it in the @result rowset instead.
// Create a list of filenames (lowercase)
@result = 
    SELECT      FileName.ToLower() AS FileName
    FROM        @images;

// Output the rowset to a CSV file located in the Azure Data Lake Store
// ORDERED BY filename descending.
OUTPUT @result
    TO "/objects/filenamelist.csv"
    ORDER BY    FileName DESC
    USING Outputters.Csv(outputHeader: true);
Download the complete script here.

6) Run Job
Now the script is ready to run. To improve the performance we increase the AUs a little bit, but this increases the costs. In a later post the optimal settings will be explained. Then click on the submit button and wait for the job to finish. This could take a few moments!
Running the job (not the actual speed)

7) The result
When the job has finished you can preview the result file in ADLS. Use the Data Explorer to browse to the folder and then preview the generated CSV file.
Preview result in Data Explorer

B) Tagging script
Image tagging means that U-SQL scans the images and adds descriptive words to them, including a probability rate that shows how certain it is about each word. If you have a picture of someone cycling in the mountains, then it will add words like bicycle, mountain, outdoor, person, sky.

Referencing assemblies
For image tagging we need one extra reference called "ImageTagging".
// Needed for image extraction and tagging
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

Extract image files
This is the same code as in the previous example to extract image files from an ADLS container.
// Get the image data from ADLS container
@images =
    EXTRACT     FileName string, 
                ImgData byte[]
    FROM        @"/objects/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

Transform data
Tagging the images is a two-step action. The first step adds (zero, one or) multiple tags with their probability as value pairs and also returns the number of tags added. The second step converts all those value pairs to a string which we can export.
// Process the images and add multiple tag pairs (tag and probability rate)
// NumObjects contains the number of tag pairs added to the image
@tags =
    PROCESS  @images 
    PRODUCE  FileName,
             NumObjects int,
             Tags SQL.MAP<string, float?>
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

// We need to convert the tagpairs to a string which we can export
// The string will look like: bicycle:0.9998484;outdoor:0.9164549;transport:0.7914466
@tags_serialized =
    SELECT  FileName.ToLower() AS FileName,
    NumObjects AS TagsCount,
    String.Join(";", Tags.Select(x => String.Format("{0}:{1}", x.Key, x.Value))) AS TagsString
    FROM  @tags;
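If you want to use the serialized tags in another tool, you have to split the string again. Below is a minimal Python sketch that parses a string like the one in the comment above and keeps only the tags above a threshold. The 0.9 threshold is an arbitrary example value, and the pair separator is a parameter because it depends on what you pass to String.Join.

```python
def parse_tags(tags_string, pair_separator=";"):
    """Split 'tag:probability' pairs into a dict of tag -> float."""
    tags = {}
    if not tags_string:
        return tags
    for pair in tags_string.split(pair_separator):
        name, _, value = pair.rpartition(":")
        if value == "":
            continue  # no probability available for this tag
        tags[name] = float(value)
    return tags


def confident_tags(tags_string, threshold=0.9, pair_separator=";"):
    """Return the tag names at or above the threshold, most probable first."""
    tags = parse_tags(tags_string, pair_separator)
    keep = [(probability, name) for name, probability in tags.items()
            if probability >= threshold]
    return [name for probability, name in sorted(keep, reverse=True)]


if __name__ == "__main__":
    example = "bicycle:0.9998484;outdoor:0.9164549;transport:0.7914466"
    print(parse_tags(example))
    print(confident_tags(example))
```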

Output data
This is the same code as in the previous example to output the filename and tags to a file in an ADLS container. Only the variable name and the file name have changed.
// Output the rowset to a CSV file located in the Azure Data Lake Store
OUTPUT @tags_serialized
    TO "/objects/tagging.csv"
    ORDER BY    FileName
    USING Outputters.Csv(outputHeader: true);
Download the complete script here.

The result
Now the script is ready to run. Click on the submit button and wait for the job to finish. Again, this could take a few moments! Then browse to the ADLS folder and preview the file to see the result.
The result with in red the cyclist from above

The tagging in Azure Cognitive Services - Computer vision API has some additional options, but we will show that in a future post.

In this post you saw how to extract images from ADLS and process them with U-SQL in ADLA. We also showed how tagging of images works, and in the next post we will handle the scanning of faces for emotions, gender and age. If you want to try these scripts yourself, then you can only do that in the Azure portal. The U-SQL projects for Visual Studio do not yet support these extensions.

Monday 23 October 2017

Use PolyBase to read Data Lake in Azure SQL DW

I have a file in an Azure Data Lake Store (ADLS) folder which I want to use in my Azure SQL Data Warehouse. In a previous blog post you used PolyBase to get the data from an Azure Blob Storage container via its access keys. How can I use PolyBase to get the data from ADLS and push the content of that file to Azure SQL DW?
Azure SQL Data Warehouse - PolyBase on ADLS

In the previous blog post we showed how to read that file from an Azure Blob Storage container via its access keys using PolyBase. However, ADLS does not work with those keys, but uses Azure Active Directory to provide access. To get authorization via Azure Active Directory we need to register a 'Web app / API' application in Azure Active Directory that does the authorization for us. That sounds very difficult and the documentation on MSDN is not very helpful, but in the end it was quite easy.

a) Starting point
The starting point of this blog post is a file in an ADLS folder called 'mySubFolder'. The file was created in a previous blog post about U-SQL that can quickly process large amounts of data files in Azure Data Lake Analytics (ADLA). The name of our ADLS is 'bitools'.
Starting point: CSV file in ADLS
The content of the CSV file

a1) App registrations
Go to the Azure portal and search for Azure Active Directory in the search box located in the header. This will bring you to the Azure Active Directory from your subscription. Then click on App registrations in the menu. It will show a list of all existing registrations. Next step is to click on New application registration to create a new registration for our data lake.
New application registration

a2) New application registration
Enter a new descriptive name like 'Data Lake bitools' so you will know what it is used for. Choose 'Web app / API' as Application type and then you need to enter a URL. Since we are not using the Sign-on URL property (we use the Azure sign-on), you can just enter any URL like ''. When complete, click on the Create button.
New application registration

a3) Edit application registration - Application ID
Now search your newly created Application registration to get its Application ID. Copy that to a notepad (we need it later on). You can also edit additional properties like giving it a custom logo to make it more recognizable if you have an extensive list of app registrations.
Copy Application ID

a4) Edit application registration - Keys
Continue editing and click on Keys in the menu to create a new access key. Give it a suitable name and expiration period. After clicking Save, make sure to copy the generated key to the same notepad as before since you can only get it once! If you lose it you have to delete and recreate it.
Create new key

a5) Active Directory ID
Now go back to your Azure Active Directory to copy the Directory ID. You can find it when you click on Properties in the menu. Copy this ID to the same notepad that now should contain three values (ApplicationID, generated key and DirectoryID).
Get Directory ID

a6) Setting access root folder
Go to your ADLS and click on Data Explorer. You are now in the root of your ADLS. Click on Access and then on Add to assign new permissions. Search for your Registered Application called 'bitools'. Then select it and click on the Select button. In the root folder we only need Execute permissions. 'Add to' should stay on 'This folder' and 'Add as' should stay on 'An access permission entry'. Click on the Ok button to confirm.
Setting permissions on root folder

If you forget to give Execute permissions to the root folder you will get an error when adding an external table later on:
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
[dbd91a77-1b0a-4f11-9710-d7c1b6b05268][2017-10-21T12:26:47.7748687-07:00]: Error [GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
[dbd91a77-1b0a-4f11-9710-d7c1b6b05268][2017-10-21T12:26:47.7748687-07:00]] occurred while accessing external file.'

a7) Setting access sub folder
Now we have to repeat this for our subfolder called 'mySubFolder'. Click on the folder and you should see the source file. Click on Access and then on Add to assign new permissions. Search for your Registered Application called 'bitools'. Then select it and click on the Select button. In this sub folder we need Read and Execute permissions. 'Add to' should be changed to 'This folder and all children' and 'Add as' should stay on 'An access permission entry'. Click on the Ok button to confirm.
Setting permissions on sub folder

An alternative could be to give the bitools app read and execute rights on the root including all children. That saves you one step, but is less secure if you use your Data Lake for multiple purposes.
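
The permissions from steps a6 and a7 can be pictured with a small Python sketch. It is a simplified model and not how ADLS itself is implemented: we assume that reading a file requires Execute on every folder in the path (starting with the root) and Read on the file itself. The function name can_read_file and the file name data.csv are made up for this illustration.

```python
from typing import Dict, List, Set


def can_read_file(path: str, acl: Dict[str, Set[str]]) -> bool:
    """Return True if the (simplified) ACL allows reading the file at `path`.

    `acl` maps a folder or file path to the permissions granted to the
    application: "r" (Read), "w" (Write) and "x" (Execute).
    """
    parts = [p for p in path.split("/") if p]
    # Build the list of ancestor folders: "/", "/mySubFolder", ...
    folders: List[str] = ["/"]
    for name in parts[:-1]:
        folders.append(folders[-1].rstrip("/") + "/" + name)
    # Every ancestor folder needs Execute, otherwise it cannot be traversed.
    if any("x" not in acl.get(folder, set()) for folder in folders):
        return False
    # The file itself needs Read.
    return "r" in acl.get(path, set())


# Permissions as set in steps a6 and a7 (data.csv is a hypothetical file name).
acl = {
    "/": {"x"},
    "/mySubFolder": {"r", "x"},
    "/mySubFolder/data.csv": {"r", "x"},  # inherited via 'This folder and all children'
}
allowed = can_read_file("/mySubFolder/data.csv", acl)

# Forgetting Execute on the root blocks access to everything below it.
acl_without_root = {k: v for k, v in acl.items() if k != "/"}
blocked = not can_read_file("/mySubFolder/data.csv", acl_without_root)
```

In this model the missing Execute permission on the root is enough to block the file, which matches the 'ACL verification failed' error shown above.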

Now it's time to start with the actual PolyBase code, but before we start, make sure your Azure SQL Data Warehouse is started and use SQL Server Management Studio (SSMS) to connect to your Data Warehouse. Notice that the icon of a SQL DW is different from that of a SQL DB.

b1) Master key
In the next step we will use a credential that points to the registered application. To encrypt that credential, we first need to create a master key in our Azure SQL Data Warehouse, but only if you do not already have one. You can check that in the table sys.symmetric_keys. If a row exists where the symmetric_key_id column is 101 (or the name column is '##MS_DatabaseMasterKey##') then you already have a master key. Otherwise we need to create one. For Azure SQL Data Warehouse a master key password is optional. For this example we will not use a password.
--Master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
BEGIN
    PRINT 'Creating Master Key'
    CREATE MASTER KEY;
END
ELSE
    PRINT 'Master Key already exists'

b2) Credentials
The next step is to create a credential which will be used to access the subfolder in ADLS. For this you need the IDs and key from the notepad. The IDENTITY has the following format:
Replace ApplicationID (including the square brackets) with the ID from step a3 and DirectoryID (including the square brackets) with the ID from a5. The SECRET should be filled with the key from step a4. After setting the correct IDs and key, execute the following code:
--Create credential
CREATE DATABASE SCOPED CREDENTIAL bitools_user WITH
    IDENTITY = 'aaf0ab52-560e-40b1-b4df-caac1f0e5376@',
    SECRET = '6LUnE4shZ4p1jUhj7/fkLH03yfbSxi2WRWre9c0yVTs=';

Give the credential a descriptive name so that you know what it is used for. You can find all credentials in the table sys.database_credentials:
--Find all credentials
SELECT * FROM sys.database_credentials

b3) External data source
With the credential from the previous step we will create an External data source that points to the ADLS folder where your file is located. Execute the code below where:
  • TYPE = HADOOP (because PolyBase uses the Hadoop APIs to access the container)
  • LOCATION = the connection string to ADLS (replace 'bitools' with the name of your own ADLS).
  • CREDENTIAL = the name of the credentials created in the previous step.
--Create External Data Source
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://',
    CREDENTIAL = bitools_user
);

Give the external data source a descriptive name so that you know what it is used for. You can find all external data sources in the table sys.external_data_sources:
--Find all external sources
SELECT * FROM sys.external_data_sources

Notice that the filename or subfolder is not mentioned in the External Data Source. This is done in the External Table. This allows you to use multiple files from the same folder as External Tables.

b4) External File format
Now we need to describe the format used in the source file. In our case we have a comma delimited file. You can also use this file format to supply the date format, compression type or encoding.
--Create External File Format
CREATE EXTERNAL FILE FORMAT TextFile WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

Give the format a descriptive name so that you know what it is used for. You can find all external file formats in the table sys.external_file_formats:
--Find all external file formats
SELECT * FROM sys.external_file_formats

b5) External Table
The last step before we can start querying is creating the external table. In this create table script you need to specify all columns, datatypes and the filename that you want to read. The filename starts with a forward slash. You also need the data source from step b3 and the file format from step b4.
--Create External table
CREATE EXTERNAL TABLE dbo.sensordata (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/mySubFolder/yourfile.csv', -- placeholder: path of your own source file
    DATA_SOURCE=AzureDataLakeStore, -- from step 3
    FILE_FORMAT=TextFile            -- from step 4
);
PolyBase does not like column name headers. It will handle the header like a regular data row and throw an error when the datatype doesn't match. There is a little workaround for this with REJECT_TYPE and REJECT_VALUE. However, this only works when the datatype of the header is different from the datatypes of the actual rows. Otherwise you have to filter the header row in a subsequent step.
--Create External table with header
CREATE EXTERNAL TABLE dbo.sensordata5 (
    [Date] DateTime2(7) NOT NULL,
    [temp] INT NOT NULL,
    [hmdt] INT NOT NULL,
    [location] nvarchar(50) NOT NULL
)
WITH (
    LOCATION='/mySubFolder/yourfile.csv', -- placeholder: path of your own source file
    DATA_SOURCE=AzureDataLakeStore,
    FILE_FORMAT=TextFile,
    REJECT_TYPE = VALUE, -- Reject rows with wrong datatypes
    REJECT_VALUE = 1     -- Allow 1 failure (the header)
);
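
To make the reject workaround a bit more tangible, here is a small Python sketch of the idea. It is only a simplified model and not how PolyBase works internally: we assume that a row is rejected when one of its values cannot be converted to the column datatype, and that the load fails as soon as more rows are rejected than REJECT_VALUE allows. The function name load_rows and the sample data row are made up for this illustration.

```python
from datetime import datetime
from typing import List, Tuple

Row = Tuple[datetime, int, int, str]


def load_rows(lines: List[str], reject_value: int) -> List[Row]:
    """Convert CSV lines to typed rows, tolerating up to `reject_value` bad rows."""
    rows: List[Row] = []
    rejected = 0
    for line in lines:
        try:
            date, temp, hmdt, location = line.split(",")
            rows.append((datetime.fromisoformat(date), int(temp), int(hmdt), location))
        except ValueError:
            # The row does not fit the column datatypes, so it is rejected.
            rejected += 1
            if rejected > reject_value:
                raise RuntimeError("Too many rejected rows")
    return rows


lines = [
    "Date,temp,hmdt,location",          # header: 'Date' is not a valid datetime
    "2017-10-21 12:26:47,21,55,Room1",  # hypothetical data row
]
rows = load_rows(lines, reject_value=1)  # the header is the single allowed failure
```

If all columns were text columns, the header would convert without any error and end up as a regular row. That is why the workaround only helps when the header does not fit the datatypes of the columns.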
You can find all external tables in the table sys.external_tables.
--Find all external tables
SELECT * FROM sys.external_tables
However, you can also find the external table, the external data source and the external file format in the Object Explorer of SSMS.
SSMS Object Explorer

b6) Query external table
Now you can query the external table like any other regular table. However, the table is read-only, so you cannot delete, update or insert records. If you update the source file then the data in this external table also changes instantly, because the file is used to get the data.
SELECT count(*) FROM dbo.sensordata;
SELECT * FROM dbo.sensordata;
Querying an external table

b7) What is next?
Most likely you will be using a CTAS query (Create Table As Select) to copy and transform the data to another table, since this is the fastest/preferred way in SQL DW. In a subsequent post we will explain more about CTAS, but here is what a CTAS query looks like.
CREATE TABLE [dbo].[Buildings]
WITH (DISTRIBUTION = ROUND_ROBIN) -- assumption: pick the distribution that fits your table
AS
SELECT  [location]
,       [date]
,       [temp]
,       [hmdt]
FROM    [dbo].[sensordata];

In some cases you could also use a SELECT INTO query as an alternative for CTAS.

In this post you saw how easy it was to read a file from the Azure Data Lake Store and use it as a table in Azure SQL Data Warehouse, although it did require some extra steps compared to PolyBase on Azure Blob Storage. Jhon Masschelein has a very helpful post about this matter.

In another post we will explain the basic usage of the CTAS query, which is the preferred way to handle large sets of data in Azure SQL DW and in its on-premises precursor APS (a.k.a. PDW).

Tuesday 29 August 2017

Azure - Continue with Azure Data Lake for Big Data

In an earlier post we showed you how to transform sensor data using Azure Data Lake. Many companies are gathering (or already have) a lot of Big Data in many different files. How can we use Azure Data Lake Analytics (ADLA) to handle these files?

Big Data and U-SQL

Just like in the previous post, the sensor data is already stored in an Azure Data Lake Store (ADLS). Next, we build and configure a U-SQL job. U-SQL is Microsoft's new Big Data query language that you can use in ADLA. Last time we developed in the Azure Portal, but there are other options. Last month, Microsoft released a Visual Studio plug-in for Azure Data Lake and Stream Analytics. While writing U-SQL queries, this allows you to use other benefits of Visual Studio like Team Foundation Server (TFS), debugging and adding C# code for custom inputs and outputs.

In this case we have sensor data from one year. The data is stored in several files: one file per day. We want to create a U-SQL job that aggregates the data per day and then stores all the data. For now we focus on the query itself. See here how to create an ADLA service/account and to create a new U-SQL Job.

1) Install plug-in for Visual Studio
First we have to download and install the plug-in Microsoft Azure Data Lake and Stream Analytics Tools for Visual Studio. You can download the plug-in here. Besides creating and debugging U-SQL scripts, you can also build queries for Azure Stream Analytics jobs using this plug-in.

2) Write the Query
Open Visual Studio and create a new U-SQL project. Our U-SQL script is called 'multipleFiles'. The starting point is the query we made in an earlier post extracting one single sensor file.

Because we have multiple files, we create a dynamic FROM clause using variables, in this case for the folder path in ADLS. We use the following syntax for this: "bitools_sample_data_{*}.csv". The {*} is a wildcard and will get you every file of the year (see the comment in the query below for the structure of the input file names). We also skip the first row, which contains the headers.

// File naming convention: bitools_sample_data_01-01-2016.csv, bitools_sample_data_01-02-2016.csv etc.
// Create variable for input files
DECLARE @folderInput string = "/SensorData/Input/";
DECLARE @inputString string = @folderInput + "bitools_sample_data_{*}.csv";
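
To get a feeling for which files such a pattern picks up, you can mimic it in a few lines of Python. This is only an illustration on a list of file names and not how U-SQL resolves file sets: we assume that {*} behaves like the * wildcard of Python's fnmatch module. The function name match_file_set and the file readme.txt are made up for this example.

```python
from fnmatch import fnmatchcase
from typing import List


def match_file_set(pattern: str, file_names: List[str]) -> List[str]:
    """Return the file names that match a pattern containing the {*} wildcard."""
    # Assumption: {*} matches any sequence of characters, like * in fnmatch.
    glob = pattern.replace("{*}", "*")
    return [name for name in file_names if fnmatchcase(name, glob)]


file_names = [
    "bitools_sample_data_01-01-2016.csv",
    "bitools_sample_data_01-02-2016.csv",
    "readme.txt",  # hypothetical file that should not be picked up
]
matched = match_file_set("bitools_sample_data_{*}.csv", file_names)
```

Only the two sensor files end up in matched, because readme.txt does not fit the naming convention.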

To retrieve the data from the files, we use an EXTRACT statement. In an earlier post, we extracted the data as a string. Now we extract the 'time' column as date time format (just like the source file), using the variable in the FROM clause we created earlier.

// Extract the sensor data from CSV file (skip the header)
@sensorData = 
    EXTRACT
        [time]                    DateTime
    ,   [dsplid]                  string
    ,   [dspl]                    string
    ,   [temp]                    string
    ,   [hmdt]                    string
    ,   [status]                  string
    ,   [location]                string
    ,   [EventProcessedUtcTime]   string
    ,   [PartitionId]             string
    ,   [EventEnqueuedUtcTime]    string
    FROM @inputString
    USING Extractors.Csv(skipFirstNRows:1);

Next we aggregate the data into averages based on the 'time' and 'location' columns, using a SELECT statement. We convert the 'time' column to a date format, because we want to aggregate per day. We also give the columns suitable names. You may have noticed that we do not select all the columns, because we do not need all columns from the source file.

// Aggregate the sensor data (average per location) and data type conversions
@result =
    SELECT
        time.ToString("yyyy-MM-dd") AS Date
    ,   AVG(Convert.ToInt32([temp])) AS Temperature
    ,   AVG(Convert.ToInt32([hmdt])) AS Humidity
    ,   [location] AS Location
    FROM @sensorData
    GROUP BY
        time.ToString("yyyy-MM-dd")
    ,   [location];
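
If you want to check the logic of this aggregation on a few rows before submitting the job, the same calculation can be sketched in plain Python. This is only an illustration of the grouping (average per day per location) and not part of the U-SQL job. The function name average_per_day_per_location and the sample rows are made up for this example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple

SensorRow = Tuple[datetime, str, str, str]  # time, temp, hmdt, location


def average_per_day_per_location(
    rows: List[SensorRow],
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Average temperature and humidity per (date, location)."""
    groups = defaultdict(list)
    for time, temp, hmdt, location in rows:
        # Same key as the GROUP BY: the date part of 'time' plus the location.
        key = (time.strftime("%Y-%m-%d"), location)
        # temp and hmdt are extracted as strings, so convert them first.
        groups[key].append((int(temp), int(hmdt)))
    return {
        key: (mean([t for t, _ in values]), mean([h for _, h in values]))
        for key, values in groups.items()
    }


# Hypothetical sample rows: two readings on the first day, one on the second.
rows = [
    (datetime(2016, 1, 1, 8, 0), "20", "50", "A"),
    (datetime(2016, 1, 1, 20, 0), "22", "54", "A"),
    (datetime(2016, 1, 2, 8, 0), "18", "60", "A"),
]
result = average_per_day_per_location(rows)
```

The result contains one entry per day and location, just like the @result rowset in the U-SQL script.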

Finally, we save the data in a new CSV file. In the OUTPUT statement, you can also add an ORDER BY clause. We want the header back in our output data and therefore we use 'outputHeader'.

// Save the sensor data to a new CSV file
OUTPUT @result
TO "/SensorData/Output/bitools_sample_data_AveragePerDayPerLocation.csv"
ORDER BY
    [Location] ASC
USING Outputters.Csv(outputHeader : true, quoting:false);

See below a screenshot of the full query in Visual Studio.

Visual Studio - U-SQL script

3) Run the Job
When you have built the query, click 'Submit' and the Job View screen appears automatically. This is similar to Job Details in the Azure Portal that we used earlier. But when you look closely, you see that Visual Studio offers more information than the portal, for example more details in the 'Job Summary' and more error details.

Visual Studio - Run U-SQL script

Error details
When you have an error in the U-SQL query, you can often see the details of this error directly in the 'Job View' screen. In case of a Vertex user code error, you do not immediately see the error details on this screen. If you want to see the details of this error, scroll down in the 'Job Summary' and click on 'Resources'. Then choose 'Profile' and search for the keyword 'jobError'. This row contains the details of the error.

Visual Studio - U-SQL Query error details

4) Result
Now go to the Azure portal and to your Azure Data Lake Store. Open the new file in 'Data Explorer'. Our output file is located in the folder 'SensorData' and then 'Output'. The result should look like this:

Azure Portal - View result in Data Lake Store

In this post we went deeper into building a U-SQL script using Visual Studio. In our opinion, you should develop as much as possible in Visual Studio, because of its benefits such as TFS integration and debugging.