Tuesday, 29 August 2017

Azure - Continue with Azure Data Lake for Big Data


Case
In an earlier post we showed you how to transform sensor data using Azure Data Lake. Many companies are gathering (or already have) a lot of Big Data spread across many different files. How can we use Azure Data Lake Analytics (ADLA) to handle these files?

Big Data and U-SQL

Solution
Just like in the previous post, the sensor data is already stored in an Azure Data Lake Store (ADLS). Next, we build and configure a U-SQL job. U-SQL is Microsoft's new Big Data query language that you can use in ADLA. Last time we developed in the Azure Portal, but there are other options. Last month, Microsoft released a Visual Studio plug-in for Azure Data Lake and Stream Analytics. This allows you, while writing U-SQL queries, to use other benefits of Visual Studio, such as Team Foundation Server (TFS), debugging and adding C# code for custom inputs and outputs.

In this case we have sensor data for one year. The data is stored in several files: one file per day. We want to create a U-SQL job that aggregates the data per day and then stores the result. For now we focus on the query itself. See here how to create an ADLA service/account and a new U-SQL job.

1) Install plug-in for Visual Studio
First we have to download and install the plug-in Microsoft Azure Data Lake and Stream Analytics Tools for Visual Studio. You can download the plug-in here. Besides creating and debugging U-SQL scripts, you can also build queries for Azure Stream Analytics jobs with this plug-in.

2) Write the Query
Open Visual Studio and create a new U-SQL project. Our U-SQL script is called 'multipleFiles'. The starting point is the query we made in an earlier post that extracts a single sensor file.

Variables
Because we have multiple files, we create a dynamic FROM clause using variables, in this case based on the folder path in ADLS. We use the following syntax for this: "bitools_sample_data_{*}.csv". This is a wildcard that will get you every file of the year (see the comment in the query below for the structure of the input file names). We also skip the first row, the headers.

// File naming convention: bitools_sample_data_01-01-2016.csv, bitools_sample_data_01-02-2016.csv etc.
// Create variable for input files
DECLARE @folderInput string = "/SensorData/Input/";
DECLARE @inputString string = @folderInput + "bitools_sample_data_{*}.csv";
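
As a side note (not part of the original script), the file-set pattern can also name the date parts in the file name instead of matching everything with a plain wildcard. The sketch below assumes the MM-dd-yyyy naming convention from the comment above; the named token then becomes a virtual column that you can use later in the script.

// Sketch (assumption): capture the date from the file name as a virtual column 'filedate'
DECLARE @inputPattern string = @folderInput + "bitools_sample_data_{filedate:MM}-{filedate:dd}-{filedate:yyyy}.csv";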

Extract
To retrieve the data from the files, we use an EXTRACT statement. In an earlier post, we extracted all data as strings. Now we extract the 'time' column as a DateTime (just like in the source file), using the variable we created earlier in the FROM clause.

// Extract the sensor data from CSV file (skip the header)
@sensorData = 
    EXTRACT 
        [time]                    DateTime
    ,   [dsplid]                  string
    ,   [dspl]                    string
    ,   [temp]                    string
    ,   [hmdt]                    string
    ,   [status]                  string
    ,   [location]                string
    ,   [EventProcessedUtcTime]   string
    ,   [PartitionId]             string
    ,   [EventEnqueuedUtcTime]    string
    FROM @inputString
    USING Extractors.Csv(skipFirstNRows:1);
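
If you use the named file-set pattern from the sketch above, 'filedate' is available as a virtual column in the EXTRACT and you can filter on it, so ADLA only reads the files that match. A hedged sketch (assuming that pattern and the same file layout) that limits the job to June 2016:

// Sketch (assumption): 'filedate' is the virtual column from the named file-set pattern
@sensorDataByDate = 
    EXTRACT 
        [time]                    DateTime
    ,   [dsplid]                  string
    ,   [dspl]                    string
    ,   [temp]                    string
    ,   [hmdt]                    string
    ,   [status]                  string
    ,   [location]                string
    ,   [EventProcessedUtcTime]   string
    ,   [PartitionId]             string
    ,   [EventEnqueuedUtcTime]    string
    ,   filedate                  DateTime // virtual column derived from the file name
    FROM @inputPattern
    USING Extractors.Csv(skipFirstNRows:1);

// Filtering on the virtual column lets ADLA skip files outside the range (partition elimination)
@sensorDataJune =
    SELECT *
    FROM @sensorDataByDate
    WHERE filedate >= new DateTime(2016, 6, 1) AND filedate < new DateTime(2016, 7, 1);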

Transformation
Next we aggregate the data into averages based on the 'time' and 'location' columns, using a SELECT statement. We convert the 'time' column to a date string, because we want to aggregate per day. We also give the columns suitable names. You may have noticed that we do not select all the columns, because we do not need every column from the source file.

// Aggregate the sensor data (average per location) and data type conversions
@result =
    SELECT
        time.ToString("yyyy-MM-dd") AS Date
    ,   AVG(Convert.ToInt32([temp])) AS Temperature
    ,   AVG(Convert.ToInt32([hmdt])) AS Humidity
    ,   [location] AS Location
    FROM @sensorData
    GROUP BY
        time.ToString("yyyy-MM-dd")
    ,   [location];
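
As an extra sanity check (our own addition, not part of the original script), you could also count the number of readings behind each average. COUNT, MIN and MAX are built-in U-SQL aggregates just like AVG:

// Sketch: same aggregation, with a row count per day and location to verify the input
@resultWithCount =
    SELECT
        time.ToString("yyyy-MM-dd") AS Date
    ,   AVG(Convert.ToInt32([temp])) AS Temperature
    ,   AVG(Convert.ToInt32([hmdt])) AS Humidity
    ,   COUNT(*) AS Readings
    ,   [location] AS Location
    FROM @sensorData
    GROUP BY
        time.ToString("yyyy-MM-dd")
    ,   [location];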

Output
Finally, we save the data to a new CSV file. In the OUTPUT statement, you can also add an ORDER BY clause. We want the header back in our output data and therefore we use the 'outputHeader' option.

// Save the sensor data to a new CSV file
OUTPUT @result
TO "/SensorData/Output/bitools_sample_data_AveragePerDayPerLocation.csv"
ORDER BY 
    [Location] ASC
USING Outputters.Csv(outputHeader : true, quoting:false);
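
The ORDER BY in the OUTPUT statement can also take multiple columns. A small variation (with a file name we made up for this example) that sorts by day first and then by location:

// Sketch: order the output by day first, then by location (hypothetical second output file)
OUTPUT @result
TO "/SensorData/Output/bitools_sample_data_AveragePerDayPerLocation_sorted.csv"
ORDER BY 
    [Date] ASC
,   [Location] ASC
USING Outputters.Csv(outputHeader : true, quoting : false);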

Below is a screenshot of the full query in Visual Studio.

Visual Studio - U-SQL script

3) Run the Job
When you have built the query, click 'Submit' and the Job View screen automatically appears. This is similar to the Job Details screen in the Azure Portal that we used earlier. But when you look closely, you see that Visual Studio offers more information than the portal, for example more details in the 'Job Summary' and better error details.


Visual Studio - Run U-SQL script

Error details
When you have an error in the U-SQL query, you can often see the details of this error directly in the 'Job View' screen. In case of a Vertex user code error, you do not immediately see the error details on this screen. If you want to see the details of such an error, scroll down in the 'Job Summary' and click on 'Resources'. Then choose 'Profile' and search for the keyword 'jobError'. This row contains the details of the error.

Visual Studio - U-SQL Query error details

4) Result
Now go to the Azure portal and to your Azure Data Lake Store. Open the new file in 'Data Explorer'. Our output file is located in the folder 'SensorData' and then 'Output'. The result should look like this:


Azure Portal - View result in Data Lake Store

Conclusion
In this post we went deeper into building a U-SQL script using Visual Studio. In our opinion, you should develop as much as possible in Visual Studio, because of the well-known benefits of this tool, such as TFS integration and debugging.