Saturday, 25 February 2017

Azure - Understanding Stream Analytics Data Lake Store Destination

Case
How do you use Azure Data Lake Store to store Stream Analytics data and why would you do that?

Solution
In an earlier post we set up a Stream Analytics Job where sensor data is sent to a Blob file. In this case we create a new Stream Analytics Job with a Data Lake Store as output and the same sensor data as input. Although the Input and Output are different, the query stays the same as in other Stream Analytics Jobs. You can find the configuration of the Stream Analytics Job query here.
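As a reminder, a minimal sketch of what such a query could look like is shown below. The input and output aliases (IoTHubInput, DataLakeOutput) and the sensor fields (DeviceId, Temperature) are illustrative assumptions, not the names from the actual job; see the earlier post for the real query.

-- aggregate the incoming sensor readings into windows of 10 seconds
-- and write the results to the Data Lake Store output
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp AS MeasurementTime
INTO
    [DataLakeOutput]
FROM
    [IoTHubInput]
GROUP BY
    DeviceId,
    TumblingWindow(second, 10)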

A Data Lake Store is a scalable repository optimized for storing IoT data, log files and other large datasets for Big Data scenarios. It contains folders and these folders contain the files (data). As said before, you can also store (sensor) data using Blob storage. These two storage options are similar, but there are some important differences, which we will explain later in this post.

Overview Data Lake Store with different types of data

1) Create a Data Lake Store
First we have to create the Data Lake Store. Click on New (+ icon) in the portal; under the Storage category you will find Data Lake Store. Give it a suitable name. Next we choose the Resource Group we created earlier when setting up the IoT Hub. The benefit of choosing the same group is that, among other things, the permissions are the same. At this time there are not many options available for Location and Pricing, so we keep the defaults.

Azure Portal - Create Data Lake Store

2) Create the Stream Analytics Job
Before creating this new job, we already added a new consumer group to our IoT Hub, called 'DataLake'. Using multiple consumer groups makes it possible for several consumer applications to read data from this IoT Hub independently. Click here to see where you can add/manage consumer group(s).

Note:
The Azure Portal is still in development, so adding a new consumer group is now done in a different place than shown in an earlier post. See the screenshot below.

Azure Portal - Adding Consumer Groups

Now we can create the job. We choose the same Resource Group as the IoT Hub and the Storage account, just like we did when creating the Data Lake Store. Our location is the Netherlands, so we choose West Europe.

Azure Portal - Create Stream Analytics Job

3) Configure the Stream Analytics Job
We start by defining the Input. We choose 'Data stream' as Source Type, because the sensor data is an ongoing stream coming from the IoT Hub. Under Source we choose 'IoT hub'. In our case we have one IoT Hub and it appears automatically; when you have multiple IoT Hubs you can choose one from the list. Next we choose 'DataLake' as Consumer group, the one we created earlier. Finally we choose 'JSON' as Event serialization format.

Now we can specify the Output. Give it a suitable name and choose 'Data Lake Store' as Sink. The corresponding Account Name (in this case the one we created earlier) will be selected automatically. Next we enter a path prefix that determines where the files are stored in our Data Lake Store account. Optionally, you can include the date and time in this path. The data will then be split into separate files per day and per hour, which keeps the storage better organized, as shown in the example below. Finally we choose the 'JSON' format.
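For example, a path prefix pattern along these lines (the folder name 'sensordata' matches the folder you will see in the Result section; the date and time formats themselves are picked in separate dropdowns, for instance YYYY/MM/DD and HH):

sensordata/{date}/{time}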

As mentioned earlier, in this post we only configure the Input and Output of the job; the query itself stays the same. With a valid query in place, we can now run the job.

Azure Portal - Configure Stream Analytics Job

Result
After running the Stream Analytics Job, the data is stored in the Data Lake Store. You can see this in the Azure portal: open the Data Lake Store and go to Data Explorer. There is one folder named 'sensordata', containing subfolders for year, month, day and hour. For now there is only one file in the 'hour' subfolder, but every hour a new file will be created (as long as your Stream Analytics Job is running). This is exactly what we configured earlier. Inside each file, a record is written every 10 seconds. It works!
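Purely as an illustration of what such a file contains (the field names follow the hypothetical query sketch at the top of this post, not the actual output), every 10 second window adds one JSON line similar to this:

{"DeviceId":"MySensor01","AvgTemperature":21.5,"MeasurementTime":"2017-02-25T14:00:10Z"}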

Azure Portal - Result Data Lake Store

Differences between Blob Storage and Data Lake Store
The first big difference is the size limits: a Data Lake Store has no limitations on account size, file size or number of files, while Blob storage does have such restrictions. In addition, Data Lake Store has built-in Hadoop integration, which (along with the unlimited storage) makes it suitable for storing Big Data and analyzing that data afterwards. Another difference is the authentication: Blob storage works with generated storage access keys, while a Data Lake Store uses Azure Active Directory.

Overall a Data Lake Store offers more possibilities than Blob storage and is optimized for Big Data purposes. Blob Storage is a more general-purpose store for scenarios like backups or media files (for streaming). The starting prices are lower for Blob storage, but there are several pricing options for a Blob Storage account, so depending on your choices the monthly charges per GB can end up higher than those of a Data Lake Store. Click here for more details about the differences and prices between a Data Lake Store and Blob storage.

Conclusion
Azure Data Lake Store is very useful in Big Data scenarios because it combines storage (up to more than 1 petabyte) with the ability to analyze that data. This can be done with Hadoop (built-in integration) or with Azure Data Lake Analytics, which is specifically optimized to work with Azure Data Lake Store. From this perspective it offers more than, for example, Blob storage.

Tuesday, 21 February 2017

Use BIML and csv files to create tables

Case
The case is about importing flat files (CSVs) without having the metadata available upfront. Because BIML always checks whether the tables are accessible before creating the packages, the first step is to create the tables with BIML and the second step is to create the SSIS packages that transport the data.

  1. Creating tables in the database 
  2. Packages to fill this database

Because of the size of the solution I have created two separate solutions: one for creating the tables and a second one for creating the SSIS packages. You can click on the link to go to the other solution (which I will publish later this month).


Solution - Creating tables with BIML
In this solution we create a BIML script that is going to create the tables, with all columns defined as strings.
We have to create two BIML scripts: the first one defines the table definitions and the second one creates the actual package.

1) CSV files
For this example we are using two CSV files (age and sickleave), which are comma separated and have the column names on the first row. These column names will be used in the create table statement.
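As an illustration, the first line of age.csv presumably looks something like this (the column names can also be seen in the generated CREATE TABLE statement at the end of this post):

AgeID,AgeFrom,AgeTo,AgeCategoryEmployee,AgeCategoryClient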

The drop folder

Content of the csv file


2) Tabledefinitions
The first biml script is called “1_TableDefinitions.biml”.
In this biml we define the path where the CSV files are located, an array with the names of the csv files and also some strings that we are going to use further on in the code.
We use two “foreach loops”: the first one loops through the array of files and the second one loops through the actual file (to extract the column names).

Normally (without the loops) the code would look like this:

<Tables>
  <Table Name="Man_Age" SchemaName="dummy_STG.dbo">
    <Columns>
      <Column Name="AgeID" DataType="Int32" IdentityIncrement="1" IdentitySeed="1" />
      <Column Name="AgeFrom" DataType="Int32" />
      <Column Name="AgeTo" />
      <Column Name="AgeCategoryEmployee" DataType="String" Length="255" />
    </Columns>
  </Table>
</Tables>

By default BIML uses Int32 as the datatype (that is why the AgeTo column above has no explicit DataType); in this case we define every column as a String.
Now we put the loops in place and the complete code looks like this:
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

<#  
    string Prefix="Man";
    
    // the location of the csv files
    string path = @"D:\Drop\Man";
    // Put all the filenames with the extension csv in a string array
    string[] myFiles = Directory.GetFiles(path, "*.csv");
    // string that will be filled with the file name (used as table name)
    string fileName;
    // string array for columnnames extracted from CSV
    string[] myColumns;
#>
        <Connections>
            <OleDbConnection 
            Name="OLEDB_STG_<#=Prefix#>" 
            ConnectionString="Data Source=APPL43;Initial Catalog=dummy_STG;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;">
            </OleDbConnection>
        </Connections>
        <Databases>
             <Database ConnectionName="OLEDB_STG_<#=Prefix#>" Name="dummy_STG"/>
        </Databases>
       <Schemas>
              <Schema Name="dbo" DatabaseName="dummy_STG" Owner="dbo"/>
       </Schemas>
       <Tables>
            <!-- loop through the array of files -->
            <#  foreach(string filePath in myFiles) 
                {
               // extract the filename from the path to use as tablename     
               fileName = Path.GetFileNameWithoutExtension(filePath);
            
            #>
            <Table Name="<#=Prefix#>_<#=fileName#>" SchemaName="dummy_STG.dbo">
                <Columns>
                   <!-- loop through the file looking for the column names -->
                    <#
                    // read the first row of the csv to extract the column names
                    // and split on comma to create an array
                    StreamReader myFile = new StreamReader(filePath);

                    myColumns = myFile.ReadLine().Split(',');
                    // close file after reading first line
                    myFile.Close();

                    // Loop through column array
                    foreach(string myColumn in myColumns) 
                    {  
                    #>
                      <Column Name="<#=myColumn#>" DataType="String" Length="255"></Column>
                 <#  }   #>
                </Columns>    
            </Table>
            <# }#>
       </Tables>
</Biml>
 
<#@ template language="C#" hostspecific="true"#>
<#@ import namespace="System.IO"#>


3) CreateTable
Next we create the second biml script, called 2_CreateTables.biml, which creates the actual package containing the create statements that generate the tables.
BIML has a method to script a SQL table: “RootNode.Tables.First().GetTableSql();”.
We use this method to generate the SQL create statement for each table.

The code looks like this

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Packages>
        <Package Name="CreateTables" AutoCreateConfigurationsType="None" ConstraintMode="Linear">
            <Tasks>
                <# 
                // Loop through the table definitions of the first biml
                foreach(var table in RootNode.Tables) {#>
                <ExecuteSQL Name="SQL - Drop_Create <#=table.Name#>" ConnectionName="<#=table.Connection.Name#>">
                    <DirectInput>
                        <#=table.GetTableSql()#>    
                    </DirectInput>
                </ExecuteSQL>
                <# } #>
            </Tasks>
        </Package>
    </Packages>
</Biml>

<!--Includes/Imports for C#-->
<#@ template language="C#" hostspecific="true"#>
<#@ import namespace="System.Data"#>
<#@ import namespace="System.Data.SqlClient"#>


We have now created two biml scripts, 1_TableDefinitions.biml and 2_CreateTables.biml. Now comes the important part (I am using BimlExpress): generating the package. First click on 1_TableDefinitions.biml and then on 2_CreateTables.biml; with both biml scripts selected, right-click on 1_TableDefinitions.biml and choose Generate SSIS Packages. If you do this in a different order, you will get an empty SSIS package.

Generate SSIS package

Below you can see the result of your BIML scripts: a package with an execute SQL Task for each table you need to create.
Visual Studio

The actual create statement looks like this

SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO
-------------------------------------------------------------------
IF EXISTS (SELECT * from sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Man_Age]') AND type IN (N'U'))
DROP TABLE [dbo].[Man_Age]
GO

CREATE TABLE [dbo].[Man_Age]
(
-- Columns Definition
 [AgeID] nvarchar(255) NOT NULL 
, [AgeFrom] nvarchar(255) NOT NULL 
, [AgeTo] nvarchar(255) NOT NULL 
, [AgeCategoryEmployee] nvarchar(255) NOT NULL 
, [AgeCategoryClient] nvarchar(255) NOT NULL 

-- Constraints

)
ON "default"
WITH (DATA_COMPRESSION = NONE)
GO
-------------------------------------------------------------------


Summary
We created two biml scripts: one that creates the table definitions and one that creates the actual package. The result in Management Studio looks like the screenshot below.
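If you would rather verify the result with a query than via the Object Explorer, a quick check along these lines (assuming the 'Man' prefix used in the biml script; this query is not part of the solution itself) lists the generated tables:

-- list all tables generated with the 'Man' prefix
SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE t.name LIKE 'Man[_]%'
ORDER BY t.name;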

Management Studio result

In the next blog I’m going to explain how to create the SSIS packages that transport the data from the csv files.