Sunday 30 July 2017

Azure SQL Database vs Azure SQL Data Warehouse

Case
As we slowly move from on-premises data warehouses with Microsoft SQL Server to cloud data warehouses in Microsoft Azure, we need to know more about the various options in Azure. You have probably already used an Azure SQL Database, but Microsoft also introduced Azure SQL Data Warehouse. What are the differences between these two databases?
Azure SQL DB vs Azure SQL DW

Solution
Back in 2013, Microsoft introduced Azure SQL Database, which has its origin in the on-premises Microsoft SQL Server. In 2015 (public availability followed in July 2016) Microsoft added SQL Data Warehouse to the Azure cloud portfolio, which has its origin in the on-premises Microsoft Analytics Platform System (APS): a Parallel Data Warehouse (PDW) appliance that combines Massively Parallel Processing (MPP) technology with standard hardware. It is the 'big brother' of SQL Server, but with a slightly different purpose.

In this post we will briefly describe the differences between these two Microsoft Azure services, but first Microsoft's own definitions:
  • Azure SQL Database is a relational database-as-a-service using the Microsoft SQL Server Engine (more);
  • Azure SQL Data Warehouse is a massively parallel processing (MPP) cloud-based, scale-out, relational database capable of processing massive volumes of data (more).

Differences
1) Purpose: OLTP vs OLAP
Although both Azure SQL DB and Azure SQL DW are cloud-based services for hosting data, their purpose is different. The biggest difference is that SQL DB is specifically meant for Online Transaction Processing (OLTP): operational data with a lot of short transactions like INSERT, UPDATE and DELETE, executed by many users and/or processes. The data is most often stored highly normalized across many tables.

On the other hand, SQL DW is specifically meant for Online Analytical Processing (OLAP) and data warehousing: consolidated data with a lower transaction volume, but more complex queries. The data is most often stored de-normalized in fewer tables, using a star or snowflake schema.

2) Architecture
To make the differences clearer, here is a quick preview of the architecture of Azure SQL Data Warehouse, in which you see a whole collection of Azure SQL Databases with the storage separated from them. The maximum number of compute nodes at the moment is 60.
    Architecture SQL DW: Decouples (Blob) storage from compute (SQL DB)

    3) Storage size
    The current size limit of an Azure SQL Database is 4TB, but it has been growing over the past few years and will probably end up around 10TB in the near future. The Azure SQL Data Warehouse, on the other hand, has no storage limit at all (only the limit of your wallet), because the storage is separated from the compute.

    4) Pricing
    The pricing is also quite different. Where Azure SQL DB starts at €4,20 a month, Azure SQL DW starts at around €900,- a month, excluding the cost of storage (which is included in SQL DB). The storage costs for Azure SQL DW are around €125,- per TB per month. And the maximum cost of a single SQL DB is around €13500,- a month, where SQL DW ends at around a massive €57000,- a month (excl. storage). But when you look at the architecture above, it should be no surprise that SQL DW is more expensive than SQL DB, because it consists of multiple SQL DBs.

    However, SQL DW has one big trick up its sleeve that SQL DB doesn't have: you can pause it completely, and then you only pay for storage. If you start your SQL DW for your ETL job and pause it right after you have processed your Azure Analysis Services model, you only need it for a small percentage of the month.

    Note: prices are from July 2017

    5) DTU vs DWU
    SQL DB has 15 different pricing tiers, which specify the number of Database Transaction Units (DTU) and the storage size/type:
    - Basic
    - Standard (S0, S1, S2, S3)
    - Premium (P1, P2, P3, P4, P6, P11, P15)
    - Premium RS (PRS1, PRS2, PRS4, PRS6)
    Basic has only 5 DTUs and the highest number of DTUs is, at the time of writing, 4000.
    The term DTU is a bit vague. It is a mysterious combination of RAM, CPU and read/write rates, but basically, if you want to double the performance of your current database, you just double the number of DTUs for your database.

    SQL DW has 12 different pricing tiers and uses Data Warehouse Units (DWU) to specify the performance level.
    - DWU100, 200, 300, 400, 500, 600, 1000, 1200, 1500, 2000, 3000, 6000
    The term DWU is a little less vague, because if you divide that number by 100 you have the number of compute nodes available for that pricing tier. On the other hand, the exact combination of CPU, memory and IOPS per compute node is unknown.

    Because both services have a different purpose it is a bit strange to compare the hardware, but according to this MSDN blog post 1 DWU is approximately 7,5 DTU.

    But there is also a similarity: for both services you can use the same kind of script to change the pricing tier on the fly, either to give the performance a real boost when needed or to save money during the quiet hours.
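
    As a minimal T-SQL sketch (the database names MyDatabase and MyDataWarehouse are hypothetical; run the statements against the master database of the logical server), changing the tier comes down to setting a new service objective:

    -- Scale an Azure SQL Database to Standard S3
    ALTER DATABASE MyDatabase MODIFY (SERVICE_OBJECTIVE = 'S3');

    -- Scale an Azure SQL Data Warehouse to 400 DWU
    ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400');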

    6) Concurrent connections
    Although SQL DW is a collection of SQL Databases, the maximum number of concurrent connections is much lower than with SQL DB. SQL DW has a maximum of 1024 active connections, where SQL DB can handle 6400 concurrent logins and 30000 concurrent sessions. This means that in the exceptional case where you have over a thousand active users for your dashboard, you should probably consider SQL DB to host the data instead of SQL DW.
    For more details see the SQL DB Resource Limitations and SQL DW Resource Limitations.

    7) Concurrent queries
    Besides the maximum number of connections, the number of concurrent queries is also much lower. SQL DW can execute a maximum of 32 queries at one moment, where SQL DB can have 6400 concurrent workers (requests). This is where you see the difference between OLTP and OLAP.
    For more details see the SQL DB Resource Limitations and SQL DW Resource Limitations.

    8) PolyBase
    Azure SQL Data Warehouse supports PolyBase. This technology allows you to access data outside the database with regular Transact-SQL. It can, for example, use a file in an Azure Blob Storage container as an (external) table. Other options are importing and exporting data from Hadoop or Azure Data Lake Store. Although SQL Server 2016 also supports PolyBase, Azure SQL Database does not (yet?) support it.
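
    A minimal sketch of what that could look like in SQL DW. The storage account, container, credential, secret and column names below are hypothetical, and the master key, credential, external data source and file format only have to be created once:

    -- One-time setup for PolyBase over Azure Blob Storage
    CREATE MASTER KEY;

    CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
    WITH IDENTITY = 'user', SECRET = '<storage account key>';

    CREATE EXTERNAL DATA SOURCE SensorBlobStorage
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://sensordata@mystorageaccount.blob.core.windows.net',
        CREDENTIAL = BlobStorageCredential
    );

    CREATE EXTERNAL FILE FORMAT CsvFileFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
    );

    -- The external table points to a folder with CSV files in the container
    CREATE EXTERNAL TABLE dbo.SensorDataExternal
    (
        [time]     VARCHAR(50),
        [location] VARCHAR(50),
        [temp]     VARCHAR(10),
        [hmdt]     VARCHAR(10)
    )
    WITH (
        LOCATION = '/SensorData/Input/',
        DATA_SOURCE = SensorBlobStorage,
        FILE_FORMAT = CsvFileFormat
    );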

    9) Query language differences
    Although SQL DW uses SQL DB in the background, there are a few minor differences when querying or creating tables:
    - SQL DW cannot use cross-database queries, so all your data should be in the same database.
    - SQL DW can use IDENTITY, but only for INT or BIGINT. Moreover, the IDENTITY column cannot be used as part of the distribution key (see the sketch below this list).
    - Also see this SQL DW list of unsupported table features.
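
    As a minimal sketch, a SQL DW table that respects those rules could look like this (the table and column names are hypothetical):

    -- IDENTITY is allowed on INT/BIGINT in SQL DW,
    -- but the IDENTITY column cannot be the distribution column
    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT IDENTITY(1,1) NOT NULL,
        CustomerCode NVARCHAR(20)      NOT NULL,
        CustomerName NVARCHAR(100)     NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerCode),
        CLUSTERED COLUMNSTORE INDEX
    );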

    10) Replication
    SQL DB supports active geo-replication. This enables you to configure up to four readable secondary databases in the same or a different region. SQL DW does not support active geo-replication, only Azure Storage replication. However, that is not a live, readable, synchronized copy of your database; it's more like a backup.
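
    For SQL DB, adding such a secondary can be done with a single T-SQL statement on the master database of the primary server (the database and server names below are hypothetical):

    -- Add a readable geo-replicated secondary on another logical server
    ALTER DATABASE MyDatabase
        ADD SECONDARY ON SERVER MySecondaryServer
        WITH (ALLOW_CONNECTIONS = ALL);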

    11) In-Memory OLTP tables
    SQL DB supports In-Memory OLTP (in the Premium tiers). SQL DW is meant for OLAP and does not support it.
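
    A minimal sketch of a memory-optimized table in SQL DB (the table and column names are hypothetical):

    -- Memory-optimized tables require a nonclustered primary key
    CREATE TABLE dbo.ShoppingCart
    (
        CartId      INT       NOT NULL PRIMARY KEY NONCLUSTERED,
        UserId      INT       NOT NULL,
        CreatedDate DATETIME2 NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);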

    12) Always Encrypted
    SQL DB supports Always Encrypted to protect sensitive data. SQL DW does not support it.
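
    As an impression, an encrypted column in SQL DB could be defined like this. It assumes a column master key and a column encryption key (MyCEK) have already been created; the table and column names are hypothetical:

    CREATE TABLE dbo.Customer
    (
        CustomerId INT NOT NULL,
        -- Deterministic encryption requires a BIN2 collation
        [SSN] CHAR(11) COLLATE Latin1_General_BIN2
            ENCRYPTED WITH (
                COLUMN_ENCRYPTION_KEY = MyCEK,
                ENCRYPTION_TYPE = DETERMINISTIC,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL
    );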

    Conclusion
    Although Azure SQL DB looks much cheaper on a monthly basis, this doesn't mean you should always choose SQL DB by default. One big advantage of SQL DW is that you can pause it. For example, a stage database is only used during the ETL process, so why should you always have this database up and running? Or why should your datamart stay online 24/7 if your end users only use Analysis Services to browse the data?

    On the other hand, the original purpose of both services is different (OLTP vs OLAP), but this doesn't mean you should always use Azure SQL DW for your data warehouses. Depending on several factors like data size, complexity, required up-time and budget, Azure SQL DB could also host your data warehouse. You could even mix them within one project: for example Stage and Historical/Persistent Stage in Azure SQL DW and your datamart in Azure SQL DB.



    Please leave a comment if you know more significant differences that are worth mentioning.

    Sunday 23 July 2017

    Azure - Use Azure Data Lake for Big Data

    Case
    We have collected sensor data and we want to use it in a Data Warehouse (DWH). Because we do not want to store raw data in our database, we need to prepare and aggregate the data first. How can we accomplish this with Azure Data Lake?
    
    
    About Azure Data Lake

    Solution
    We use Azure Data Lake Store (ADLS) to store the sensor data. As we know from an earlier post, ADLS is extremely suitable for storing unstructured data, and we showed an example of how you can store this sensor data. In this first example the data, one file per day, is already stored in ADLS. You can download the file here. Next we are going to aggregate this data per day and create a file that is ready to load.

    To accomplish this we will use another feature of Azure Data Lake, called Azure Data Lake Analytics (ADLA). With this and Data Lake Store, Microsoft offers new features similar to Apache Hadoop to deal with petabytes of Big Data. The advantage of Data Lake Analytics is that it supports Hadoop, but it also introduces a T-SQL-like language called U-SQL. This is Microsoft's new Big Data query language: essentially T-SQL with a little touch of C# to add even more features to the language. Click here for more details about this.

    Note:
    If you don't have sensor data ready, you can download a sensor generator for free. This 'SensorEventGenerator' can be found here. In addition, you have to create an Event Hub / IoT Hub to send this sensor data to Azure. Click here for more information. In our case we have generated our own data with a similar program and uploaded it to ADLS.


    1) Create Data Lake Analytics
    First we have to create a new Data Lake Analytics account. We give the account a suitable name and we choose the same resource group as the ADLS account uses. As a last step we need to choose the Data Lake Store, in our case 'bitoolsadls', where the sensor files are stored.

    Azure Portal - Create Data Lake Analytics

    Important to know is the pricing. Creating the service is free and the account itself doesn't cost anything; you only start paying when you run U-SQL jobs. You then pay for computing power, measured in Analytics Units (AUs). More information about pricing can be found here. For example, if we want to run a job for 24 hours with 1 AU (assuming that is enough), this will result in the following costs:

    Data Lake Analytics Cost Indication
    Depending on the amount of data, the number of Analytics Units (AUs) may need to be increased.

    2) Create a new Job
    Now everything is set up, we must create a new U-SQL job. Go to the Data Lake Analytics account you just created and click on 'New Job'. Give it a suitable name. Notice that you have two other options to change: Priority and AUs. With 'Priority' you determine the importance of the job: for example, if the job has Priority 1, it will always start first. The other option is the number of Analytics Units, as explained earlier; increasing the AUs will also give you a cost indication. For now we leave both on their defaults, because it is our only job.

    3) Write a Query
    Let's get started with the query. As mentioned before, U-SQL is similar to T-SQL: you can write SELECT statements with the familiar FROM and WHERE clauses and transfer the data to a new location. In addition, you use C# for data type conversions and, for example, to determine today's date. In this case we retrieve the raw data, aggregate it and store it in a new CSV file in a new ADLS folder.

    To retrieve the data from the file, use EXTRACT and FROM. You can also use variables in the FROM clause to avoid hard-coded paths, but for now we fill in our file path hard coded. We do not need to fill in the Data Lake name itself, because the default is our ADLS account. We also have to use the USING statement to specify the extraction format, in this case a CSV file. More details about this statement here. Important to know is that we specify that the first row must be skipped, because that's the header. The result of this query will be stored in the variable '@sensorData'.
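
    As a small U-SQL sketch of that variable option (the variable name is hypothetical, the file path is the same sample file), the hard-coded path could be replaced like this:

    // Declare the input path once and reuse it in the EXTRACT statement
    DECLARE @inputPath string = "/SensorData/Input/bitools_sample_data_01-01-2016.csv";

    @sensorData =
        EXTRACT [time] string, [dsplid] string, [dspl] string, [temp] string, [hmdt] string,
                [status] string, [location] string, [EventProcessedUtcTime] string,
                [PartitionId] string, [EventEnqueuedUtcTime] string
        FROM @inputPath
        USING Extractors.Csv(skipFirstNRows:1);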

    Next we want to aggregate the data, because the granularity of the raw data is per second. We take the averages of temperature (temp) and humidity (hmdt) per day and per location. We need a SELECT statement on the '@sensorData' variable for this. Besides the aggregations, we are doing some data type conversions, because extracting the data as string is currently the preferred way. We also use GROUP BY to divide the result into groups (per day and per location). You may have noticed that we do not select all the columns, because we do not need every column from the source file.

    Once we have retrieved and transformed the data, we want to save it in a new file. For that we use the OUTPUT statement and define the folder path; this folder will be created automatically together with the new file. Because we are also saving the new file in CSV format, we use the built-in CSV outputter in the USING clause, this time without the 'skipFirstNRows' parameter. Every time you start the job, it will overwrite the destination file.

    The query:

    // Extract the sensor data from CSV file (skip the header)
    @sensorData = 
        EXTRACT 
            [time]                    string
        ,   [dsplid]                  string
        ,   [dspl]                    string
        ,   [temp]                    string
        ,   [hmdt]                    string
        ,   [status]                  string
        ,   [location]                string
        ,   [EventProcessedUtcTime]   string
        ,   [PartitionId]             string
        ,   [EventEnqueuedUtcTime]    string
        FROM "/SensorData/Input/bitools_sample_data_01-01-2016.csv"
        USING Extractors.Csv(skipFirstNRows:1);
    
    // Aggregate the sensor data (average per location) and data type conversions
    @result =
        SELECT 
            DateTime.ParseExact([time], "yyyy-MM-dd HH:mm:ss", null).Date AS Date
        ,   AVG(Convert.ToInt32([temp])) AS Temperature
        ,   AVG(Convert.ToInt32([hmdt])) AS Humidity
        ,   [location] AS Location
        FROM @sensorData
        GROUP BY
            DateTime.ParseExact([time], "yyyy-MM-dd HH:mm:ss", null).Date
        ,   [location];
    
    // Save the sensor data to a new CSV file
    OUTPUT @result
    TO "/SensorData/Output/bitools_sample_data_01-01-2016_AveragePerDayPerLocation.csv"
    USING Outputters.Csv(outputHeader : true, quoting:false);
    

    Note 1:
    The EXTRACT and OUTPUT statements use absolute or relative file paths; that's why we cannot simply use SELECT to retrieve the data directly from a file. We also use the C# syntax (//) to add comments. Click here and here for more information about the U-SQL language.

    Note 2:
    PolyBase does not handle column headers that well. If you want to read this output file with PolyBase, you could consider removing "outputHeader : true" from the OUTPUT part of the query.


    Azure Portal - Create the U-SQL Job

    4) Run the Job
    When the query is done, click on 'Submit Job'. A new screen will appear where we can monitor the running job. When the job has finished successfully (after the 'Finalizing' phase), we can preview the output file. Notice that the new file has 10 rows: one row per location with the average temperature and humidity.

    Azure Portal - Run the U-SQL Job

    Finally, let's see if the file is stored in the Azure Data Lake Store. Go to your ADLS and click on 'Data Explorer'. Find your output folder and there it is!

    Azure Portal - See result in Data Lake Store

    Summary
    We showed you how Azure Data Lake is suitable for storing and transforming Big Data, in this case sensor data. Of course, there are more ways to accomplish this within the Cortana Intelligence Suite, for example with Stream Analytics.

    In this post we used one single file, but companies often have hundreds or thousands of files. In a next post we will show you how to handle multiple input files in Azure Data Lake Analytics.

    Saturday 1 July 2017

    SSMS Snack: Start SSMS as different user

    Case
    My company uses different Windows users for the various DWH environments (Development, Test, Acceptance & Production). How can I connect to a SQL Server instance in SQL Server Management Studio (SSMS) with a different user so that I can still use Windows Authentication?
    Windows Authentication

    Solution
    The quick solution is to hold the Shift key while right-clicking the SSMS shortcut in the start menu. Then the 'Run as different user' option appears, which allows you to enter different credentials. After that, the user name field for Windows Authentication changes.
    Right click the shortcut and choose Run as different user

    Now you can run SSMS with a different account

    Now the User name changes

    Runas shortcut
    A more permanent solution is to create a new shortcut with a runas command in it. Instead of the standard SSMS command (see Target):
    "C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\Ssms.exe" 

    you use:
    RUNAS /user:myDomain\myUserName /savecred "C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\Ssms.exe"
    (110 = SQL 2012 / 120 = SQL 2014 / 130 = SQL 2016 / 140 = SQL 2017)

    • The /user: option allows you to use a different user.
    • The /savecred option will save the password after the first time (not in Windows 7 Home).
    • For more options check out this site or execute "runas /?" in the command prompt to show all options. Some forums/blogs recommend the /netonly option to only use the provided user for remote access, but that often doesn't work, causing SSMS not to start.

    Change Target field and optionally the Comment

    To finish this off: click on the Change Icon button and browse to SSMS.exe to select the familiar icon.
    Change icon to finish off the shortcut

    The first time you will see a command prompt where you have to enter your password. If you added the /savecred option, then the second time you will only see a short 'flash' of the command prompt before SSMS starts. You can get rid of that by changing the Run property of the shortcut to 'Minimized' (after the first execution).
    Enter password

    SSMS commandline options
    You can even extend this solution with some extra options for SSMS itself, like providing the instance and database name.
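
    For example, the following switches (with hypothetical server and database names) open SSMS already connected via Windows Authentication; they can be appended to the Ssms.exe path inside the shortcut:

    Ssms.exe -S myServer\myInstance -d myDatabase -E -nosplash
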
    SSMS command line options

    Summary
    In this post you saw how to start SSMS with a different domain user so that you can still use Windows Authentication. This not only works for SSMS, but also for other programs like Visual Studio:
    RUNAS /user:myDomain\myUserName /savecred "C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\devenv.exe"