Microsoft BI Tools: 2016

Thursday, 15 December 2016

Azure Event Hub vs IoT Hub

Case
During our journey we noticed that in our team there is some confussion about the differences between an Event hub and an IoT hub. After some research we find out that there are a lot of similarities but also differences. In this blog I will explain the concept of an Event/IoT hub and a best practice when to use an event hub and when to use an IoT hub.
The goal of this article is to give a global image of the Event hub and IoT hub. Please follow the links for more in-depth information.

Solution

Before we can find out what the differences and similarities are, the first question that is: “what is an event hub, and how do we use it?”

1) Event hub

An Event hub is a gateway to the Azure cloud. It’s main purpose is to collect the incoming data and pas it to the Azure cloud, as seen in figure 1. An Event hub process the income data, but on a low profile scale. It doesn’t have advanced sequencing or delivery guaranties. Therefore Event hubs are a high scale messaging service, with a low latency and a high reliability. In our cases we use an event hub to collect the data from the raspberry, but it can also be used in other cases, like collecting data from console games or other telemetry.

Figure1: Event Hub

Protocol

The connected devices/entities are called: Event publisher

Connecting Event publisher to the Azure event hub is easy, because it support the HTTP/AMQP protocols. The most used protocol is AMQP protocol. See here for more information about this subject

Partition

The Event hub uses partitions. Partitions are an ordered sequences that keep events in the Event hub. This sequence is based on the ‘first in first out’ principal. The number of partitions that can be used at the same time is between 2 and 32. Please note that when you create an event hub, you have to set the number of partitions. Strangely it cannot be changed afterwards. Mostly the number of partitions are ased on the amount of readers you are going to use (meaning the use of partitions further in the process). The default number of partitions is 4.

In short an Event hub is a high scale telemetry, one way, service, using the HTTP/AMQP protocol and is generally available worldwide.

For more information ‘how to develop with event hub’ see the programming guide
Setting up an Event hub follows in an other blog (comming soon).

2) IoT Hub

But with the ‘grow of Iot’ there came additional needs: control, device authentication and authorization, protocol translation, etc.

Since an Event hub is an one way point of entry it’s limited in the additional needs as mentioned before..

Figure2: IoT hub

And this is where the IoT Hub kick in. The IoT hub can do the same things as an Event hub, but it’s capable of much more. The most important thing, it can handle bi directional traffic, meaning that an IoT hub is capable of sending data back to the connected devices.

Now it’s possible to command and control the devices, e.g. you can send a disconnect event to the device or a threshold event, e.g. when the machine reach a certain temperature that you can shutdown the machine.

Devices can be registered, so you can identify devices to check whether they are allowed to connect. It’s possible to connect more than 10 million devices (where the Event hub can handle up to 1 million devices) , it is also easy to import bulk device identities (which is easy when you are use 10 million devices ) .

The IoT hub can handle device error reporting, e.g. you can check the failed connection attempts per device. This can result in disconnection/disabling the device/Sensor in the IoT hub (so the sensor isn’t allowed to connect to the hub anymore).

It also support the AMQP over webSockets en MQTT protocol whereby the latter no protocol gateway is needed (when using Azure IoT SDKs).

For more in-depth information about the IoT hub, please see also the reference architecture
For setting up the IoT hub see our earlier blogspot: Setting up IoT hub

Summary

The IoT Hub can do the same as an Event hub, but much more. Mostly because the bi-directional communication possibility, ergo an IoT hub is 'Event Hub plus'.

So why not all use the IoT hub instead of the Event hub? Well one thing we didn’t mentioned was the pricing. With all the extra capabilities of the IoT hub the pricing is also a lot higher. Sometimes up to 40 times higher. So for simple event, like reading data from a weather station, or counting how many times a door is opening a IoT hub is not necessary.

Wednesday, 23 November 2016

Azure - Understanding Stream Analytics Blob Destination

Case

How do you use Azure Blob Storage to store stream analytics data and why would you do that?

Solution
In earlier posts we have set up an IoT environment with an IoT Hub and a couple of Stream Analytics Jobs where sensor data is sent to different destinations: Power BI for a real-time dashboard and Azure SQL Database to store the data. In this case we create a new Stream Analytics Job with Blob Storage as output and we use the sensor data as input. The Input and Output are different, but the query is the same as other Stream Analytics Jobs. Configuring the query of the Stream Analytics Job can you find here.

Reasons to use Blob Storage is the diversity of storing in the cloud of text or binary files. The unstructured data can include documents, social data (photos, videos, music and blogs), Big Data (logs, IoT and large datasets) or images and text for web applications. Click here for more pricing details about Blob Storage. You can also store (sensor) data in an Azure Data Lake. Click here for more information about this, along with the major differences between Blob Storage and Data Lake.

For the use of Azure Blob storage you need a Storage account. Then you can add containers to this account, which include the Blob files. In the case of an image, you also find a metadata file. In our case the Blob files are JSON files and contain sensor data.

Overview Azure Blob Storage with images

1) Create a Storage account
To make use of Blob Storage you have to create a Storage account first. After we give it a suitable name, we choose Blob storage and 'RA-GRS' as Replication. This is the default and it contains the most options. Click here for more information about this. Next we choose 'Hot' by Access tier, because we want access the sensor data frequently.

Azure Portal - Create a Storage account

Note:
You can also create a Storage account when setting up the Output of the Stream Analytics Job, but you have less options so it is not recommended.

2) Create the Stream Analytics Job

Before creating the new job, we had already add a new consumer group to our IoT Hub. We called it 'blob'. Using multiple consumer groups makes it possible for several consumer applications to read data from this IoT Hub independently. Click here to see where you can add/manage consumer group(s).

Now we can create the job. We choose the same Resource group as the IoT Hub and Storage account, because these are in the same life cycle. Our Location is the Netherlands, so we choose West-Europe.

Azure Portal - Create Stream Analytics Job

3) Configure the Stream Analytics Job

First we must add a new Input to the job. The default Source Type is 'Data stream'. We choose this because the sensor data is an ongoing stream and is derived from the IoT Hub. The Source is 'IoT hub' and then the IoT Hub that you have created automatically appears. If you have more then one IoT Hub, you can choose one from the drop-down list. After this you must choose the right Consumer group. This is the new group (blob) we have created earlier. Finally you choose 'JSON' as Event serialization format.

Next we add a new Output. First choose 'Blob storage' in Sink and 'Use blob storage from current subscription' as Subscription, because you have configured the storage account earlier. Otherwise you can edit the storage account settings here by choosing 'Provide blob storage settings manually'. Then you create a new container. Optionally, you can define one or more instances (subfolders) within the container. With this option you can change the date and time format so you can have multiple instances including different dates and/or time folders. This makes it more clear (just like your own local File Explorer) and you have the choice to select data from a specific day/time. We made a instance called 'sensor'. At last you will choose the 'JSON' format.

As we said, we discuss only the configuration of the Input and Output of the job in this post. After configured the job we will run the job with a valid query.

Azure Portal - Configure the Stream Analytics Job for Blob

Result

Now the data is stored, we want to see what's in our Blob. Therefore you have to go to your Azure Storage account in the portal. Every object, in our case a Blob, that you store in Azure Storage has a unique URL address. For the Blob service with the storage account name (bitoolsblobstorage) you have created and the container name (sensordata) with the instance (sensor) the URL/endpoint is:

http://bitoolsblobstorage.blob.core.windows.net/sensordata/sensor

More information about the Azure Storage endpoints here.

In the portal go to the Storage account you have made earlier. Click on the container URL and then you see the instance (subfolder). Now you can drill down further on the specific month, day and hour of the incoming sensor data.

Azure Portal - Your Blob file

Conclusion

It looks a lot like the other Stream Analytics posts, but in this case the data is stored in a Blob file. From this point you have several options to do something with this data. First you can do nothing off course and in that case you use Blob purely for storage (backup). You will find the other options in the Cortana Intelligence Suite. For example process the data with Azure Data Factory (this can also be done with traditional SSIS, which is off course not a part of the CIS), analyse the data with Machine Learning or visualize the data in Power BI.

Thursday, 17 November 2016

Extracting tweets with Azure App Service - Logic App

Case
I want to extract tweets from twitter to my data warehouse without writing code. Is that possible?

Extract #SSIS Tweets

Solution
There are of course several options with traditional ETL tools like SSIS with custom .NET code, a custom twitter tasks or even by calling the Twitter API within R. But Microsoft also recently introduced Microsoft Flow and Microsoft Azure Logic Apps which can connect to various apps and services (like Facebook, Google Drive, GitHub, Twitter, SharePoint, etc.) to collect data, synchronize files or get notifications of certain events, without writing any code! The basics of Flow are free and Logic App is the professional version in Azure with of course some more options.

Solutions:
A) Microsoft Flow
B) Microsoft Azure Logic Apps

1) Microsoft Azure Logic Apps
First go to portal.azure.com, login and create a new Logic App (located under Enterprise Integration). The name of our Logic App is called SSISTweets (you have to choose a different name) and we use West-Europe as the location of the Resource Group since we life in the Netherlands. When you click on the create button the Logic App will be created and you can start editing. You can choose between various default templates or you can create a Blank LogicApp.

Creating a new Logic App

If you have also tried Microsoft Flow, you will probably notice that there are less templates to choose from. We will choose the Blank LogicApp for this example.

2) Twitter source
In the blank LogicApp you first need to search for a trigger. In this case we want to search for tweets on twitter. The trigger is 'When a new tweet is posted'.

Blank LogicApp searching for Twitter trigger

When you select the Twitter trigger you need to setup the Twitter connection and the search text. For this example I will login with my Twitter account and search for tweets with #SSIS. You can also change the frequency and interval.

Connecting to Twitter and setting up search

3) Filtering annoying job tweets
Of course we want to filter those annoying job tweets. Therefore we need to add a condition. In the condition pane you can select fields from the Twitter 'source' to filter on. In this example we are filtering tweets that contain the word 'job' (probably not the perfect filter).

Add 'job' filter

4) Add SQL Destination
Now we need to store the tweets in a (Azure) SQL Server database. For this example we created a simple table with all nvarchar(255) fields:

CREATE TABLE [dbo].[Tweets](
    [Name] [nvarchar](255) NULL,
    [TweetText] [nvarchar](255) NULL,
    [TweetTime] [nvarchar](255) NULL
)

Add a new step and search for SQL Server - Insert row. Then enter the credentials from your database, select the table (must be an existing table) and map the Twitter fields to the appropriate table columns.

SQL Destination - Insert row

5) Save LogicApp
After the last step has been added click on Save to save your work and then on Done. Now your newly created flow is ready to run.

Saving the LogicApp

Now click on the run button and watch the table!

Run LogicApp

The result

Conclusion
Microsoft Logic App is very easy to use and you don't need programming skills. There are dozens of apps and services to connect to. If you only need one Logic App and a 15 minute refresh is enough then you could also consider using Microsoft Flow which is free of charge.

Extracting tweets with Microsoft Flow

Case
I want to extract tweets from twitter to my data warehouse without writing code. Is that possible?

Extract #SSIS Tweets

Solution
There are of course several options with traditional ETL tools like SSIS with custom .NET code, a custom twitter tasks or even by calling the Twitter API within R. But Microsoft also recently introduced Microsoft Flow and Microsoft Azure Logic Apps which can connect to various apps and services (like Facebook, Google Drive, GitHub, Twitter, SharePoint, etc.) to collect data, synchronize files or get notifications of certain events, without writing any code! The basics of Flow are free and Logic App is the professional version in Azure with of course some more options.

Solutions:
A) Microsoft Flow
B) Microsoft Azure Logic Apps

1) Microsoft Flow
First go to flow.microsoft.com and login with your Microsoft Account. You can now search for existing templates or create a flow from blank. We will start with a blank flow for this example.

Searching for Twitter templates or use the blank flow

2) Twitter source
In the blank flow you first need to search for a trigger. In this case we want to search for tweets on twitter. The trigger is 'When a new tweet is posted'.

Blank flow searching for Twitter trigger

When you select the Twitter trigger you need to setup the Twitter connection and the search text. For this example I will login with my Twitter account and search for tweets with #SSIS.

Connecting to Twitter and setting up search

Add ' job' filter

4) Add SQL Destination
Now we need to store the tweets in a (Azure) SQL Server database. For this example we created a simple table with all nvarchar(255) fields:

CREATE TABLE [dbo].[Tweets](
    [Name] [nvarchar](255) NULL,
    [TweetText] [nvarchar](255) NULL,
    [TweetTime] [nvarchar](255) NULL
)

SQL Destination - Insert row

5) Create flow
After the last step has been added click on Create flow and then on Done. Now your newly created flow is ready to run. Watch the table!

Create flow

The result

Conclusion
Microsoft Flow is very easy to use and you don't need programming skills. There are dozens of apps and services to connect to.
But to keep it for free you can only check once each 15 minutes and only do 750 runs a month. That should be enough for at least on flow, but if that's not enough you can either switch to a premium account which gives you more runs, more checks and even more services to connect to. Or you could switch to professional Microsoft Azure Logic Apps.

Tuesday, 4 October 2016

Azure - Understanding Stream Analytics Windowing Functions

Case
I want to aggregate data in a stream. How does that work in Stream Analytics?

Example data: count people in front of booth

Solution
Because it's a stream you aggregate data in a certain time window instead of the whole dataset. The Stream Analytics Query language is very similar to TSQL and it it has some extensions like the Windowing functions to aggregate data in time windows. At the moment there are three Windowing extensions (Hopping, Sliding and Tumbling), but it is not inconceivable that more windowing functions will be added in the near future.

General rules:

The length of each window is fixed
Windowing only work in combination with the GROUP BY clause
The time units can be day, hour, minute, second, millisecond or microsecond, but the maximum size of the window in all cases is 7 days.

1) Tumbling Window
The Tumbling Window is the easiest to explain. It aggregates data within a X second/minute/etc. time window and does that every X seconds/minutes/etc.

For example: Tell me the average number of visitors per booth over the last 10 seconds every 10 seconds:

SELECT    Booth, avg(HeadCount) as AvgHeadCount
FROM      HeadCountStream TIMESTAMP BY MeasurementTime
GROUP BY  Booth, TumblingWindow(second, 10)

Tumbling Window

2) Hopping Window
The Hopping Window is very similar to the Tumbling Window, but here the windows have a overlap. It aggregates data within a X second/minute/etc. time window and does that every Y seconds/minutes/etc.
For example: Tell me the average number of visitors per booth over the last 10 seconds every 5 seconds:

SELECT    Booth, avg(HeadCount) as AvgHeadCount
FROM      HeadCountStream TIMESTAMP BY MeasurementTime
GROUP BY  Booth, HoppingWindow(second, 10, 5)

Hopping Window

3) Sliding Window
The Sliding Window is the most difficult to explain. It aggregates the values in the time window every time a new event/measurement occurs or an existing event/measurement falls out of the time window.
So when using the Sliding Window you are interested in aggregating values when ever an event occurs. This is in contrast to the Hopping and Tumbling windows which have a fixed interval.

For example: Tell me the average number of visitors per booth in the last 10 seconds:

SELECT    Booth, avg(HeadCount) as AvgHeadCount
FROM      HeadCountStream TIMESTAMP BY MeasurementTime
GROUP BY  Booth, SlidingWindow(second, 10)

Sliding Window

The first aggregation occurs when the first measurement value (1) is streamed.
The second aggregation occurs when a new measurement value (3) is streamed.
The third, fourth, fifth, etc. is equal to the second aggregation because each time a new measurement value (1) is streamed.
The last aggregation occurs when no more new measurement values are streamed and the second last measurement value (2) falls out of the time window.

Sliding Window

Compared to the Hopping Window (with the same data) you only get an extra result row at the start (1) and one at the end (1) because in this example the events happen in a fixed interval. It gets more interesting when the events are coming in more randomly like tweets about a certain subject.

Timestamp by
You probably noticed the TIMESTAMP BY measurementTime clause after the FROM. This tag lets you set the exact timestamp that an event occurred, rather than the arrival time in the IoT Hub. This timestamp is used by the windowing functions.

Testing query with Windowing functions
In the old portal you can test the query and study the result (at the moment of writing, testing is not yet supported in the new portal). For this you need a json file with some messages in it. These messages should look identical like the messages you send via the IoT Hub. For this example I created a text file with the following text in it:

{"headCount":1,"measurementTime":"2016-09-17T18:25:43.511Z","sensorName":"A"}

{"headCount":3,"measurementTime":"2016-09-17T18:25:47.511Z","sensorName":"A"}

{"headCount":2,"measurementTime":"2016-09-17T18:25:53.511Z","sensorName":"A"}

{"headCount":3,"measurementTime":"2016-09-17T18:25:57.511Z","sensorName":"A"}

{"headCount":4,"measurementTime":"2016-09-17T18:26:03.511Z","sensorName":"A"}

{"headCount":2,"measurementTime":"2016-09-17T18:26:07.511Z","sensorName":"A"}

{"headCount":2,"measurementTime":"2016-09-17T18:26:13.511Z","sensorName":"A"}

{"headCount":1,"measurementTime":"2016-09-17T18:26:17.511Z","sensorName":"A"}

When you hit the test button in the query editor you need to upload a json file for testing. Then it will use that data to test your Stream Analytics query and show the result. In the pictures below you will see the test data in the upper right corner and the query result at the bottom. The red numbers show how the average was calculated.

Sliding Window

When you leave out one measurement, two rows will change in the result.

Sliding Window, leaving out 1 measurement

When you leave out two successive measurements, two rows will change in the result and one row will disappear.

Sliding Window, leaving out 2 successive measurements

Conclusion
Tumbling and hopping window are easy to understand. Sliding window is a little harder to understand, but writing the query is very easy. Using the windowing functions is something you would probably want to do in the hot path to stream to PowerBI. This way you don't get to much data in the stream.