Down the Hatch: Ingesting Data into Big Data and IoT Solutions in Azure

By Brian Blanchard

Data warehouses deployed to Microsoft Azure can draw on several native data ingestion mechanisms, each with an element already deployed to Azure. Additional tools from outside Azure may also be considered, such as SQL Server Integration Services (SSIS), Informatica and Talend, among others. However, the following tools are native to Azure and allow for centralized management of the data ingest process:

Azure IoT Hub

Azure IoT Hub is designed for high-velocity, low-volume data sources. Generally, this consumption model is the ideal solution when a large number of data publishers needs to continuously push data into the cloud. It’s built on an evolved pub/sub model similar to Azure Event Hubs, but adds wrappers and tools better suited to a production environment. In this model, a device or field gateway aggregates local data in the field and publishes it to an Azure IoT Hub. Additional tools and requirements to consider include IoT Hub device management, dead-letter handling, data jitter, and data packet management.
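To make the publish model concrete, here is a minimal sketch of how a device might wrap sensor readings in a JSON envelope before sending them to the hub. The envelope fields (deviceId, timestamp, body) are illustrative, not a required IoT Hub schema; the hub treats the payload as opaque.

```python
import json
from datetime import datetime, timezone

def build_telemetry(device_id, readings):
    """Wrap raw sensor readings in a JSON envelope for publishing.

    Field names here are illustrative assumptions, not a mandated
    IoT Hub message format.
    """
    message = {
        "deviceId": device_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "body": readings,
    }
    return json.dumps(message)

# A hypothetical pump publishing temperature and RPM readings:
payload = build_telemetry("pump-07", {"temperatureC": 71.4, "rpm": 1180})
```

In a real solution, this JSON string would be handed to the device SDK for transmission over AMQP or HTTPS.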

Azure Event Hubs

Azure Event Hubs is very similar to Azure IoT Hub in that it operates on a similar pub/sub model, but with fewer full-scale, production-ready operational features. Azure Event Hubs is best used when a small number of publishers produces a similarly high-velocity stream of data. Both solutions accept JavaScript Object Notation (JSON) payloads over Advanced Message Queuing Protocol (AMQP) or HTTPS. The primary difference is the number of publishers; any solution that connects more than ten or so devices is better ingested via an IoT Hub.
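The publisher-count rule of thumb above can be captured in a tiny helper. The ten-device threshold comes straight from the guidance in this section; treat it as a starting point, not a hard limit.

```python
def recommend_ingestion(publisher_count: int) -> str:
    """Rough rule of thumb: a handful of publishers fits Event Hubs;
    beyond roughly ten devices, prefer IoT Hub for its device
    management and production tooling. The cutoff is a judgment
    call, not an Azure-enforced limit."""
    return "Azure IoT Hub" if publisher_count > 10 else "Azure Event Hubs"

# A few upstream services -> Event Hubs; a fleet of sensors -> IoT Hub.
small_fleet = recommend_ingestion(3)
large_fleet = recommend_ingestion(500)
```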

Azure Import/Export

Large volumes of slow-moving data are generally good candidates for Azure Import/Export. When a large volume of data needs to be ingested but the available bandwidth cannot accommodate the transfer, this solution should be considered before increasing bandwidth. Azure Import/Export allows for the shipping of prepared hard disks to an Azure processing center. Once received, the disks’ contents are copied to Azure Blob storage for further processing.
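A quick back-of-the-envelope calculation makes the trade-off tangible: estimate how long a network transfer would take, then compare that against the turnaround time of shipping disks. The utilization factor and example figures below are assumptions for illustration.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float,
                  utilization: float = 0.8) -> float:
    """Estimate wall-clock days to push `data_tb` terabytes over a
    link of `bandwidth_mbps` megabits/second, assuming only
    `utilization` of the link is usable (decimal units throughout)."""
    bits = data_tb * 1e12 * 8                       # terabytes -> bits
    seconds = bits / (bandwidth_mbps * 1e6 * utilization)
    return seconds / 86400                          # seconds -> days

# e.g. 50 TB over a 100 Mbps link at 80% utilization:
days = transfer_days(50, 100)                       # roughly two months
```

If the estimate dwarfs the round-trip time of shipping physical disks, Import/Export is worth a serious look.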

Azure Data Factory

Azure Data Factory (ADF) can be thought of as a hybrid counterpart to SSIS. While not as feature-rich, it shares some similarities. ADF is a data pipeline management tool that allows for the time-based consumption of data from several on-premises data structures into an “Extract, Load, Transform” (ELT) process. ELT is a newer way of looking at legacy-style “Extract, Transform, Load” (ETL) processes. When triggered, a data pipeline ingresses data (extract), then deposits the extracted data into Blob storage for transient in-process staging (load). It then transforms the data to be consumed by one or more recipient applications or data structures (transform). In most data pipelines, the extracted data may undergo a number of load/transform cycles as the pipeline merges data sources to create the final transformations.
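The ELT flow described above can be sketched in miniature. Here a plain dictionary stands in for Blob storage and three functions stand in for pipeline activities; all names and the record shapes are illustrative, not ADF APIs.

```python
def extract(source):
    """Ingress raw records from an upstream data structure."""
    return list(source)

def load(staging, name, records):
    """Deposit extracted records into transient in-process staging
    (a dict here; Blob storage in the real pipeline)."""
    staging[name] = records

def transform(staging, name):
    """Shape staged records for the recipient application."""
    return [{"reading": r["value"], "unit": "C"} for r in staging[name]]

# Extract -> load -> transform, in that order:
staging = {}
raw = extract([{"value": 21.5}, {"value": 22.0}])
load(staging, "sensor_feed", raw)
result = transform(staging, "sensor_feed")
```

The point of the ordering is that raw data lands in cheap staging first, so multiple transforms can be run against it later without re-extracting from the source.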

HDInsight Sqoop

Sqoop is also an ELT tool: it extracts and loads data, leaving the transformation to be completed by the receiving node. Sqoop is designed to bridge the ELT gap between structured and unstructured data. It is the ideal solution when more granular control over the execution of an ELT process is required than ADF allows. Sqoop’s key advantage is its native open-source connectors for Oracle, MySQL, SQL Server and most flavors of Hadoop.

Connecting Devices to Stream-Based Data Ingestion tools

Azure IoT Hub and Azure Event Hubs each focus on the ingestion of high-velocity data publishers. Usually this means devices in the field, which typically generate telemetry data based on the use of the device or factors in its environment. This type of interaction is commonly referred to as the Internet of Things (IoT).

In the most commonly discussed scenario, devices are connected and contain a level of intelligence that allows them to communicate directly with an IoT Hub. However, this is not the only way to deploy an IoT solution.

For a device to connect directly to the IoT Hub, it needs sufficient compute and bandwidth to communicate with the hub. This means executing software development kit (SDK) and application programming interface (API) logic to register the device, as well as regularly transmitting JSON data over the secure, registered AMQP/HTTPS connection. When a device lacks compute power or consistent bandwidth, the following approaches may be more appropriate:

Field Gateway

A field gateway is a software/hardware interface that sits between the IoT Hub and device(s). A field gateway provides the necessary compute and logic capabilities not found in many pre-IoT devices.

LAN Aggregate/LAN Repeater

Much like an edge device in a traditional network, a LAN Aggregate/LAN Repeater is a specialized version of a field gateway that connects to devices on a Local Area Network. It aggregates data from the local devices and then broadcasts it to an IoT Hub for ingress into the data warehouse solution. The primary difference between a LAN Aggregate and a LAN Repeater is packet manipulation. A LAN Repeater simply transfers the raw JSON payloads directly to the IoT Hub, whereas a LAN Aggregate consolidates (or aggregates) the payloads to create a less frequent and often smaller package of telemetry data, reducing bandwidth consumption.
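A minimal sketch of the aggregate behavior might look like this: per-device readings are rolled up into one summary packet per device, trading granularity for fewer, smaller transmissions. The packet fields are assumptions for illustration; a repeater would simply forward the input list unchanged.

```python
def aggregate(packets):
    """Consolidate raw per-device telemetry packets into one summary
    packet per device (mean temperature over the window)."""
    summary = {}
    for p in packets:
        s = summary.setdefault(
            p["deviceId"],
            {"deviceId": p["deviceId"], "count": 0, "total": 0.0},
        )
        s["count"] += 1
        s["total"] += p["temperatureC"]
    return [
        {
            "deviceId": s["deviceId"],
            "samples": s["count"],
            "meanTemperatureC": s["total"] / s["count"],
        }
        for s in summary.values()
    ]

packets = [
    {"deviceId": "a", "temperatureC": 20.0},
    {"deviceId": "a", "temperatureC": 22.0},
    {"deviceId": "b", "temperatureC": 30.0},
]
rollup = aggregate(packets)   # three raw packets become two summaries
```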

Message Queueing Gateway

Mobile devices in the field often need a means of aggregating and replaying data during periods of disconnected operation. A Message Queueing Gateway fills this role by storing telemetry data in a local data structure and replaying it in the order it was received once network connectivity resumes.
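The store-and-forward pattern is straightforward to sketch. Here a deque buffers messages while the gateway is offline and replays them first-in, first-out on reconnect; the `publish` callable stands in for the actual IoT Hub client, and all names are illustrative.

```python
from collections import deque

class QueueingGateway:
    """Store-and-forward sketch: buffer telemetry while offline,
    replay it in arrival order once connectivity resumes."""

    def __init__(self, publish):
        self._publish = publish      # stand-in for the IoT Hub client
        self._buffer = deque()
        self.online = False

    def send(self, message):
        if self.online:
            self._publish(message)
        else:
            self._buffer.append(message)   # persist locally while offline

    def reconnect(self):
        self.online = True
        while self._buffer:                # replay in FIFO order
            self._publish(self._buffer.popleft())

sent = []
gw = QueueingGateway(sent.append)
gw.send("t1")      # offline: buffered
gw.send("t2")      # offline: buffered
gw.reconnect()     # replayed in order
gw.send("t3")      # online: sent immediately
```

A production gateway would persist the buffer to durable storage so queued telemetry survives a reboot, but the ordering guarantee is the essential behavior.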

Mesh Network Gateway

Similar to a LAN Aggregate/Repeater, a Mesh Network Gateway communicates with local devices that may be connected to one another but not exposed on a traditional network. This type of gateway commonly includes means of consuming telemetry data over BLE (Bluetooth Low Energy), ZigBee, Wi-Fi, LAN or UART (Universal Asynchronous Receiver/Transmitter, or serial port). It then acts as a gateway to publish that data to an IoT Hub.

Historian Gateway

A Historian Gateway is common in most manufacturing and established brick-and-mortar operations. It’s a centralized, logical controller that maintains state and telemetry data for various pre-IoT devices in a physical location. A Historian Gateway consumes telemetry data from the Historian, bypassing the devices altogether, and communicates with the IoT Hub on the Historian’s behalf.

Command & Control Gateway

IoT Hub allows for the ingestion of data, as described above, but it also allows for the Command & Control of local devices. In some scenarios, individual devices are more than capable of communicating with an IoT Hub yet still require a local gateway for environmental Command & Control. In this case, a gateway focuses on receiving data from an IoT Hub, which it translates into commands that span multiple devices to reshape aspects of the device ecosystem. For example, suppose a conveyor belt, a robotic arm and a hopper all communicate directly with IoT Hub. In most cases, this is sufficient. However, if the robotic arm reaches an ambient temperature of 150 degrees, production must be halted, which means a synchronized command must be sent to every device at the same time to halt each device’s operations. This is often best executed by a Command & Control Gateway.
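The conveyor/arm/hopper example above reduces to a simple fan-out: when a threshold is crossed, the gateway issues the same command to every device in the cell at once. The device names, command string and threshold below are taken from the example; the function itself is an illustrative sketch, not an IoT Hub API.

```python
def broadcast_halt(arm_temperature_c, devices, threshold_c=150):
    """If the robotic arm's ambient temperature reaches the threshold,
    issue a synchronized 'halt' command to every device in the cell.
    Returns an empty dict (no commands) while conditions are normal."""
    if arm_temperature_c < threshold_c:
        return {}
    return {device: "halt" for device in devices}

cell = ["conveyor", "robotic_arm", "hopper"]
commands = broadcast_halt(152, cell)   # over threshold: halt everything
```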

Gateway Best Practice Considerations

Sometimes connected devices alone are not enough. IoT gateways of various types may be required to create a holistic IoT solution. These approaches generally address bandwidth, compute distribution, jitter (simultaneous load spikes), network translation, protocol translation, and security and registration, among other things.

Prior to implementing a gateway, it is important to consider its impact on the following:

  • Velocity of data transmission
  • Bandwidth requirements
  • Compute requirements
  • Per device cost/profitability
  • Payload contents/format
  • Device registration
  • Device security
  • Communication protocols
  • Command requirements

In addition to these best practices, consideration should be given to the cost of ingestion and the impact of cloud-based stream-processing solutions.

Before you deploy a data warehouse, 10th Magnitude’s Big Data and Azure IoT Rapid Deployment programs can help you evaluate the gateway and ingestion options for your data solution.

Any questions? Feel free to email me and I can help you get started.

Also, to stay up-to-date on the latest cloud trends, follow 10th Magnitude on Twitter.

As always, keep calm…and cloud on.