Types of Data Engineering Architecture: Part 2
In Part 1, we talked about the main types of data architectures.
In the early to mid-2010s, the popularity of working with streaming data exploded with the emergence of Kafka as a highly scalable message queue. Data engineers needed to figure out how to reconcile batch and streaming data into a single architecture.
The Dataflow Model and Unified Batch and Streaming
To solve this, Google developed the Dataflow model and the Apache Beam framework that implements this model. The core idea in the Dataflow model is to view all data as events, with aggregation performed over various types of windows.
Ongoing real-time event streams are unbounded data. Data batches are simply bounded event streams, and the boundaries provide a natural window. Engineers can choose from various windows for real-time aggregation, such as sliding or tumbling. Real-time and batch processing happen in the same system using nearly identical code.
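To make the windowing idea concrete, here is a minimal sketch in plain Python (not Apache Beam code). The `Event` type, the function names, and the use of seconds as the time unit are all assumptions made for illustration. It shows a tumbling window, a sliding window, and a bounded batch treated as one window.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: int  # event time in seconds (hypothetical unit)
    value: float


def tumbling_sums(events: Iterable[Event], size: int) -> Dict[int, float]:
    """Sum values in fixed, non-overlapping windows of `size` seconds."""
    sums: Dict[int, float] = defaultdict(float)
    for e in events:
        window_start = (e.timestamp // size) * size
        sums[window_start] += e.value
    return dict(sums)


def sliding_sums(events: Iterable[Event], size: int, step: int) -> Dict[int, float]:
    """Sum values in overlapping windows of `size` seconds that start every `step` seconds."""
    sums: Dict[int, float] = defaultdict(float)
    for e in events:
        # Walk back from the latest window start at or before the event
        # through every window [start, start + size) that still contains it.
        start = (e.timestamp // step) * step
        while start > e.timestamp - size and start >= 0:
            sums[start] += e.value
            start -= step
    return dict(sums)


def batch_sum(events: Iterable[Event]) -> float:
    """A bounded batch is one window whose boundaries are the batch itself."""
    return sum(e.value for e in events)


events = [Event(1, 1.0), Event(4, 2.0), Event(7, 3.0), Event(12, 4.0)]
print(tumbling_sums(events, size=5))
print(sliding_sums(events, size=10, step=5))
print(batch_sum(events))
```

The same list of events feeds all three functions, and only the window definition changes. That is the sense in which batch and streaming can share nearly identical code.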
The philosophy of “batch as a special case of streaming” is now more pervasive. Various frameworks such as Flink and Spark have adopted a similar approach.
Architecture for IoT
The Internet of Things (IoT) is the distributed collection of devices: computers, sensors, mobile devices, smart home devices, and anything else with an internet connection. IoT data is generated by devices that collect data periodically or continuously from the surrounding environment and transmit it to a destination. The smartphone revolution created a massive IoT swarm, which now includes devices such as smart thermostats, car entertainment systems, smart TVs, and smart speakers.
Devices
IoT is one of the major areas where a data engineer is needed, because you have to work with large volumes of data arriving from many different devices. You don’t necessarily need to know the inner details of IoT devices, but you should know what the device does, the data it collects, any edge computations or ML it runs before transmitting the data, and how often it sends data.
IoT Gateway
An IoT gateway is a hub for connecting devices and securely routing their traffic to the appropriate destinations on the internet. While you can connect a device directly to the internet without an IoT gateway, the gateway allows devices to connect using extremely little power. It acts as a way station for data retention and manages an internet connection to the final data destination.
Typically, a swarm of devices will utilize many IoT gateways, one at each physical location where devices are present.
Ingestion
Ingestion begins with an IoT gateway. From there, events and measurements can flow into an event ingestion architecture. Of course, other patterns are possible. For instance, the gateway may accumulate data and upload it in batches for later analytics processing.
In remote physical environments, gateways may not have connectivity to a network much of the time. They may upload all data only when they are brought into the range of a cellular or WiFi network. The point is that the diversity of IoT systems and environments presents complications that engineers must account for in their architectures and downstream analytics.
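The "accumulate, then upload when connected" pattern can be sketched as a small store-and-forward buffer. This is an illustration under assumptions, not a real gateway SDK: the `GatewayBuffer` class, its `upload` callback, and the `batch_size` and `capacity` defaults are all hypothetical.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, Dict, List

Reading = Dict[str, float]


class GatewayBuffer:
    """Hypothetical store-and-forward buffer for an IoT gateway."""

    def __init__(
        self,
        upload: Callable[[List[Reading]], None],
        batch_size: int = 100,
        capacity: int = 10_000,
    ) -> None:
        self._upload = upload
        self._batch_size = batch_size
        # Bounded buffer: once full, the oldest readings are dropped first.
        self._pending: Deque[Reading] = deque(maxlen=capacity)

    def record(self, reading: Reading) -> None:
        """Store a reading locally, whether or not a network is available."""
        self._pending.append(reading)

    def flush(self, connected: bool) -> int:
        """Upload pending readings in batches; return how many were sent."""
        sent = 0
        while connected and self._pending:
            # Copy the batch first and remove it only after the upload
            # call returns, so a failed upload does not lose readings.
            batch = list(islice(self._pending, self._batch_size))
            self._upload(batch)
            for _ in batch:
                self._pending.popleft()
            sent += len(batch)
        return sent


uploaded: List[List[Reading]] = []
gateway = GatewayBuffer(upload=uploaded.append, batch_size=2)
for i in range(5):
    gateway.record({"temp_c": float(i)})

gateway.flush(connected=False)  # out of range: nothing is sent
gateway.flush(connected=True)   # back in range: the backlog is uploaded
```

A real gateway would also need retries, durable storage, and authentication. The sketch only shows why downstream systems should expect late, bursty batches rather than a steady stream.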
Storage
Storage requirements will depend a great deal on the latency requirement for the IoT devices in the system.
For example, for remote sensors collecting scientific data for analysis at a later time, batch object storage may be perfectly acceptable.
However, near real-time responses may be expected from a system backend that constantly analyzes data in a home monitoring and automation solution. In this case, a message queue or time-series database is more appropriate.
Serving
Serving patterns are incredibly diverse. In a batch scientific application, data might be analyzed using a cloud data warehouse and then served in a report. In a home monitoring application, data will be presented and served in numerous ways.
Data will be analyzed in near real time using a stream-processing engine or queries in a time-series database to look for critical events such as a fire, electrical outage, or break-in. Detection of an anomaly will trigger alerts to the homeowner, the fire department, or another entity. A batch analytics component also exists, for example, a monthly report on the state of the home.
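As a rough illustration of the near real-time path, the sketch below scans a stream of sensor events and yields an alert whenever a rule fires. The sensor names, thresholds, and recipients in `RULES` are invented for the example. A production system would run this logic in a stream-processing engine or as time-series queries rather than a Python loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class SensorEvent:
    sensor: str   # hypothetical sensor names: "smoke", "power", "door"
    value: float


@dataclass(frozen=True)
class Alert:
    kind: str
    recipients: Tuple[str, ...]


# Hypothetical rules: sensor -> (is_critical predicate, alert kind, recipients).
RULES: Dict[str, Tuple[Callable[[float], bool], str, Tuple[str, ...]]] = {
    "smoke": (lambda v: v > 0.1, "fire", ("homeowner", "fire department")),
    "power": (lambda v: v == 0.0, "electrical outage", ("homeowner",)),
    "door": (lambda v: v == 1.0, "break-in", ("homeowner",)),
}


def detect(events: Iterable[SensorEvent]) -> Iterator[Alert]:
    """Yield an alert for each event that matches a critical-event rule."""
    for event in events:
        rule = RULES.get(event.sensor)
        if rule is None:
            continue  # no rule for this sensor; it only feeds batch reports
        is_critical, kind, recipients = rule
        if is_critical(event.value):
            yield Alert(kind, recipients)


stream = [
    SensorEvent("smoke", 0.02),
    SensorEvent("smoke", 0.4),
    SensorEvent("power", 0.0),
    SensorEvent("humidity", 55.0),
]
for alert in detect(stream):
    print(alert.kind, "->", ", ".join(alert.recipients))
```

Because `detect` is a generator, it works the same way on a finite list and on an unbounded iterator of incoming events.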
The following figure shows one significant serving pattern for IoT:
Data Mesh
The data mesh is a recent response to sprawling monolithic data platforms, such as centralized data lakes and data warehouses. The data mesh attempts to invert the challenges of centralized data architecture, taking the concepts of domain-driven design (commonly used in software architectures) and applying them to data architecture.
Because the data mesh has captured much recent attention, you should be aware of it. A big part of the data mesh is decentralization.
There are four key components of a data mesh:
- Domain-oriented decentralized data ownership and architecture
- Data as a product
- Self-serve data infrastructure as a platform
- Federated computational governance
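Two of these components, data as a product and federated computational governance, can be sketched in a few lines. Everything here is hypothetical: the `DataProduct` fields, the two policies, and the example products are assumptions chosen for illustration and are not part of any data mesh standard.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    """A dataset published by a domain team and treated as a product."""
    name: str
    domain: str               # the domain that owns and maintains it
    owner: str                # accountable team or contact
    schema: Dict[str, str]    # column name -> type
    pii_columns: List[str] = field(default_factory=list)


def has_owner(product: DataProduct) -> bool:
    return bool(product.owner)


def pii_columns_in_schema(product: DataProduct) -> bool:
    return all(col in product.schema for col in product.pii_columns)


# Federated computational governance: policies are agreed on globally,
# then checked automatically against every domain's products.
POLICIES: Dict[str, Callable[[DataProduct], bool]] = {
    "has_owner": has_owner,
    "pii_columns_in_schema": pii_columns_in_schema,
}


def violations(product: DataProduct) -> List[str]:
    """Return the names of the global policies this product fails."""
    return [name for name, check in POLICIES.items() if not check(product)]


orders = DataProduct(
    name="orders_daily",
    domain="sales",
    owner="sales-data-team",
    schema={"order_id": "string", "amount": "decimal"},
)
clicks = DataProduct(
    name="clicks",
    domain="marketing",
    owner="",
    schema={"url": "string"},
    pii_columns=["email"],
)
for product in (orders, clicks):
    print(product.domain, product.name, violations(product))
```

Each domain keeps ownership of its own products, while the shared policy checks give the organization a common baseline without a central team building every dataset.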
The following figure shows a simplified version of a data mesh architecture, with three domains interoperating:
That’s it! Thanks for reading 🎉