Google Dataflow Optimization: Streaming Engine

What is Streaming Engine

"By default, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. For more information, see Streaming Engine.

The Streaming Engine model has the following benefits:

  • Reduced CPU, memory, and Persistent Disk storage resource usage on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4). It doesn't require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.

  • More responsive Horizontal Autoscaling in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.

Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge associated with the use of Streaming Engine."

How it Works

The core of the Streaming Engine is to assign a key to each message being processed and perform all processing in the context of the key. The key allows the Streaming Engine to track the state across related messages. The state is tracked, using the key, in a Key-Value persistent store. When Dataflow transforms modify the messages, these are persisted in the store. Also, by using keys, the messages can be more easily partitioned and load-balanced among workers, as each worker can process a range of the key space.

For more details, please see this article

Pricing

For streaming pipelines, the Dataflow Streaming Engine moves streaming shuffle and state processing out of the worker VMs and into the Dataflow service backend

Streaming Engine usage is billed by the volume of streaming data processed, which depends on the following:

  • The volume of data ingested into your streaming pipeline

  • The complexity of the pipeline

  • The number of pipeline stages with shuffle operation or with stateful DoFns

Examples of what counts as a byte processed include the following items:

  • Input flows from data sources

  • Flows of data from one fused pipeline stage to another fused stage

  • Flows of data persisted in user-defined state or used for windowing

  • Output messages to data sinks, such as to Pub/Sub or BigQuery

Please see this Google Cloud Platform pricing page for the latest information on costs per region.

How to Use Streaming Engine

All you need to do is add the following CLI parameter when you are deploying your Dataflow pipeline. --enable_streaming_engine . Then, your dataflow streaming job will start to use the streaming engine for processing!

  1. https://cloud.google.com/dataflow/docs/streaming-engine

  2. https://medium.com/google-cloud/streaming-engine-execution-model-1eb2eef69a8e