Researchers Uncover “Dataflow Rider” Attack Exploiting Google Cloud Pipelines to Hijack Data Processing Jobs

Cloud data processing platforms have become foundational in modern data engineering workflows, enabling scalable ETL, real-time streaming, and analytics pipelines. However, recent research highlights a subtle yet powerful attack technique against Google Cloud Dataflow that exploits weak assumptions in how pipeline components are stored and accessed. Dubbed Dataflow Rider, this method enables sophisticated compromise of data pipelines — with implications far beyond simple misconfiguration.


What Is Dataflow Rider?

Dataflow Rider is a novel attack technique identified by Varonis Threat Labs that allows malicious actors to hijack Google Cloud Dataflow jobs by modifying key pipeline artifacts residing in Google Cloud Storage (GCS) buckets. At its core, the attack abuses permissive bucket access to replace or tamper with Dataflow templates or user-defined functions (UDFs), enabling arbitrary code execution during job runs — all without breaking the pipeline logic.

Dataflow Rider belongs to a broader category of attacks that manipulate “shadow resources” — objects and configurations not validated by the data processing platform — to subvert the intended flow and inject malicious behavior.


How Google Cloud Dataflow Works (Quick Recap)

Google Cloud Dataflow is a fully managed service for batch and streaming data processing, built on Apache Beam. Operators define a pipeline — a sequence of transformations — and Dataflow orchestrates execution across worker nodes. Key components include:

  • Templates — YAML or JSON files describing job configuration.
  • Launchers — temporary Compute Engine instances that initiate jobs.
  • Workers — ephemeral compute instances that execute the actual processing logic.

These components rely heavily on files stored in GCS. Critically, Dataflow does not validate the integrity of these files before execution, assuming they are trustworthy and unmodified.
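
Since the platform performs no integrity check of its own, an operator could add one before each launch. The following is a minimal sketch, assuming the team records a known-good SHA-256 digest for every artifact when it is approved. The function names `sha256_of` and `verify_artifact`, the sample template bytes, and the pinned-digest workflow are illustrative assumptions, not part of Dataflow.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_digest: str) -> bool:
    """Return True only if the artifact matches the digest pinned at approval time."""
    return hmac.compare_digest(sha256_of(data), pinned_digest.lower())


# Hypothetical template content, as recorded when the pipeline was approved.
approved = b'{"jobName": "csv-to-bq"}'
pinned = sha256_of(approved)

# Simulated tampering: any change to the bytes changes the digest.
tampered = approved.replace(b"csv-to-bq", b"csv-to-bq-x")

if not verify_artifact(tampered, pinned):
    print("artifact digest mismatch: refusing to launch")
```

In practice the pinned digests would need to live somewhere the attacker cannot write, for example outside the artifact bucket itself.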


Threat Model: How Attackers Leverage Dataflow Rider

For an attacker to execute Dataflow Rider, they must gain write access to the GCS buckets containing Dataflow templates or UDFs. In many corporate environments, overly broad bucket permissions or compromised service account credentials can make this trivial.

With write access to the bucket, an attacker can:

  1. Download a pipeline template or UDF.
  2. Inject malicious logic, such as code that exfiltrates credentials or sensitive data.
  3. Overwrite the original artifact without triggering errors — Dataflow blindly executes the altered content.

During the next job execution, the injected code runs on worker nodes alongside the legitimate logic, with no visible failure.

Common malicious outcomes include:

  • Credential theft — harvesting service account tokens from metadata endpoints.
  • Data exfiltration — sending processed data or access tokens to an attacker-controlled endpoint.
  • Business logic manipulation — altering data flows, transformations, or results.
  • Lateral movement and persistence — using stolen credentials to escalate privileges.

Example: UDF Injection for Credential Capture

In one demonstration of Dataflow Rider, an attacker targets a Python UDF used in a Dataflow pipeline that reads CSV data and writes to BigQuery. By modifying the UDF, the attacker embeds code to:

  • Retrieve the service account token from the Dataflow worker’s metadata API.
  • Exfiltrate that token to an external system.
  • Operate stealthily by ensuring injected code runs only once per worker.

This pattern — leveraging legitimate pipeline behavior to execute unauthorized logic — exemplifies how Dataflow Rider operates under the radar.
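
From the defender's side, one cheap check is to scan UDF source for references to the metadata server before a job is launched. The sketch below is a heuristic only, assuming the UDF is plain, unobfuscated Python text. The pattern list, the function name `find_indicators`, and the sample UDFs are illustrative assumptions, and a determined attacker could evade string matching.

```python
import re

# Heuristic indicators: the metadata server hostname, its link-local IP,
# and the metadata API path prefix.
SUSPICIOUS_PATTERNS = [
    re.compile(r"metadata\.google\.internal"),
    re.compile(r"169\.254\.169\.254"),
    re.compile(r"computeMetadata/v1"),
]


def find_indicators(udf_source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs matching any indicator."""
    hits = []
    for number, line in enumerate(udf_source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# Hypothetical UDFs: one clean, one with a metadata server reference added.
clean_udf = "def transform(row):\n    return row.upper()\n"
tampered_udf = (
    "def transform(row):\n"
    "    url = 'http://metadata.google.internal/computeMetadata/v1/'\n"
    "    return row.upper()\n"
)

for number, line in find_indicators(tampered_udf):
    print(f"line {number}: {line}")
```

A check like this complements, rather than replaces, access control on the bucket, since it only catches the most direct form of the pattern.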


Mitigation Strategies

Because Google has classified this behavior as “intended” (i.e., not a platform vulnerability), the responsibility to manage risk falls to cloud security and operations teams. Effective mitigations include:

1. Restrict GCS Bucket Permissions

Grant only the minimum access required to buckets that host Dataflow templates or UDFs. Avoid broad read/write permissions for service accounts or user groups.
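
One way to review this is to list every principal that holds a write-capable role on an artifact bucket and compare it with an expected set. The sketch below assumes the bucket's IAM policy has been exported as JSON in the standard shape (`bindings`, each with a `role` and `members`). The role list is an assumed, non-exhaustive selection, and the principals and function names are hypothetical.

```python
# Predefined Cloud Storage roles that grant object write permissions
# (assumed, non-exhaustive; custom roles are not covered).
WRITE_ROLES = {
    "roles/storage.admin",
    "roles/storage.objectAdmin",
    "roles/storage.objectCreator",
    "roles/storage.objectUser",
    "roles/storage.legacyBucketWriter",
    "roles/storage.legacyBucketOwner",
}


def writers(policy: dict) -> dict[str, set[str]]:
    """Map each principal to the write-capable roles it holds."""
    result: dict[str, set[str]] = {}
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role in WRITE_ROLES:
            for member in binding.get("members", []):
                result.setdefault(member, set()).add(role)
    return result


def unexpected_writers(policy: dict, allowed: set[str]) -> list[str]:
    """Return principals with write access that are not in the allowed set."""
    return sorted(set(writers(policy)) - allowed)


# Hypothetical policy for an artifact bucket.
policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer",
         "members": ["serviceAccount:worker@example-project.iam.gserviceaccount.com"]},
        {"role": "roles/storage.objectAdmin",
         "members": ["serviceAccount:deployer@example-project.iam.gserviceaccount.com",
                     "group:data-devs@example.com"]},
    ]
}
allowed = {"serviceAccount:deployer@example-project.iam.gserviceaccount.com"}
print(unexpected_writers(policy, allowed))
```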

2. Isolate Pipeline Artifacts

Store pipeline components — especially templates and custom code — in dedicated, restricted buckets separate from raw data buckets or development storage.
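
This separation can also be checked before launch by confirming that every artifact a job references lives in an approved bucket. The following is a minimal sketch, assuming artifacts are referenced by `gs://` URIs. The bucket names, the allowlist, and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of dedicated, restricted artifact buckets.
APPROVED_ARTIFACT_BUCKETS = {"example-dataflow-artifacts"}


def bucket_of(gcs_uri: str) -> str:
    """Extract the bucket name from a gs://bucket/object URI."""
    parsed = urlparse(gcs_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {gcs_uri!r}")
    return parsed.netloc


def unapproved_artifacts(artifact_uris: list[str]) -> list[str]:
    """Return the URIs whose bucket is not on the allowlist."""
    return [
        uri for uri in artifact_uris
        if bucket_of(uri) not in APPROVED_ARTIFACT_BUCKETS
    ]


# Hypothetical job: the template is isolated, but the UDF sits in a raw data bucket.
job_artifacts = [
    "gs://example-dataflow-artifacts/templates/job.json",
    "gs://example-raw-data/udfs/transform.py",
]
print(unapproved_artifacts(job_artifacts))
```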

3. Monitor and Alert on Bucket Changes

Set up logging and alerts for write operations on pipeline artifact buckets. An unauthorized object update should trigger immediate investigation.
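
As a rough illustration of such an alert rule, the sketch below filters exported audit log entries for object writes to an artifact bucket made by principals outside an expected set. It assumes Data Access audit logging is enabled for Cloud Storage and that entries carry `protoPayload.methodName`, `protoPayload.resourceName`, and `protoPayload.authenticationInfo.principalEmail`. The bucket name, principals, sample entries, and function name are hypothetical.

```python
# Audit log method names treated as object writes (assumed list).
WRITE_METHODS = {
    "storage.objects.create",
    "storage.objects.update",
    "storage.objects.patch",
    "storage.objects.delete",
}


def unauthorized_writes(entries, bucket, allowed_principals):
    """Return (principal, method, resource) for writes by unexpected principals."""
    prefix = f"projects/_/buckets/{bucket}/"
    alerts = []
    for entry in entries:
        payload = entry.get("protoPayload", {})
        method = payload.get("methodName")
        resource = payload.get("resourceName", "")
        if method not in WRITE_METHODS or not resource.startswith(prefix):
            continue
        auth = payload.get("authenticationInfo", {})
        principal = auth.get("principalEmail", "unknown")
        if principal not in allowed_principals:
            alerts.append((principal, method, resource))
    return alerts


def _entry(method, principal):
    """Build a hypothetical log entry for the sample UDF object."""
    return {"protoPayload": {
        "methodName": method,
        "resourceName": "projects/_/buckets/example-dataflow-artifacts/objects/udfs/transform.py",
        "authenticationInfo": {"principalEmail": principal},
    }}


deployer = "deployer@example-project.iam.gserviceaccount.com"
sample_entries = [
    _entry("storage.objects.create", "intruder@example.com"),  # should alert
    _entry("storage.objects.get", "intruder@example.com"),     # read, ignored
    _entry("storage.objects.create", deployer),                # expected writer
]
alerts = unauthorized_writes(sample_entries, "example-dataflow-artifacts", {deployer})
print(alerts)
```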

4. Use Organization Policy Constraints

Where possible, enforce organization-level guardrails that restrict who can create, update, or delete objects in sensitive buckets, combining Google Cloud IAM with VPC Service Controls (VPC-SC).


Why This Matters

Dataflow Rider is a reminder that data pipelines are part of the attack surface. Modern cloud workflows implicitly trust the contents of object storage, but that trust must be earned and enforced. Without proper guardrails, attackers can weaponize those assumptions to compromise data integrity, steal credentials, and infiltrate the wider cloud environment.


Conclusion

As cloud-native data processing becomes more central to enterprise analytics and AI workloads, techniques like Dataflow Rider highlight a growing class of “shadow resource” attacks. Organizations must rethink how they secure not just compute and networks, but the underlying artifacts that drive data flows. Properly configured IAM, vigilant monitoring, and architectural separation are essential defenses against this class of threat.