Dagster & Spark
Running Spark code often requires submitting it to a Databricks or EMR cluster. The PySpark integration provides a `Spark` class with methods for configuring and constructing the `spark-submit` command for a Spark job.
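To illustrate the idea of turning configuration into a `spark-submit` invocation, here is a minimal sketch. This is not the dagster-spark implementation; the function name and config keys are illustrative, though the Spark configuration properties themselves (`spark.executor.memory`, `spark.executor.cores`) are standard.

```python
# Illustrative sketch: flattening a Spark configuration dict into a
# spark-submit command line, similar in spirit to what the integration does.
def build_spark_submit_command(main_file, spark_conf, deploy_mode="cluster"):
    """Construct a spark-submit invocation from a config dict."""
    command = ["spark-submit", "--deploy-mode", deploy_mode]
    # Sort keys so the generated command is deterministic.
    for key, value in sorted(spark_conf.items()):
        command += ["--conf", f"{key}={value}"]
    command.append(main_file)
    return command

cmd = build_spark_submit_command(
    "s3://my-bucket/jobs/etl.py",
    {"spark.executor.memory": "4g", "spark.executor.cores": "2"},
)
# e.g. ['spark-submit', '--deploy-mode', 'cluster',
#       '--conf', 'spark.executor.cores=2',
#       '--conf', 'spark.executor.memory=4g',
#       's3://my-bucket/jobs/etl.py']
```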
This page is focused on using Pipes with specific Spark providers, such as AWS EMR or Databricks. Our guide about Building pipelines with PySpark provides more information on using Dagster Pipes to launch & monitor general PySpark jobs.
About Apache Spark
Apache Spark is an open source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. It also provides libraries for graph computation, SQL for structured data processing, machine learning, and data science.
Using Dagster Pipes to run Spark jobs
Dagster Pipes is our toolkit for orchestrating remote compute from Dagster. It allows you to run code outside of the Dagster process and stream logs and events back to Dagster. This is the recommended approach for running Spark jobs.
With Pipes, the code inside the asset or op definition submits a Spark job to an external system like Databricks or AWS EMR, usually pointing to a jar or zip of Python files that contain the actual Spark data transformations and actions.
You can either use one of the available Pipes clients for popular Spark providers, such as Databricks and AWS EMR, or build your own.
Existing Spark jobs can be used with Pipes without any modifications. In this case, Dagster will receive logs from the job, but not events such as asset checks or attached metadata.
Additionally, it's possible to send events to Dagster from the job by utilizing the dagster_pipes module. This requires minimal code changes on the job side.
This approach also works for Spark jobs written in Java or Scala, although we don't have Pipes implementations for emitting events from those languages yet.
Spark Declarative Pipelines (SDP) Integration
Dagster also provides native support for the new Spark Declarative Pipelines (SDP) framework (Spark 4.0+). For organizations using SDP to define datasets via decorators or SQL, Dagster offers a dedicated component to seamlessly orchestrate these pipelines without duplicating code.
The SparkDeclarativePipelineComponent leverages the spark-pipelines CLI to automatically discover your datasets and dependencies using spark-pipelines dry-run.
Key benefits include:
- Auto-Discovery: No need to manually define AssetSpecs. The component infers Materialized Views and Streaming Tables automatically at load time.
- Incremental & Full Refresh: Natively supports both `--refresh` and `--full-refresh` execution modes.
- Real-time Observability: Streams execution logs and events directly back to the Dagster UI during execution.
- UI Clutter Reduction: Pipeline-scoped intermediate datasets (Temporary Views) are automatically filtered out from the Dagster lineage unless explicitly overridden.
You can quickly initialize a new SDP component in your Dagster project using the dg CLI:
```shell
dg scaffold component dagster_spark.SparkDeclarativePipelineComponent my_sdp_pipeline --pipeline-spec-path ./path/to/spark-pipeline.yml
```
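For reference, scaffolding a component like this typically produces a YAML definition in your project. The sketch below follows the general `dg` component conventions (`type` and `attributes` keys); the exact attribute schema for this component is an assumption based on the `--pipeline-spec-path` flag above.

```yaml
# Sketch of a scaffolded component definition; the attribute name is
# inferred from the CLI flag and may differ in the actual schema.
type: dagster_spark.SparkDeclarativePipelineComponent
attributes:
  pipeline_spec_path: ./path/to/spark-pipeline.yml
```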