Spark is a distributed data processing engine that is widely used in batch and stream processing platforms. Building such platforms comes with its fair share of challenges beyond those familiar from continuous delivery in the microservices world, such as data drift, bad data, and data security.
Just as the Twelve-Factor App is an outstanding methodology/set of principles for building web apps, we at Sahaj have realised over time that a similar set of concepts/patterns can be applied to building data processing applications using frameworks like Spark. In this article, we attempt to document these observations, which we believe can serve as ideal practices for defining a basic structure for production-grade Spark applications.
I. Codebase
One codebase tracked in revision control, many deploys.
Ensure that the code is version controlled and that changes are promoted from a lower environment (dev/qa) to a higher environment. This ensures that you can test changes before rolling them out and, in case of any issues, easily roll back. It also provides a traceable artifact for investigating issues.
II. Dependencies
Explicitly declare and isolate dependencies.
There are several choices for declaring and locking dependencies. To pin dependencies (including transitive dependencies) to specific versions and get a reproducible build, use tools such as Poetry or Pipenv in Python; in Java/Scala, sbt, Gradle, and Maven do the same.
In the Python ecosystem, several libraries rely on binary dependencies directly installed in the system. In such scenarios, use containerized images to ensure that the system dependencies are packaged into a tested image and are promoted across environments.
III. Config
Store config in the environment.
Spark lets you override several settings at runtime via environment variables or command-line parameters, so you can tune the application and handle changes in data volume or infrastructure without changing code.
For other configurations, parameterise using named CLI args. Every language has libraries for processing command-line arguments that can validate the types as well as supported values and ensure that the application doesn’t run with junk values.
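As an illustration, a minimal sketch using Python's argparse (the argument names, choices, and defaults are hypothetical):

```python
import argparse

def parse_args():
    # Named, typed CLI arguments; argparse rejects junk values before the job starts.
    parser = argparse.ArgumentParser(description="Daily aggregation job")
    parser.add_argument("--run-date", required=True, help="Partition date, e.g. 2024-01-31")
    parser.add_argument("--input-path", required=True, help="Base path of the input dataset")
    parser.add_argument("--output-path", required=True, help="Base path for the job output")
    parser.add_argument("--mode", choices=["full", "incremental"], default="incremental",
                        help="Only the listed values are accepted")
    parser.add_argument("--max-records", type=int, default=1_000_000,
                        help="Non-integer values fail validation up front")
    return parser.parse_args()
```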
Externalise secrets and access them at runtime. Use tools like Vault, AWS Secrets Manager and service roles to ensure that keys can be easily rotated.
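A minimal sketch of resolving a secret at runtime, assuming AWS Secrets Manager via boto3 and a hypothetical secret name; a Vault client would follow the same shape:

```python
import os
import boto3

def get_db_password():
    # Prefer an injected environment variable (e.g. populated by Vault or the scheduler)...
    secret = os.environ.get("DB_PASSWORD")
    if secret:
        return secret
    # ...and fall back to fetching it at runtime so the value never lives in code or config files.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId="prod/warehouse/db-password")  # hypothetical name
    return response["SecretString"]
```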
IV. Backing services
Treat backing services as attached resources.
Spark provides abstractions for dealing with various data sources such as JDBC, HDFS, S3, etc. By treating these as resources and externalising their configuration, you gain simplified testing, pluggable resources, and reduced vendor lock-in. Use custom connectors/readers instead of interacting with external sources such as HTTP/FTP directly through client libraries, so that the Spark job focuses primarily on business logic.
Build/use abstractions that make it easier to run jobs locally with minimal external dependencies. For example, instead of relying on a shared S3 bucket or dev database location, treat these as attached resources that can be swapped locally with the filesystem or stub APIs.
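For example, a sketch in PySpark where the storage location is passed in as configuration rather than hard-coded (the dataset names and paths are hypothetical):

```python
from pyspark.sql import DataFrame, SparkSession

def read_orders(spark: SparkSession, base_path: str) -> DataFrame:
    # The base path is an attached resource: s3a://bucket in production,
    # a local directory of test fixtures when running on a laptop.
    return spark.read.parquet(f"{base_path}/orders")

def write_daily_totals(df: DataFrame, base_path: str) -> None:
    df.write.mode("overwrite").parquet(f"{base_path}/daily_totals")
```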
For infra services such as performance metrics, Spark supports configurable sinks so that key metrics are pushed to systems such as Prometheus, StatsD, and Ganglia.
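As a hedged illustration, assuming Spark 3.x (where metrics properties can also be supplied as Spark configuration under the spark.metrics.conf. prefix) and a StatsD sink, the wiring might look roughly like this; verify the exact keys against the Spark monitoring documentation for your version:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders-aggregation")
    # Push driver/executor metrics to a StatsD sink instead of keeping them on the cluster.
    .config("spark.metrics.conf.*.sink.statsd.class",
            "org.apache.spark.metrics.sink.StatsdSink")
    .config("spark.metrics.conf.*.sink.statsd.host", "statsd.internal")  # placeholder host
    .config("spark.metrics.conf.*.sink.statsd.port", "8125")
    .getOrCreate()
)
```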
V. Build, release, run
Strictly separate build and run stages.
For Spark applications, the build stage should produce a packaged artifact that includes all dependencies. For Python, this involves building a wheel for your target system; for Scala/Java, it involves creating an uber JAR using tools like Shadow. To reiterate, a containerized image isolates any system dependencies and helps produce reproducible builds that can be promoted from one environment to another.
Invest heavily in writing tests and clean code practices to provide safety nets as well as to avoid data bugs. Depending on the data, the cost of data bugs can be astronomical, and fixing them can involve recomputing the data along with any derived insights. Run tests automatically in a pipeline before building the artifact.
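For instance, a small pytest-style sketch that exercises a transformation against a local SparkSession (the function and column names are illustrative):

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_total_amount(df):
    # The transformation under test: unit price * quantity.
    return df.withColumn("total_amount", F.col("unit_price") * F.col("quantity"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_add_total_amount(spark):
    df = spark.createDataFrame([(10.0, 3), (2.5, 4)], ["unit_price", "quantity"])
    result = add_total_amount(df).select("total_amount").collect()
    assert [row.total_amount for row in result] == [30.0, 10.0]
```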
If the configuration is fully externalised, the release stage is mostly a no-op and a workflow orchestrator can ensure that the right configuration is applied and used in each environment.
VI. Data Schema
Validate data schema and semantics
Prefer data formats such as Parquet/Avro/ORC so that the schema is implicitly part of the data and does not need to be explicitly managed/specified across separate data jobs. For CSV/JSON, explicitly specify the schema so that unintentional changes are caught early.
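For example, a sketch of reading CSV with an explicit schema, assuming an existing SparkSession named spark (the schema and path are illustrative); FAILFAST aborts the read on malformed rows instead of silently turning them into nulls:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=False),
    StructField("quantity", IntegerType(), nullable=True),
])

orders = (
    spark.read
    .schema(orders_schema)          # no schema inference; unexpected columns surface immediately
    .option("header", "true")
    .option("mode", "FAILFAST")     # fail the job on malformed records
    .csv("s3a://raw-bucket/orders/")
)
```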
Create separate schema versions whenever the semantics of existing columns have changed or columns have been introduced/deleted. The intent is to avoid unintentional impact on downstream consumers.
Fail fast! Validate schema and data semantics as part of your job for all inputs so that unintentional changes are caught early. Frameworks such as Deequ and Great Expectations let you integrate such checks into your data pipeline.
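A minimal hand-rolled illustration of the fail-fast idea (dedicated frameworks provide far richer checks); the column names are hypothetical:

```python
from pyspark.sql import DataFrame, functions as F

def validate_orders(df: DataFrame) -> None:
    # Schema check: required columns must be present.
    expected_columns = {"order_id", "order_date", "quantity"}
    missing = expected_columns - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed, missing columns: {missing}")

    # Semantic checks: keys must be present and quantities non-negative.
    null_keys = df.filter(F.col("order_id").isNull()).count()
    negative_qty = df.filter(F.col("quantity") < 0).count()
    if null_keys or negative_qty:
        raise ValueError(
            f"Data check failed: {null_keys} null order_ids, {negative_qty} negative quantities"
        )
```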
VII. Immutability
Prefer atomic, repeatable jobs with partitioned data
Immutability in data is similar to immutability in programming. Since tasks can fail for various reasons, it is important to ensure that re-computation doesn't corrupt the output state. Partition your data so that older/irrelevant data can be phased out easily. For batch processing, fail if output data already exists and write to a separate output location for each run.
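A sketch of this pattern in PySpark, assuming a daily_totals DataFrame and a run_id for the current execution (the paths are illustrative):

```python
# Each run writes to its own location; a rerun cannot silently corrupt earlier output.
output_path = f"s3a://curated-bucket/daily_totals/run_id={run_id}"

(
    daily_totals.write
    .partitionBy("order_date")
    .mode("errorifexists")   # fail fast if this run's output already exists
    .parquet(output_path)
)
```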
For a deterministic model/job, this also helps in rolling out newer versions of the model since they can be easily compared against a previous model for a given input.
Write each application as an atomic job that does one thing well. This ensures that a rerun on failure doesn’t affect anything else. Compose multiple jobs using a workflow orchestrator.
VIII. Scalability
Scale out horizontally with balanced tasks
For Spark applications, leverage parallelisation and balanced tasks so that the job can scale with additional infrastructure. Use CPU/memory/disk optimally and monitor systems to ensure that infrastructure usage scales with data volume.
Don’t collect data on the driver node, where it might exceed the resources available to a single node. Instead, identify mechanisms for distributing the processing across all nodes. For example, Spark lets you define user defined functions (UDFs) that can do complex model/algorithmic processing on your data across nodes/tasks. If you need to collate results into a single artifact (e.g. a GeoJSON file), do so in a separate stage that unifies the smaller parts and has its own infrastructure considerations.
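For instance, a sketch using a Python UDF so that the per-row processing runs on the executors, with the single-artifact collation as a separate step (the scoring logic, DataFrame, and paths are hypothetical):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def risk_score(amount, quantity):
    # Placeholder for model/algorithmic logic; executes on the executors, not the driver.
    if amount is None or quantity is None:
        return None
    return float(amount) * 0.1 + float(quantity)

scored = orders.withColumn("risk_score", risk_score("total_amount", "quantity"))

# Collating into a single artifact is a separate, deliberate stage with its own sizing.
scored.coalesce(1).write.mode("overwrite").json("s3a://exports/scored_orders/")
```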
IX. Disposability
Idempotent tasks with no side-effects on re-run
Given the distributed nature of Spark, be wary of issues that arise from intermittent availability of backing services, and of the implications of rerunning tasks that have side effects such as updating data stores or HTTP endpoints.
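One common way to make reruns safe is to have each run (re)write only its own partitions, e.g. via dynamic partition overwrite; a sketch assuming Spark 2.3+ and an existing daily_totals DataFrame:

```python
# Overwrite only the partitions produced by this run, so a rerun replaces
# its own data instead of clobbering the whole dataset.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    daily_totals.write
    .partitionBy("order_date")
    .mode("overwrite")
    .parquet("s3a://curated-bucket/daily_totals/")
)
```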
Cloud infrastructure makes it easy to spawn on-demand infrastructure with spot pricing which further helps reduce cost and manage immutable infrastructure. However, this also increases the odds of infrastructure instances being terminated on short notice.
X. Dev/prod parity
Keep development, staging, and production as similar as possible.
As far as possible, keep the environments similar; this helps ensure that cluster sizing, debugging, model validation, etc. can all be done in the dev environment.
However, concerns such as cost, privacy, and time might lead to choosing alternatives such as sampling and anonymisation of data for test environments. If you choose to sample, scale down infrastructure/configuration proportionately. This offers a good middle ground where the data on dev still surfaces data bottlenecks and other issues before they hit the production environment.
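For example, a deterministic sample for the dev environment, assuming an orders DataFrame (the fraction, seed, and path are illustrative):

```python
# A seeded sample keeps dev runs reproducible at roughly 1% of production volume.
dev_orders = orders.sample(fraction=0.01, seed=42)
dev_orders.write.mode("overwrite").parquet("s3a://dev-bucket/orders_sample/")
```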
XI. Observability
Provide insights into runtime application behaviour
Use structured logging to provide visibility into application behaviour. The event logs and Spark History Server give detailed insight into the processed tasks and can help in identifying unbalanced tasks, their durations, DAG visualisation, shuffle issues, etc.
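A minimal structured-logging sketch using only the Python standard library (the field names are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per line so log aggregators can index the fields.
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "context", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-job")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("stage completed", extra={"context": {"stage": "aggregate", "rows": 12345}})
```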
Publish infrastructure and Spark metrics to monitoring systems so that the infrastructure can be used effectively and issues due to incorrectly configured system resources can be investigated.
XII. Data Quality
Define data quality checks to safeguard against data drift
In addition to testing, running data quality checks in your data pipeline helps ensure that any degradation due to data drift is caught early, before downstream jobs consume the data.
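As a simple illustration, a drift-style check that fails the pipeline when the null rate of a key column exceeds a tolerance (the column name and threshold are hypothetical):

```python
from pyspark.sql import DataFrame, functions as F

def check_null_rate(df: DataFrame, column: str, max_null_rate: float = 0.01) -> None:
    total = df.count()
    nulls = df.filter(F.col(column).isNull()).count()
    null_rate = nulls / total if total else 0.0
    if null_rate > max_null_rate:
        # Fail before downstream jobs consume degraded data.
        raise ValueError(f"{column} null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
```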
Define a measure of accuracy for any models and ensure the accuracy score meets a static/dynamic threshold.
Choose actionable data quality metrics which can be acted upon to avoid alert fatigue and noise.