cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
Debu-Sinha
Databricks Employee
Databricks Employee

As a machine learning practitioner, I’m incredibly excited about the release of Apache Spark 4.0 (officially announced May 23, 2025). This landmark update brings a bevy of new features and improvements that make distributed data science and AI workloads easier and faster than ever. In this post, I’ll walk through the highlights most relevant to ML and data science – from running ML on the new Spark Connect architecture, to built-in deep learning support, native DataFrame visualization, and substantial performance gains – and share why these changes matter in practice. I’ll also touch on migration considerations (like updated language versions and configuration flags) to help you smoothly upgrade to Spark 4.0.

ML Training via Spark Connect (Lightweight Client-Server Architecture)

One of the game-changing updates is machine learning on Spark Connect. Spark Connect, introduced in the 3.x line, is Spark’s new client-server architecture that decouples the client from the cluster. With Spark 4.0, Spark Connect reaches near feature-parity with classic Spark – including support for ML pipelines and model training over the Connect protocol. In practical terms, this means I can build and train machine learning models from a lightweight Python client (the new pyspark-client ~1.5 MB) that remotely executes on a Spark cluster. The entire Spark MLlib API (DataFrame-based pipelines, transformers, estimators) is now usable via Spark Connect, so I can .fit() and .transform() models from anywhere without running a full Spark driver locally. This remote ML capability opens up new workflows – for example, training models from a Jupyter notebook on my laptop while the heavy lifting happens on a large Spark cluster in the cloud. Under the hood, Spark 4.0 expanded the API coverage of Spark Connect and introduced a new config (spark.api.mode) to easily toggle between classic and connect modes. I’ve found that switching an existing ML workflow to Spark Connect is often as simple as setting spark.api.mode="connect" when initializing Spark, with no code changes required, thanks to the strong compatibility. This is a big win for productivity: I get the scalability of Spark’s engine with the interactivity of a lightweight client.

Distributed Deep Learning with TorchDistributor and Barrier Execution

Screenshot 2025-05-30 at 12.21.12 PM.png

Spark 4.0 makes distributed deep learning easier: the TorchDistributor API launches your training function (main_fn) on each executor in parallel (diagram from NVIDIA). Executors run the same code and communicate as peers, leveraging Spark’s barrier execution mode for synchronization.

For those of us integrating deep learning into data pipelines, Spark 4.0 brings mature support for distributed training of neural networks. Spark’s MLlib now includes the TorchDistributor class (introduced in 3.4, improved in 4.0) to natively run PyTorch code on Spark clusters. You can launch a training function simultaneously on multiple Spark executors – each can utilize a GPU – and Spark will handle coordinating the tasks (using its barrier execution mode to start all tasks together). I was impressed with how little code change is required: you wrap your existing PyTorch training loop (or PyTorch Lightning trainer) in a TorchDistributor.run(...) call, and Spark takes care of spinning up the workers and aggregating the results. The TorchDistributor API was clearly inspired by prior solutions for TensorFlow (the spark-tensorflow-distributor), and Spark 4.0 continues to support distributed TensorFlow training in a similar vein. Under the hood, Spark’s robust barrier scheduling ensures all parallel training tasks start together and can communicate (e.g., for all-reduce operations) just as if you were using Horovod or MPI – but without manually managing cluster setup.

Beyond training, Spark also makes distributed model inference more efficient. In Spark 3.4, the developers introduced a convenient predict_batch_udf API for batch inference on DataFrames. I frequently use this in Spark 4.0 to apply complex models (like Hugging Face transformers for NLP) to large datasets in parallel. You define a function that loads your model and does prediction on a NumPy batch, and Spark will handle vectorizing the data into batches, shipping it to executors, and caching the model in each worker process. This avoids the overhead of one-record-at-a-time UDF scoring. In practice, I’ve seen this significantly speed up inference on big data. For example, using predict_batch_udf with a Hugging Face SentenceTransformer to embed text, I could distribute the workload across 8 executors and process millions of sentences much faster than a single-node approach – all with a few lines of PySpark code. The combination of these features means Spark 4.0 is not just for “traditional” ML on structured data – it’s equally a powerful platform for distributed deep learning tasks, integrating with the Python AI ecosystem (PyTorch, TensorFlow, Hugging Face) smoothly.

Native DataFrame Visualization (No Pandas Required)

Another quality-of-life enhancement I love in Spark 4.0 is the new native plotting support in PySpark DataFrames. Previously, to quickly visualize data from a Spark DataFrame, I would often collect a subset to pandas and use pandas.DataFrame.plot() or resort to external tools. Now Spark 4.0 adds a built-in DataFrame.plot() API for common charts (histograms, bar charts, scatter plots, etc.) directly on PySpark DataFrames. This means I can generate exploratory visuals on distributed datasets without converting to pandas. Under the hood, Spark implements these plots by leveraging its SQL engine and optionally integrations with libraries like Plotly for rich output. For example, Spark 4.0 can render a histogram of a column by computing the bin counts in Spark SQL (distributed) and then plotting the result for you. Kernel density estimates (KDE) and box plots are also supported. This native plotting API dramatically streamlines the EDA (exploratory data analysis) process – I can quickly check the distribution of a feature or visualize relationships while staying in PySpark. The charts can be rendered in notebooks or saved, and because the heavy lifting (aggregations, etc.) happens in Spark, it scales to large data. In my testing, I was able to plot a histogram of 100 million data points in a Spark DataFrame with just one line of code, something that would have been painful to do manually before. This feature makes Spark much more user-friendly for data scientists.

Performance Improvements for ML Workloads (20–50% Faster, Smarter Engine)

No Spark release is complete without performance boosts, and 4.0 delivers plenty. The Spark engine has evolved with optimizations that particularly benefit ML and ETL workloads. Here are some notable improvements contributing to Spark 4.0’s 20–50% speedups in various workloads:

  • Adaptive Query Execution (AQE) Enhancements: Spark’s query optimizer can now make even smarter decisions at runtime. Spark 4.0 refines AQE (first introduced in 3.0) so much that some complex query workloads run up to 30% faster than on Spark 3.x (and 3× faster than on Spark 2.x) according to internal benchmarks. AQE can automatically adjust join strategies, reduce data skew, and optimize shuffle partitions on the fly. These improvements benefit ML data prep pipelines which often involve joins and aggregations on large tables – in my experience, model training jobs end up spending less time waiting on shuffles and skewed joins, thanks to AQE. There are even cases of overall job runtime dropping by roughly one-fifth simply by upgrading to Spark 4.0, with no code changes, due to these smarter optimizations.

  • Arrow & Pandas UDF Boosts: Cross-language overhead is lower now. Apache Arrow is upgraded under the hood (to Arrow 18) and integration with Pandas UDFs is improved. This means faster serialization and data transfer between the JVM and Python, which is great for ML workloads using Python code. Pandas UDFs in Spark 4.0 take advantage of Arrow’s columnar format for efficiency. In practice, I’ve noticed heavy Python UDF sections running noticeably faster and using less memory. If you use vectorized UDFs for feature engineering or apply models, Spark 4.0’s optimized data exchange can easily give double-digit percentage speedups.

  • Memory and Shuffle Optimizations (RocksDB Backend): Spark 4.0 shifts more heavy lifting off the JVM heap, reducing GC pressure for large jobs. Notably, the default implementation for the external shuffle service and for streaming state management now uses a RocksDB backend instead of in-memory or LevelDB storage. RocksDB is a fast key-value store that keeps data off-heap (native memory) and on disk. By using RocksDB for shuffle data and state, Spark 4.0 can handle very large shuffles and streaming states with much lower memory usage on the JVM, avoiding out-of-memory errors and GC stalls. This translates to more stable long-running ML pipelines and the ability to scale to higher loads. One Spark streaming job I worked on (maintaining online ML feature state) saw about a 40% drop in JVM heap usage after we enabled RocksDB state store – a huge win for reliability. The migration guide notes that spark.shuffle.service.db.backend defaults to ROCKSDB now, which you can override if needed, but in most cases the new default is superior.

  • GPU Acceleration Support: While not brand-new in 4.0, Spark continues to improve its support for GPUs and accelerators, which is crucial for AI workloads. Projects like NVIDIA’s RAPIDS Accelerator for Spark and Spark RAPIDS ML integrate nicely, allowing Spark to push certain operations (e.g., XGBoost training, cuML algorithms) to GPUs for 10× speed-ups in some cases. Spark 4.0 doesn’t radically change GPU APIs at the core, but it benefits from all the 3.x groundwork. If you have access to GPUs, Spark 4.0 is the best Spark yet for utilizing them – the combination of the Arrow optimizations, better task scheduling, and libraries like RAPIDS can yield huge performance boosts for data prep and model training on GPUs. I anticipate even more GPU-aware improvements in future Spark releases, but 4.0 already solidifies Spark as a viable distributed GPU computing platform for data science.

Overall, Spark 4.0 feels snappier and more efficient. In my own ML workflows, I’ve observed end-to-end training pipelines running ~20-30% faster after upgrading to 4.0, thanks to the engine improvements above. Resource utilization (CPU, memory) is more balanced, and the cluster just handles big data smoother – which ultimately means I iterate on models faster.

New Data Types and SQL Enhancements (Better for Semi-Structured Data)

On the data handling side, Spark 4.0 introduces some very useful features for dealing with complex or semi-structured data, as well as general SQL improvements. A headline addition is the new VARIANT data type in Spark SQL. Similar to Snowflake’s variant type, this allows a column to hold semi-structured data (e.g. JSON or XML fragments, maps, arrays of mixed types) while still being queryable. Essentially, a VARIANT column can store arbitrary objects or values without losing schema information. In practice, this is great for data science workloads where you might have nested JSON data or varying schema across records – you can keep it in a DataFrame as a single column of type VARIANT and use built-in functions to introspect or unwrap it. Spark 4.0 also adds a native XML data source and parser, so you can read XML files directly into DataFrames (no more third-party libraries or custom parsing for XML). Combining XML support with the VARIANT type, Spark becomes much friendlier to ingesting and exploring NoSQL-style or document data.

SQL users will also appreciate that Spark 4.0 is now ANSI SQL compliant by default. The ANSI mode (which was optional in 3.x) brings stricter handling of syntax and corner cases – e.g., making Spark SQL behave more like standard SQL (throwing errors on precision loss, using true NULL semantics, etc.). This default ANSI mode can catch potential bugs in queries and makes Spark SQL more predictable if you come from a SQL background. If your old Spark jobs rely on the lenient behavior, you can still disable ANSI mode, but I recommend embracing it for correctness.

There are numerous other SQL improvements that aid data analysis. Spark 4.0 adds support for SQL stored procedures and control flow (via SQL scripting features). This means you can write multi-statement SQL routines with variables, IF/WHILE logic, etc., directly in Spark SQL – useful for ETL tasks orchestrated purely in SQL. We also get SQL UDFs (user-defined functions defined in SQL) which help encapsulate complex logic within queries. Another neat addition is the PIPE syntax (|> operator) in SQL that allows piping the output of one query into the next, making SQL workflows more readable in a functional style. String handling is improved too: Spark now supports collation for string columns (finally respecting locale-specific sorting and comparison rules). All these enhancements make Spark’s SQL more powerful for data exploration and prep – whether you prefer writing SQL or using DataFrame APIs, Spark 4.0 provides a richer feature set that brings it closer to the capabilities of traditional RDBMS systems while retaining massive scalability.

Streaming and Real-Time ML: Arbitrary Stateful Processing (transformWithState)

Real-time data processing in Spark Structured Streaming also gets a significant boost, which is great news for online ML and event-driven applications. Spark 4.0 introduces a new Arbitrary Stateful Processing API v2, anchored by the transformWithState operation. This is a unification and improvement over the older mapGroupsWithState/flatMapGroupsWithState APIs, offering a more flexible way to implement custom stateful stream logic. With transformWithState, you can arbitrarily read and update state for each event in a stream, using a user-defined state object (or multiple state variables) that Spark will manage and checkpoint. Unlike the old API, the new one supports multiple state data types – for example, you can use ValueState (one value per key), ListState (appendable list per key), or MapState (key-value map per key) depending on your use case. These state types will feel familiar if you’ve used Apache Flink, and they allow more complex patterns (like aggregating a list of recent events or maintaining a map of counts) to be done easily in Spark. The API also supports event-time timers and time-to-live (TTL) eviction of state out-of-the-box, which is crucial for long-running streaming jobs to not accumulate unbounded state.

For ML, this means you can do things like maintain rolling feature statistics, implement online algorithms, or monitor model drift in a streaming fashion, all within Spark. I’m personally excited to use transformWithState for anomaly detection on streaming data – the flexibility to store custom state for each entity (like rolling metrics) makes Spark much more powerful in this area.

To aid with debugging streaming state, Spark 4.0 also adds a State Store Data Source (also called the State Reader API). This lets you query the contents of a streaming query’s state (the intermediate aggregated info) as if it were a table. In practice, when running a streaming ML application, I can connect a Spark SQL session to the checkpoint directory and perform queries to see what’s in memory for each key. This is invaluable for troubleshooting and verifying that your stateful stream logic is working as intended, without awkwardly logging every update. The State Store Data Source provides both a high-level “state metadata” view (keys, counts, etc.) and a detailed “statestore” view of the actual data. If an online model isn’t updating or a feature store is not computing as expected, you can now inspect the state live rather than guessing. This kind of introspection was hard in previous Spark versions and often required custom workarounds – Spark 4.0 makes it straightforward.

Improved Developer Experience (PySpark APIs, UDF Profiling, Error Reporting, Multi-Language Support)

From a developer’s perspective, Spark 4.0 feels more polished and user-friendly. There are several enhancements aimed at making us more productive when writing, debugging, and optimizing Spark applications:

  • New PySpark APIs and Python User-Defined Table Functions: Python remains a first-class citizen in Spark. In 4.0, we get a new Python Data Source API that allows writing custom data sources or sinks in Python (both for batch and streaming). This is a big win if you need to connect Spark to some system for which there isn’t a JVM connector – you can implement it in Python instead. Spark 4.0 also supports Python UDTFs, i.e. user-defined table functions that return multiple rows, with support for polymorphic schemas. This means you can write a Python function that takes a value and returns a table (multiple rows/columns), and use it in Spark SQL – handy for tasks like splitting text into multiple records, or any operation where a single input yields many outputs. These Python API expansions continue to narrow the gap between PySpark and Scala Spark; as a data scientist who prefers Python, I appreciate being able to do more without dropping to Scala/Java.

  • Unified UDF Profiler: Tuning UDFs (especially in PySpark) has historically been challenging. Spark 4.0 introduces a unified profiling tool for PySpark UDFs. Essentially, when you enable it, Spark will collect metrics and profiles of your UDF execution, so you can see how much time was spent in the Python functions, how much data was transferred via Arrow, etc. This has already helped me identify bottlenecks – for example, discovering that a particular UDF was slow due to Python object conversion overhead, which led me to rewrite it with pandas vectorization. Having built-in profiling means we can optimize our pipelines more scientifically. No more blind guessing why a UDF is slow; Spark can tell you if it’s CPU-bound, serialization-bound, etc.

  • Better Error Messages and Logging: The development experience is smoother thanks to standardized error classes and structured logging. Spark 4.0 rolled out a Structured Logging Framework that outputs logs in JSON (making them easier to search and parse). It also introduced an Error Class framework for consistent, informative error messages. In practice, this means when something goes wrong, the error messages are more descriptive and include error codes you can look up. I’ve noticed things like null pointer errors or misconfiguration issues now have clearer diagnostics. Additionally, Spark 4.0 has an improved mechanism for surfacing errors from the Dataset API (with context about where in your code the error occurred). All of this reduces the friction in debugging Spark jobs. I also liked that Spark 4.0 formalized a Behavior Change log/process – essentially documenting any breaking changes or behavior differences (like the ANSI mode default) clearly, so developers aren’t caught off guard. Upgrading to 4.0 has been one of the smoothest Spark upgrades for me, partly due to these efforts.

  • Multi-Language Spark and New Connect Clients: Spark is breaking out of the JVM language barrier. With Spark Connect’s decoupled architecture, we’re seeing official/community Spark clients for languages like Go, Rust, and Swift emerge in the Spark 4.0 ecosystem. The Spark project provided a reference Go client and Swift client, and there’s even a Rust client in development. As a test, I wrote a small Rust program using the Spark Connect Rust client to aggregate some data on a Spark cluster – it was thrilling to run Spark operations natively from Rust! This polyglot support means you can integrate Spark processing into a wider range of applications. For instance, a data engineering service written in Go can directly query and process data in a Spark cluster (no need to spawn a subprocess or use JDBC). Or an iOS app (Swift) could offload a heavy query to Spark. While Python and Scala are still the primary languages for Spark, I’m excited to see this expansion – it shows Spark’s potential to be a universal data processing layer accessible from anywhere. And importantly, these new clients don’t sacrifice capability: thanks to Spark Connect’s design, the full DataFrame API is (almost) available in each of these languages, as of Spark 4.0.

Compatibility and Migration Considerations

Upgrading to Spark 4.0 is a pretty straightforward process in my experience, but there are a few important compatibility notes to be aware of. Firstly, Spark 4.0 requires Java 17+ and Scala 2.13 now, dropping support for older Java/Scala versions. If you’re coming from Spark 3.x, ensure your environment is updated to JDK 17 and that any custom Scala libraries are compiled for 2.13. The good news is that Scala 2.13 has been around for a while and most ecosystems have adopted it, so this wasn’t a big hurdle for us. On the Python side, Spark 4.0 drops support for Python 3.7/3.8; Python 3.9 or above is required. This mainly affects older environments – most folks are on Python 3.10+ by now, but just double-check your environment.

One notable change is that Spark 4.0 deprecates SparkR, the R API for Spark. If you were using SparkR for any reason, consider migrating to alternatives (like the sparklyr package which interfaces Spark with R, or just use PySpark). The deprecation indicates SparkR may be removed in a future release, so while it still works in 4.0, it’s a good time to plan an exit strategy from SparkR. For machine learning specifically, the Spark MLlib APIs remain stable in 4.0 – the DataFrame-based ML API is still the recommended approach and it hasn’t seen breaking changes. In fact, certain experimental parts of MLlib have matured: for example, the separate pyspark.ml.connect module (for Spark Connect ML) that was previewed in 3.x has been deprecated because now the main ML API works over Spark Connect natively. This simplifies things – you don’t need special handling for ML on connect. There are a few new algorithms/features in MLlib (I spotted a new feature transformer for target encoding in 4.0), but existing algorithms and pipeline stages behave as before. So your ML code should mostly just run on 4.0, often faster than before.

For a smooth migration, I recommend reading the official Spark 4.0 migration guide and enabling the optional compatibility settings if needed. Spark 4.0 introduces a spark.api.mode configuration (mentioned earlier) which defaults to “classic” for now – meaning your jobs will use the old behavior unless you opt-in to Spark Connect by setting it to “connect”. This flag provides a safety net: you can try Spark Connect in 4.0 and if something doesn’t work, switch back to classic mode easily. In my team, we plan to gradually turn on Spark Connect mode for our applications, one by one, to verify everything works, and eventually run everything in Connect mode to enjoy its benefits. Other than that, just watch out for the ANSI SQL changes (if you had non-ANSI-compliant SQL code, you might need to adjust or disable ANSI mode). The improved error messages in 4.0 actually helped us catch a couple of subtle issues (like relying on silent cast of out-of-range values, which ANSI mode rightfully errors on now).

Bottom line: Apache Spark 4.0 is a robust, future-ready platform that I’m eager to use for upcoming ML projects. The combination of ease-of-use improvements (like plotting and better APIs), deep integration of ML/DL frameworks, and the raw performance gains (both at the engine level and via better use of hardware like GPUs) makes it a compelling upgrade. The first time I ran one of our heavy ML pipelines on Spark 4.0, I was delighted to see it complete nearly 40% faster than before and with fewer memory hiccups – that’s the kind of improvement that directly translates to faster experimentation and iteration in data science. Spark 4.0 manages to bring these benefits while staying compatible with existing code and workflows for the most part. If you’re in the data engineering or data science space and haven’t explored Spark 4.0 yet, I highly recommend giving it a try. It feels like Spark has truly entered a new era where it can serve as a unified engine for all things data – from ETL to interactive analytics to machine learning and streaming – with elegance and efficiency. Happy Sparking with 4.0!

Sources