Apache Spark 3.0 adds Nvidia GPU support for machine learning

On May 16, 2020

Apache Spark, the in-memory big data processing framework, will become fully GPU accelerated in its soon-to-be-released 3.0 incarnation. Best of all, today's Spark applications can take advantage of the GPU acceleration without modification; existing Spark APIs all work as-is.
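In practice, "without modification" suggests that acceleration is switched on through job configuration rather than code changes. The Python sketch below assembles a spark-submit command line for an unchanged application. The configuration keys come from the Spark 3.0 resource-scheduling and RAPIDS Accelerator documentation as best understood, not from Nvidia's announcement, and the helper name, default values, and script name are hypothetical.

```python
from typing import Dict, List


def gpu_submit_args(app: str,
                    gpus_per_executor: int = 1,
                    task_gpu_fraction: float = 0.25) -> List[str]:
    """Build a spark-submit argument list that requests GPUs for an unchanged app.

    Hypothetical helper: the keys below are assumed from Spark 3.0 and
    RAPIDS Accelerator docs; check them against the release you deploy.
    """
    conf: Dict[str, str] = {
        # Number of GPUs each executor asks the cluster manager for.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Share of a GPU each task claims (0.25 lets four tasks share one GPU).
        "spark.task.resource.gpu.amount": str(task_gpu_fraction),
        # Load the RAPIDS SQL plugin and enable GPU execution of SQL/DataFrame plans.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
    }
    args: List[str] = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application itself is passed through untouched
    return args


if __name__ == "__main__":
    # "etl_job.py" is a placeholder for any existing Spark application.
    print(" ".join(gpu_submit_args("etl_job.py")))
```

A real cluster would likely need more than this, such as a GPU discovery script or cluster-manager-specific settings, so treat the output as a starting point rather than a complete recipe.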

The GPU acceleration components, provided by Nvidia, are designed to complement all phases of Spark applications including ETL operations, machine learning training, and inference serving.

Nvidia's Spark contributions draw on the RAPIDS suite of GPU-accelerated data science libraries. Many of RAPIDS' internal data structures, like dataframes, complement Spark's own, but getting Spark to use RAPIDS natively has taken nearly four years of work.

Spark 3.0 speedups don't come solely from running computations on GPUs. The release also reaps performance gains by minimizing data movement to and from GPUs. When data does need to be moved across a cluster, the Unified Communication X (UCX) framework shuttles it directly from one block of GPU memory to another with minimal overhead.

According to Nvidia, a preview release of Spark 3.0 running on the Databricks platform yielded a sevenfold performance improvement when using GPU acceleration, though details about the workload and its dataset were not available.

No firm date has been given for general availability of Spark 3.0. You can download preview releases from the Apache Spark project website.