In this video, @mehdio dives into a new experimental DuckDB feature: running PySpark code on the DuckDB engine ⚡

Note: This is not yet supported on MotherDuck ☁️🦆

Start using DuckDB in the Cloud for FREE with MotherDuck:

📓 Resources
* GitHub repo of the tutorial:
* Niels Claes's benchmark on SQL engines:

➡️ Follow Us
LinkedIn:
X (formerly known as Twitter):
Blog:

0:00 Intro
0:53 Challenges of Apache Spark development
3:24 The Java boatload
6:01 PySpark with DuckDB demo
8:10 A word about benchmarks
8:50 Limitations
9:26 Conclusions

#duckdb #pyspark #apachespark #dataengineering

--------------------------------------

While Apache Spark revolutionized data pipelines, its cluster-first architecture creates overhead for modern development. This video explores a game-changing alternative: DuckDB's experimental PySpark API compatibility. Learn how to run your existing PySpark code with the lightweight power of DuckDB under the hood, for faster, more efficient data processing on today's powerful single machines and a simpler developer workflow.

We break down the common challenges of developing with Apache Spark on small to medium data. Spark's design introduces significant overhead for local development, CI pipelines, and incremental jobs. We examine how serverless Spark offerings from cloud providers often enforce a minimum two-node cluster, leading to higher costs and slow startup times for small jobs. This makes a lightweight, resource-efficient setup a critical need for many data engineering tasks.

Discover how DuckDB's PySpark API offers a powerful solution. By replacing the JVM with DuckDB's native engine, you can dramatically shrink your container image size, leading to faster CI builds and quicker local iterations. Our hands-on demo walks you through running the same PySpark script against both pure PySpark and DuckDB with a simple import switch (a sketch follows below), showcasing significant performance gains and a more streamlined development workflow without the JVM's cold-start delay.

Although the PySpark API in DuckDB is still experimental, it's already a valuable tool for accelerating your development process. We show how you can use it today to speed up unit testing for Spark jobs by leveraging DuckDB for fast in-memory transformations (see the test sketch below). This feature marks a significant milestone as the first Python code in DuckDB, making community contributions easier than ever. We'll show you where to find the code and how you can help expand this powerful new capability.
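To give a feel for the import switch, here is a minimal sketch (the data and column names are illustrative, not from the tutorial repo). With stock PySpark the import would be "from pyspark.sql import SparkSession"; DuckDB's experimental compatibility layer lives under duckdb.experimental.spark, and the rest of the script stays the same:

import pandas as pd
# Swap this import for "from pyspark.sql import SparkSession" to run on Spark instead
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# Familiar PySpark DataFrame API, executed by DuckDB with no JVM
df = spark.createDataFrame(pd.DataFrame({"name": ["Ada", "Grace"], "score": [95, 98]}))
df = df.withColumn("passed", lit(True))
print(df.select(col("name"), col("passed")).collect())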

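And a sketch of the unit-testing idea, assuming a hypothetical transformation adults_only() rather than anything from the video's repo; the test runs Spark-style logic in-memory on DuckDB, so no cluster or JVM start-up is needed:

import pandas as pd
import pytest
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col

def adults_only(df):
    # The transformation under test, written with the plain PySpark DataFrame API
    return df.filter(col("age") >= 18)

@pytest.fixture
def spark():
    return SparkSession.builder.getOrCreate()

def test_adults_only(spark):
    df = spark.createDataFrame(pd.DataFrame({"name": ["Ada", "Sam"], "age": [36, 12]}))
    assert len(adults_only(df).collect()) == 1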









