Community Articles

🇪🇸 Why the DataFrame is the most important data object in distributed processing

Coffee77
Honored Contributor II

🇪🇸 In this video, created as a reminder for my poor long-term memory, I explain in a simple way:

  • ✅ What a DataFrame is
  • ✅ How it is distributed across partitions
  • ✅ How it runs on a cluster (driver and workers)
  • ✅ What happens during a shuffle
  • ✅ How partitions, jobs, stages, shuffles and tasks are related
  • ✅ Why it is the key piece in Databricks and Apache Spark

🇬🇧 Do you know why the DataFrame is the most important data object in distributed processing?

In this video, created as a reminder for my poor long-term human memory, I explain in a simple way:

  • ✅ What a DataFrame is
  • ✅ How it's distributed across partitions
  • ✅ How it runs in a cluster (driver and workers)
  • ✅ What happens during a shuffle
  • ✅ How partitions, jobs, stages, shuffles and tasks are related
  • ✅ Why it's the key component in Databricks and Apache Spark

For now it's available only in Spanish; who knows, maybe an English version later ...


Lifelong Solution Architect Learner | Coffee & Data
3 REPLIES

szymon_dybczak
Esteemed Contributor III

Thanks for sharing @Coffee77 !

Coffee77
Honored Contributor II

Recently I've been creating some "self-reminder" videos to help my poor long-term human memory 😞 and maybe to help others. Understanding the internals of DataFrames, how partitions relate to jobs, stages, shuffles and tasks, and how transformations or aggregations are executed on a cluster is something that can make your project fail or succeed.
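The shuffle side of that relationship can be sketched in plain Python: during a shuffle, rows are routed to target partitions by hashing their key, which is why every row with the same key ends up in the same task afterwards. A toy sketch (my own helper names, not Spark's actual implementation):

```python
from collections import defaultdict

def hash_partition(rows, key_fn, num_partitions):
    """Route each row to hash(key) % num_partitions -- the same idea a
    Spark shuffle applies when redistributing rows across the cluster."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(key_fn(row)) % num_partitions].append(row)
    return partitions

# Every row with the same key lands in the same partition, so the per-key
# aggregation in the next stage can run locally inside each task.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(rows, key_fn=lambda r: r[0], num_partitions=4)
```

This is also why wide transformations are expensive: unlike narrow ones, the routing step moves data between workers over the network.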

In my current project, I had to deal with complex scenarios combining the processing of small, medium and large DataFrames loaded from input files on the same all-purpose cluster, with challenging requirements on concurrency (up to 50-70 concurrent jobs/pipelines), very complex DAGs, the same platform/code for all inputs, and even ad-hoc AI-generated transformations. Only after fully understanding and monitoring what was going on in the background were we able to make those pipelines run with acceptable performance on small files and very good performance on large or very large files compared with the legacy platform. All of this while keeping CPU and memory levels stable (especially on the driver node) and trying not to increase our hardware/cluster costs too much.
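One knob that matters a lot in mixed-size scenarios like this is the shuffle-partition count: too many partitions for small inputs wastes task-scheduling overhead, too few for large inputs underuses the cluster. Below is a hypothetical sizing helper of my own (the ~128 MB target is purely illustrative); the real Spark knobs are `spark.sql.shuffle.partitions` and adaptive query execution (`spark.sql.adaptive.enabled`), which can coalesce shuffle partitions automatically:

```python
# Hypothetical heuristic: derive a shuffle-partition count from input size,
# aiming at roughly 128 MB per partition. Illustrative, not a recommendation.
def shuffle_partitions_for(input_bytes: int,
                           target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, -(-input_bytes // target_partition_bytes))  # ceil division

small = shuffle_partitions_for(10 * 1024 * 1024)   # 10 MB input -> 1 partition
large = shuffle_partitions_for(50 * 1024 ** 3)     # 50 GB input -> 400 partitions

# Applied per pipeline (sketch):
# spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_for(size))
```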

I'll write a post about it when possible, to help in similar scenarios and get feedback from brilliant Databricks experts like you 🙂


Lifelong Solution Architect Learner | Coffee & Data

szymon_dybczak
Esteemed Contributor III

That's such a great idea. Can't wait for another post 🙂