Community Articles

🇪🇸 Why the DataFrame is the most important data object in distributed processing

Coffee77
Honored Contributor II

🇪🇸 In this video, created as a reminder for my poor long-term memory, I explain in a simple way:

  • ✅ What a DataFrame is
  • ✅ How it is distributed across partitions
  • ✅ How it runs on a cluster (driver and workers)
  • ✅ What happens during a shuffle
  • ✅ How partitions, jobs, stages, shuffles and tasks are related
  • ✅ Why it is the key piece in Databricks and Apache Spark

🇬🇧 Do you know why the DataFrame is the most important data object in distributed processing?

In this video, created as a reminder for my poor long-term human memory, I explain in a simple way:

  • ✅ What a DataFrame is
  • ✅ How it's distributed across partitions
  • ✅ How it runs in a cluster (driver and workers)
  • ✅ What happens during a shuffle
  • ✅ How partitions, jobs, stages, shuffles and tasks are related
  • ✅ Why it's the key component in Databricks and Apache Spark

For now it's available only in Spanish; who knows, maybe an English version later ...


Lifelong Solution Architect Learner | Coffee & Data
3 REPLIES

szymon_dybczak
Esteemed Contributor III

Thanks for sharing @Coffee77 !

Coffee77
Honored Contributor II

Recently I've been creating some "self-reminder" videos to help my poor long-term human memory 😞 and maybe to help others. Understanding the internals of DataFrames, how partitions relate to jobs, stages, shuffles and tasks, and how transformations or aggregations are executed on a cluster is something that can make your project fail or succeed.
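The shuffle side of that relationship can be sketched in plain Python: during a shuffle, rows are routed to target partitions by hashing their key, which is why every row with the same key ends up in the same task afterwards. A toy sketch (my own helper names, not Spark's actual implementation):

```python
from collections import defaultdict

def hash_partition(rows, key_fn, num_partitions):
    """Route each row to hash(key) % num_partitions -- the same idea a
    Spark shuffle applies when redistributing rows across the cluster."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(key_fn(row)) % num_partitions].append(row)
    return partitions

# Every row with the same key lands in the same partition, so the per-key
# aggregation in the next stage can run locally inside each task.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(rows, key_fn=lambda r: r[0], num_partitions=4)
```

This is also why wide transformations are expensive: unlike narrow ones, the routing step moves data between workers over the network.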

In my current project, I had to deal with complex scenarios combining the processing of small, medium and large DataFrames loaded from input files on the same all-purpose cluster, with challenging requirements on concurrency (up to 50-70 concurrent jobs/pipelines), very complex DAGs, the same platform/code for all inputs, and even ad-hoc AI-generated transformations. Only after fully understanding and monitoring what was going on in the background were we able to make those pipelines run with acceptable performance on small files and very good performance on large or very large files compared with the legacy platform. All of this while keeping CPU and memory levels stable (especially on the driver node) and trying not to increase our hardware/cluster costs too much.
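One knob that matters a lot in mixed-size scenarios like this is the shuffle-partition count: too many partitions for small inputs wastes task-scheduling overhead, too few for large inputs underuses the cluster. Below is a hypothetical sizing helper of my own (the ~128 MB target is purely illustrative); the real Spark knobs are `spark.sql.shuffle.partitions` and adaptive query execution (`spark.sql.adaptive.enabled`), which can coalesce shuffle partitions automatically:

```python
# Hypothetical heuristic: derive a shuffle-partition count from input size,
# aiming at roughly 128 MB per partition. Illustrative, not a recommendation.
def shuffle_partitions_for(input_bytes: int,
                           target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, -(-input_bytes // target_partition_bytes))  # ceil division

small = shuffle_partitions_for(10 * 1024 * 1024)   # 10 MB input -> 1 partition
large = shuffle_partitions_for(50 * 1024 ** 3)     # 50 GB input -> 400 partitions

# Applied per pipeline (sketch):
# spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_for(size))
```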

I'll write a post about it when possible, to help in similar scenarios and get feedback from brilliant Databricks experts like you 🙂


Lifelong Solution Architect Learner | Coffee & Data

szymon_dybczak
Esteemed Contributor III

That's such a great idea. Can't wait for another post 🙂