a month ago
Recently, I have been creating some "self-reminder" videos to help my poor long-term memory 😞, and maybe to help others. Understanding the internals of DataFrames (how partitions relate to jobs, stages, shuffles, and tasks, and how transformations and aggregations are executed on the cluster) is something that can make your project fail or succeed.
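As a rough mental model of that partition-to-task relationship (plain Python, not Spark; the thread pool standing in for executor cores is my own analogy), each partition of a DataFrame becomes one task in a stage, only a limited number of tasks run at once, and per-partition results are combined at the end:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "DataFrame": a dataset split into partitions. In Spark, each
# partition of a stage becomes exactly one task.
data = list(range(100))
num_partitions = 8
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def task(partition):
    # A narrow transformation plus a partial aggregation runs
    # independently on each partition, with no data exchange.
    return sum(x * 2 for x in partition)

# The "cluster" here has 4 cores, so at most 4 tasks run concurrently;
# the remaining tasks wait, just as Spark queues tasks per core slot.
with ThreadPoolExecutor(max_workers=4) as executor:
    partial_results = list(executor.map(task, partitions))

# The final step combines the per-partition results (in Spark this is
# where a shuffle or a reduce on the driver would happen).
total = sum(partial_results)
print(total)  # → 9900, same as sum(x * 2 for x in data)
```

It is only an analogy (real Spark tasks run in executor JVMs, not threads in one process), but it helped me reason about why the partition count bounds a stage's parallelism.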
In my current project, I had to deal with complex scenarios combining the processing of small, medium, and large DataFrames loaded from input files on the same all-purpose cluster, with challenging requirements: high concurrency (up to 50-70 concurrent jobs/pipelines), very complex DAGs, the same platform/code for all inputs, and even ad-hoc AI-generated transformations. Only after fully understanding and monitoring what was going on in the background were we able to make those pipelines run with acceptable performance on small files, and with excellent performance on large or very large files compared with the legacy platform. All of this while keeping CPU and memory levels stable (especially on the driver node) and avoiding a large increase in our hardware/cluster costs.
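For the concurrency side of scenarios like this, one standard lever is Spark's FAIR scheduler with pools, so that many concurrent jobs share the cluster instead of queuing FIFO behind a large one. A minimal sketch (the pool names and paths here are illustrative, not from my actual project):

```
# spark-defaults.conf (illustrative)
spark.scheduler.mode FAIR
spark.scheduler.allocation.file /path/to/fairscheduler.xml
```

```xml
<!-- fairscheduler.xml: give small pipelines a guaranteed share
     so they are not starved by large ones -->
<allocations>
  <pool name="small_pipelines">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="large_pipelines">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

Jobs are then assigned to a pool per thread via `sc.setLocalProperty("spark.scheduler.pool", "small_pipelines")`. Whether this fits depends on the workload; it is one of several knobs, not the whole story.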
I'll write a post about it when possible, to help others in similar scenarios and get feedback from brilliant Databricks experts like you 🙂