03-04-2026 02:39 PM
I definitely agree with Louis on this. He provided some great suggestions, including from the code/logic perspective.
A couple of things I would suggest while developing the logic:
1. As Louis mentioned, when you use Databricks, always think in terms of Spark (a distributed compute framework) rather than Pandas. Transform your data as needed and filter it down to only what is necessary before running your model using Taylor.
2. Depending on the data size, use the Spark UI to check the logs and the utilisation of memory, CPU, and storage; this helps you decide whether you need to change your compute type. Since your whole logic is written in Pandas, it is likely that only the driver node is doing any work. So, if you want to keep everything in Pandas, use a standalone (single-node) cluster instead of an all-purpose cluster. If you see CPU or memory utilisation at 100%, try increasing the capacity of the node. I would suggest memory-optimized nodes.
3. As Louis mentioned, divide your data into chunks and run your model on each chunk instead of feeding all the data in at once.
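
To illustrate point 3, here is a minimal sketch of chunked processing in plain Python. It is only an outline of the pattern, not your actual pipeline: `iter_chunks`, `run_in_chunks`, and `score_chunk` are hypothetical names, and `score_chunk` is a stand-in for whatever model call you really make. If you stay in Pandas, `pandas.read_csv` accepts a `chunksize` argument that gives you a similar iterator of DataFrames.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def iter_chunks(rows: Iterable[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most chunk_size items from rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def run_in_chunks(
    rows: Iterable[Any],
    chunk_size: int,
    score: Callable[[List[Any]], List[Any]],
) -> List[Any]:
    """Apply score to one chunk at a time and collect the results in order."""
    results: List[Any] = []
    for chunk in iter_chunks(rows, chunk_size):
        # Only one chunk of input is held in memory at a time.
        results.extend(score(chunk))
    return results


def score_chunk(chunk: List[dict]) -> List[int]:
    # Hypothetical placeholder for the real model call.
    return [row["value"] * 2 for row in chunk]


# Example input: a generator, so rows are produced lazily.
rows = ({"value": i} for i in range(10))
predictions = run_in_chunks(rows, chunk_size=4, score=score_chunk)
```

The chunk size is something to tune against the memory utilisation you see on the driver; the collected results still have to fit in memory, so for very large outputs you may prefer to write each chunk's results out as you go.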
Hope this helps.
Regards,
Saicharan