12-01-2023 03:05 AM
As per usual, a random YouTuber with no Azure or Databricks affiliation has to step in and tell us what to do:
Don't use the Pandas method if you want to write to an ABFSS endpoint, as it's not supported in Databricks. It can also cause memory overload issues, because it runs on one worker instead of distributing the work.
Essentially, you need to land the output in a temp folder and then loop through the files in it. Find your target file, rename it from the unhelpful system-generated name to what you actually want it to be called, and use dbutils.fs.cp to copy it to the folder you actually want to save it in. Then delete all the Databricks-generated fluff that you don't need.
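For anyone landing here, the steps above can be sketched roughly as follows. This is a minimal sketch rather than an official recipe: `publish_single_file` is a hypothetical helper name, the paths in the comments are placeholders, and it assumes the output was written as a single CSV part file (for example via `coalesce(1)`), which may not match your case. The `fs` argument is expected to be `dbutils.fs`, which only exists inside a Databricks notebook or job.

```python
def publish_single_file(fs, tmp_dir, target_path):
    """Copy the one Spark part file in tmp_dir to target_path, then delete tmp_dir.

    fs is expected to behave like dbutils.fs (ls, cp, rm).
    publish_single_file is a hypothetical helper, not a Databricks API.
    """
    # Spark names its data files part-...; the _SUCCESS / _committed_* /
    # _started_* entries are bookkeeping files and are skipped here.
    part_files = [f for f in fs.ls(tmp_dir) if f.name.startswith("part-")]

    # Assumption: exactly one part file, i.e. the DataFrame was written
    # with a single partition.
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {tmp_dir}, found {len(part_files)}"
        )

    # Copying to the full target path is what gives the file its real name.
    fs.cp(part_files[0].path, target_path)

    # Remove the temp folder and everything in it (second argument = recurse).
    fs.rm(tmp_dir, True)
    return target_path


# Usage inside a Databricks notebook, where `df` and `dbutils` already exist
# (placeholder paths, adjust to your own container and account):
#
#   tmp_dir = "abfss://<container>@<account>.dfs.core.windows.net/tmp/report_tmp"
#   target = "abfss://<container>@<account>.dfs.core.windows.net/reports/report.csv"
#   df.coalesce(1).write.mode("overwrite").option("header", True).csv(tmp_dir)
#   publish_single_file(dbutils.fs, tmp_dir, target)
```

Note that `coalesce(1)` pushes all the data through a single task, so this is only sensible for outputs that comfortably fit on one worker.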
TBH, this is a surprisingly involved process for such a bread-and-butter engineering task. We have analysts in our business who aren't necessarily well versed in PySpark and who are frankly going to really struggle with this. Having to scour the internet for a random YouTuber to find the resolution is disappointing; Azure and Databricks should provide clear instructions if this is how their (expensive) systems work.