01-18-2022 09:54 PM
I’m using YakeKeywordExtraction from Spark NLP to extract keywords, and I’m facing an issue saving the result (a Spark DataFrame) to ADLS Gen1 Delta tables from Azure Databricks. The DataFrame contains strings nested in a struct schema, which I flatten by exploding it and extracting the required fields. The final DataFrame has at most 20 rows and 7 columns, and the notebook's computation takes about 10 minutes. But once the final DataFrame is ready, saving the extracted data to any target (ADLS, a database, toPandas, CSV) takes close to 55 hours. I have tried to reduce this time with the optimization techniques suggested in various forums and communities, such as enabling execution.arrow.pyspark and using RDDs, but nothing worked.
Code to Explode results:
scores = result \
    .selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples") \
    .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")
Code to write to ADLS:
scores.write.format("delta").save("path/to/adls/folder/result")
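To make the shape of the data concrete, here is a minimal plain-Python sketch of what the explode step above does to a single row. It is only an illustration, not Spark code: the function name `explode_keywords` and the sample keywords and scores are made up, and it assumes each entry in `keywords` carries a `result` string and a `metadata` map with a `score` entry, as the selectExpr calls imply.

```python
def explode_keywords(row):
    """Mimic explode(arrays_zip(keywords.result, keywords.metadata)) for one row.

    Hypothetical helper for illustration only; not part of Spark or Spark NLP.
    """
    # keywords.result and keywords.metadata: one array per field
    results = [annotation["result"] for annotation in row["keywords"]]
    metadata = [annotation["metadata"] for annotation in row["keywords"]]

    exploded = []
    # arrays_zip pairs elements by position; explode emits one output row per pair
    for keyword, meta in zip(results, metadata):
        # resultTuples['1'].score: look up "score" in the metadata map (None if absent)
        exploded.append({"keyword": keyword, "score": meta.get("score")})
    return exploded


# Made-up sample row; real scores come from the keyword extractor.
sample_row = {
    "keywords": [
        {"result": "spark nlp", "metadata": {"score": "0.12"}},
        {"result": "delta table", "metadata": {"score": "0.34"}},
    ]
}

rows = explode_keywords(sample_row)
for r in rows:
    print(r["keyword"], r["score"])
```

So one input row with N extracted keywords becomes N output rows, each with a keyword column and a score column.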