Autoloader not ingesting all file data into Delta Table from Azure Blob Container
09-25-2024 09:18 AM
- Labels: Spark
09-25-2024 11:55 PM - edited 09-25-2024 11:56 PM
Hi @KristiLogos ,
First, try adding .trigger(availableNow=True). This ensures all available data is processed before the query stops.
Without this option, per the documentation, the query runs micro-batches as fast as possible, which is equivalent to setting the trigger to processingTime='0 seconds'.
When you're running the streaming query in a notebook where the cell execution might terminate before all data is processed, the query may not have enough time to ingest all your files. This could result in only a fraction of your data (e.g., 200 rows) being written to your Delta table.
(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(availableNow=True)
    .toTable("tablename"))
Check this setting and let us know if it works.
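For reference, here is a minimal end-to-end sketch of the pattern. The container path, schema location, and table name are placeholders (not taken from your post), and checkpoint_dir is assumed to be defined as in your snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read JSON files with Auto Loader; schemaLocation lets it persist the inferred schema between runs.
df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "abfss://<container>@<account>.dfs.core.windows.net/_schemas")
    .load("abfss://<container>@<account>.dfs.core.windows.net/raw/"))

# availableNow processes every file discovered so far, then stops the query.
query = (df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(availableNow=True)
    .toTable("tablename"))

# Block the notebook cell until the backlog is fully ingested,
# so the cell doesn't return before all files are written.
query.awaitTermination()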
09-26-2024 03:26 AM
If it's continuously arriving streaming data, space the micro-batches out with a 10-second trigger:
.trigger(processingTime="10 seconds")
Do all the JSON files have the same schema? Since your table creation is dynamic (df.schema), files whose schema doesn't match the inferred one may be skipped.
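If you want to verify this rather than guess, here's a sketch (schema_dir and source_dir below are placeholder variables): running Auto Loader with schemaEvolutionMode set to "rescue" keeps fields that don't fit the inferred schema in the _rescued_data column instead of dropping them, so you can see exactly which files deviate.

# Capture non-conforming fields instead of silently losing them.
df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_dir)      # placeholder path for the tracked schema
    .option("cloudFiles.schemaEvolutionMode", "rescue")   # mismatched fields go to _rescued_data
    .load(source_dir))                                    # placeholder source path

# After the stream finishes, inspect rows whose fields didn't match the schema:
spark.read.table("tablename").where("_rescued_data IS NOT NULL").show(truncate=False)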

