<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader not ingesting all file data into Delta Table from Azure Blob Container in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91813#M38271</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/122107"&gt;@KristiLogos&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;First, try adding .trigger(availableNow=True). This ensures all available data is processed before the query stops.&lt;/P&gt;&lt;P&gt;Without this option, &lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.trigger.html" target="_self"&gt;per the documentation&lt;/A&gt;, &lt;EM&gt;it will run the query as fast as possible, which is equivalent to setting the trigger to processingTime='0 seconds'&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;When you run the streaming query in a notebook, the cell execution may terminate before all data is processed, so the query may not have enough time to ingest all your files. This can result in only a fraction of your data (e.g., 200 rows) being written to your Delta table.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(availableNow=True)
    .table("tablename")
)&lt;/LI-CODE&gt;&lt;P&gt;Check this setting and let us know if it works.&lt;/P&gt;</description>
    <pubDate>Thu, 26 Sep 2024 06:56:09 GMT</pubDate>
    <dc:creator>filipniziol</dc:creator>
    <dc:date>2024-09-26T06:56:09Z</dc:date>
    <item>
      <title>Autoloader not ingesting all file data into Delta Table from Azure Blob Container</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91751#M38256</link>
      <description>&lt;P&gt;I have done the following, i.e. created a Delta Table where I plan to load the Azure Blob Container files that are .json.gz files:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.option("multiline", "true").json(f"{container_location}/*.json.gz")

DeltaTable.create(spark) \
    .addColumns(df.schema) \
    .property("delta.minReaderVersion", "2") \
    .property("delta.minWriterVersion", "5") \
    .property("delta.columnMapping.mode", "name") \
    .tableName('tablename') \
    .execute()&lt;/LI-CODE&gt;&lt;P&gt;Then I set up the Autoloader:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.resourceGroup", "resourcename")
    .option("cloudFiles.subscriptionId", "12345")
    .option("cloudFiles.tenantId", "12345")
    .option("cloudFiles.clientId", "12345")
    .option("cloudFiles.clientSecret", "12345")
    .option("cloudFiles.format", "json")
    .option("multiline", "true")
    .option("cloudFiles.useNotifications", "true")
    .schema(schema)
    .load(AMP_LOC)  # path to Blob
)

(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .table("tablename")
)&lt;/LI-CODE&gt;&lt;P&gt;I see things happening in the cell, but when I go to query the table I only see 200 rows of data, when there should be millions.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Sep 2024 16:18:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91751#M38256</guid>
      <dc:creator>KristiLogos</dc:creator>
      <dc:date>2024-09-25T16:18:10Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader not ingesting all file data into Delta Table from Azure Blob Container</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91813#M38271</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/122107"&gt;@KristiLogos&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;First, try adding .trigger(availableNow=True). This ensures all available data is processed before the query stops.&lt;/P&gt;&lt;P&gt;Without this option, &lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.trigger.html" target="_self"&gt;per the documentation&lt;/A&gt;, &lt;EM&gt;it will run the query as fast as possible, which is equivalent to setting the trigger to processingTime='0 seconds'&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;When you run the streaming query in a notebook, the cell execution may terminate before all data is processed, so the query may not have enough time to ingest all your files. This can result in only a fraction of your data (e.g., 200 rows) being written to your Delta table.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(availableNow=True)
    .table("tablename")
)&lt;/LI-CODE&gt;&lt;P&gt;Check this setting and let us know if it works.&lt;/P&gt;
      <pubDate>Thu, 26 Sep 2024 06:56:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91813#M38271</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-26T06:56:09Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader not ingesting all file data into Delta Table from Azure Blob Container</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91864#M38279</link>
      <description>&lt;P&gt;If it's streaming data, space it out with a 10-second trigger:&lt;/P&gt;
&lt;PRE&gt;.trigger(processingTime="10 seconds")&lt;/PRE&gt;
&lt;P&gt;Do all the JSON files have the same schema? Since your table creation is dynamic (df.schema), files whose schema doesn't match the table may be skipped.&lt;/P&gt;
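&lt;P&gt;A sketch of one way to surface such mismatches, using Auto Loader's rescue mode (option names as in the Databricks Auto Loader docs; treat the details as an assumption and adjust to your runtime). Fields that don't fit the declared schema are kept in a _rescued_data column instead of being dropped, so mismatched files stay visible:&lt;/P&gt;
&lt;PRE&gt;# Sketch: capture schema mismatches instead of silently losing them.
df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Never evolve the schema; record non-conforming fields in the
    # rescued data column instead of dropping them.
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    # With an explicit .schema(...), include _rescued_data in that
    # schema so the column is kept in the output.
    .schema(schema)
    .load(AMP_LOC)
)

# After ingestion, count the rows that carried mismatched fields:
# spark.table("tablename").where("_rescued_data IS NOT NULL").count()&lt;/PRE&gt;</description>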
      <pubDate>Thu, 26 Sep 2024 10:26:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-not-ingesting-all-file-data-into-delta-table-from/m-p/91864#M38279</guid>
      <dc:creator>gchandra</dc:creator>
      <dc:date>2024-09-26T10:26:25Z</dc:date>
    </item>
  </channel>
</rss>

