Autoloader not ingesting all file data into Delta Table from Azure Blob Container

KristiLogos · ‎09-25-2024

I created a Delta table into which I plan to load the .json.gz files from an Azure Blob container:

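from delta.tables import DeltaTable

# Infer the schema from the existing gzipped JSON files
df = spark.read.option("multiline", "true").json(f"{container_location}/*.json.gz")

# Create an empty Delta table with that schema and column mapping by name
(DeltaTable.create(spark)
    .addColumns(df.schema)
    .property("delta.minReaderVersion", "2")
    .property("delta.minWriterVersion", "5")
    .property("delta.columnMapping.mode", "name")
    .tableName("tablename")
    .execute())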

Then I set up Autoloader:

df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.resourceGroup", "resourcename")
    .option("cloudFiles.subscriptionId", "12345")
    .option("cloudFiles.tenantId", "12345")
    .option("cloudFiles.clientId", "12345")
    .option("cloudFiles.clientSecret", "12345")
    .option("cloudFiles.format", "json")
    .option("multiline", "true")
    .option("cloudFiles.useNotifications", "true")
    .schema(schema)  # schema is defined elsewhere; presumably df.schema from the read above
    .load(AMP_LOC)   # path to Blob
)

(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .table("tablename")
)

I can see activity in the cell, but when I query the table I see only 200 rows, when there should be millions.
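One way to narrow this down is to check how many records a single sample file really contains. The sketch below is plain Python with no Spark involved, and it rests on an assumption the post does not confirm: that the files might be newline-delimited JSON (one object per line) rather than one JSON document per file. As far as I understand, with "multiline" set to "true" Spark parses each file as a single document, so a JSON Lines file could yield only one row per file. The helper name count_json_records is hypothetical, and the file would need to be downloaded locally first.

```python
import gzip
import json


def count_json_records(path):
    """Count records in a .json.gz file under two interpretations.

    Returns (as_document, as_lines):
      as_document: record count if the whole file is one JSON document
                   (list length, or 1 for a single object), else None.
      as_lines:    record count if every non-empty line is its own JSON
                   value (JSON Lines), else None.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        text = fh.read()

    # Interpretation 1: the file is a single JSON document.
    try:
        doc = json.loads(text)
        as_document = len(doc) if isinstance(doc, list) else 1
    except json.JSONDecodeError:
        as_document = None

    # Interpretation 2: one JSON value per non-empty line.
    as_lines = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            as_lines = None  # at least one line is not standalone JSON
            break
        as_lines += 1

    return as_document, as_lines
```

If a sample file returns None for the first value and a large number for the second, the files are JSON Lines, and reading them without the "multiline" option may be worth trying. It would also be worth checking whether the 200 rows match the number of files in the container. This is only a diagnostic; it does not prove that the option is the cause here.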
