Autoloader not ingesting all file data into Delta Table from Azure Blob Container
09-25-2024 09:18 AM
I have done the following, i.e. created a Delta table into which I plan to load the .json.gz files from the Azure Blob container:
df = spark.read.option("multiline", "true").json(f"{container_location}/*.json.gz")
DeltaTable.create(spark) \
    .addColumns(df.schema) \
    .property("delta.minReaderVersion", "2") \
    .property("delta.minWriterVersion", "5") \
    .property("delta.columnMapping.mode", "name") \
    .tableName('tablename') \
    .execute()
Then I set up the Auto Loader:
df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.resourceGroup", "resourcename")
    .option("cloudFiles.subscriptionId", "12345")
    .option("cloudFiles.tenantId", "12345")
    .option("cloudFiles.clientId", "12345")
    .option("cloudFiles.clientSecret", "12345")
    .option("cloudFiles.format", "json")
    .option("multiline", "true")
    .option("cloudFiles.useNotifications", "true")
    .schema(schema)
    .load(AMP_LOC)  # path to Blob
)
(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .table("tablename")
)
I see things happening in the cell, but when I query the table I see only 200 rows of data, when there should be millions.