Data Engineering

Autoloader not ingesting all file data into Delta Table from Azure Blob Container

KristiLogos
New Contributor
I have done the following, i.e., created a Delta table where I plan to load the .json.gz files from the Azure Blob container:
 
# Infer a schema from the existing .json.gz files in the container
df = spark.read.option("multiline", "true").json(f"{container_location}/*.json.gz")
 
from delta.tables import DeltaTable
 
# Create the target Delta table using the inferred schema
DeltaTable.create(spark) \
    .addColumns(df.schema) \
    .property("delta.minReaderVersion", "2") \
    .property("delta.minWriterVersion", "5") \
    .property("delta.columnMapping.mode", "name") \
    .tableName('tablename') \
    .execute()
 
Then I set up the Autoloader:
df_autoloader = (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.resourceGroup", "resourcename")
                 .option("cloudFiles.subscriptionId", "12345")
                 .option("cloudFiles.tenantId", "12345")
                 .option("cloudFiles.clientId", "12345")
                 .option("cloudFiles.clientSecret", "12345")
                 .option("cloudFiles.format", "json")
                 .option("multiline", "true")  
                 .option("cloudFiles.useNotifications", "true")  
                 .schema(schema)  
                 .load(AMP_LOC)  # path to the Blob container
)
 
(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)  
    .table("tablename")
)
 
I see things happening in the cell, but when I go to query the table I only see 200 rows of data when there should be millions.
2 REPLIES

filipniziol
New Contributor III

Hi @KristiLogos ,

First, try adding .trigger(availableNow=True). This ensures all available data is processed before the query stops.

Without this option, per the documentation, the query runs as fast as possible, which is equivalent to setting the trigger to processingTime='0 seconds'.

When you're running the streaming query in a notebook where the cell execution might terminate before all data is processed, the query may not have enough time to ingest all your files. This could result in only a fraction of your data (e.g., 200 rows) being written to your Delta table.

 

(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(availableNow=True)  # process all files available now, then stop
    .table("tablename")
)

 

Check this setting and let us know if it works. 
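
After the availableNow run completes, a quick way to verify the ingested volume (a small check using the table name from the post) is:

# Sanity check: compare this count with the expected number of records
spark.table("tablename").count()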

gchandra
Contributor III

If it's streaming data, space it out with a 10-second trigger:

.trigger(processingTime="10 seconds")
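
For example, applied to the writeStream from the original post (a sketch reusing the same names; only the trigger line is new):

(df_autoloader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(processingTime="10 seconds")  # start a micro-batch at most every 10 seconds
    .table("tablename")
)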

 

Do all the JSON files have the same schema? Since your table creation is dynamic (df.schema), files whose schema doesn't match the inferred one may be skipped. One way to spot-check this is sketched below.
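
A minimal sketch for that spot check, assuming container_location is the same path variable used in the post (this snippet is not from the original replies):

# Infer the schema of a handful of files individually and compare them
sample_files = [f.path for f in dbutils.fs.ls(container_location)
                if f.path.endswith(".json.gz")][:5]

for path in sample_files:
    s = spark.read.option("multiline", "true").json(path).schema
    print(path, s.simpleString())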


