Databricks Auto Loader getting stuck while flattening JSON files in two similar scenarios

SudiptaBiswas
New Contributor III

I have a Databricks Auto Loader notebook that reads JSON files from an input location and writes a flattened version of each file to an output location. However, the notebook behaves differently in two similar scenarios, described below.
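The notebook follows the usual Auto Loader pattern; a simplified sketch of its shape is below (the paths, schema location, and one-level flatten helper are placeholders for illustration, not the actual code):

```python
from pyspark.sql import functions as F
from pyspark.sql import types as T

def flatten_one_level(df):
    # Pull one level of nested struct fields up to top-level columns.
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, T.StructType):
            for sub in field.dataType.fields:
                cols.append(F.col(field.name + "." + sub.name)
                             .alias(field.name + "_" + sub.name))
        else:
            cols.append(F.col(field.name))
    return df.select(cols)

# 'spark' is the SparkSession predefined in a Databricks notebook.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/input1")  # placeholder
      .load("/mnt/input-location-1"))                              # placeholder

query = (flatten_one_level(df).writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/input1")  # placeholder
         .start("/mnt/output-location-1"))                         # placeholder
```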


Auto Loader functionality: flattening of JSON files

Scenario 1:

Step a) At start, 'Input location 1' contains 3 non-zero-size JSON files (File1.json, File2.json, File3.json) and 24 zero-size JSON files, for a total of 27 JSON files.

The Auto Loader notebook is started. It reads the 3 non-zero-size files, flattens them correctly, and writes the results to 'Output location 1'.

Step b) While the notebook is still running, 3 more non-zero-size JSON files (File4.json, File5.json, File6.json) are added to 'Input location 1', bringing the total to 30 JSON files. The notebook reads the 3 additional files, flattens them correctly, and writes the results to 'Output location 1'.

'Output location 1' now contains records for all 6 non-zero-size files (File1.json through File6.json).

There is no problem with Scenario 1. The problem arises in step b of Scenario 2 (below), which is otherwise similar.

Scenario 2:

Step a) At start, 'Input location 2' contains 65 non-zero-size JSON files (File1.json, File2.json, ..., File65.json) and 24 zero-size JSON files, for a total of 89 JSON files.

The Auto Loader notebook is started. It reads the 65 non-zero-size files, flattens them correctly, and writes the results to 'Output location 2'.

Step b) While the notebook is still running, 3 more non-zero-size JSON files (File4.json, File5.json, File6.json) are added to 'Input location 2', bringing the total to 92 JSON files. The notebook reads the 3 additional files but never writes their flattened output to 'Output location 2'.

'Output location 2' contains records for the original 65 files (File1.json, ..., File65.json) but no records for the 3 files added in step b.

Question:

Can anyone provide any direction on this issue: why can the same Auto Loader notebook flatten and write the 3 additional files (File4.json, File5.json, File6.json) in step b of Scenario 1, but not the 3 additional files in step b of Scenario 2?
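If it helps anyone suggest a direction, one check I can run is to inspect which files the stream has recorded in its checkpoint, using the cloud_files_state table-valued function available in recent Databricks runtimes (the checkpoint path below is a placeholder):

```python
# List the files Auto Loader has discovered/committed, read directly from the
# stream's checkpoint (placeholder path).
spark.sql(
    "SELECT * FROM cloud_files_state('/mnt/checkpoints/input2')"
).show(truncate=False)
```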

Any help is appreciated.

Note: All non-zero-size JSON files are smaller than 25 KB. The Auto Loader notebook discovers input files in file-notification mode [option("cloudFiles.useNotifications", "true")].
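For reference, the discovery-related options are along these lines (placeholder values aside). I have not set cloudFiles.backfillInterval; as I understand from the Databricks docs, cloud notification services do not guarantee delivery of every event, and that option schedules periodic backfills so every file is eventually processed:

```python
# Discovery options on the stream (values are placeholders).
reader = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          # File-notification mode: discover new files from cloud event
          # notifications instead of repeatedly listing the directory.
          .option("cloudFiles.useNotifications", "true")
          # Not currently set in my notebook; per the Databricks docs this
          # schedules periodic backfills in case a notification is missed.
          # .option("cloudFiles.backfillInterval", "1 day")
          .option("cloudFiles.schemaLocation", "/mnt/schemas/input2"))
```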

Thanks,

Sudipta.

3 REPLIES

SudiptaBiswas
New Contributor III

Can anyone please provide any suggestions?

#Azure Databricks, #Databricks Autoloader, #Autoloader

jose_gonzalez
Moderator

Could you provide a code snippet? Also, do you see any errors in the driver logs?

SudiptaBiswas
New Contributor III

Thanks for your reply. I am sorry, but I cannot provide the code snippet.

If there had been any error (driver or otherwise), the Auto Loader notebook would have stopped, which didn't happen in this case. Please correct me if I am wrong.

The Auto Loader notebook continued running throughout step b of Scenario 2, from the time it was started in step a of Scenario 2.
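Assuming the handle from writeStream.start() is available as query in the notebook, these standard Structured Streaming calls are how I checked that the stream was alive and had not hit an exception:

```python
# Is the query alive, and what did the most recent micro-batch do?
print(query.status)        # e.g. {'message': 'Waiting for data to arrive', ...}
print(query.lastProgress)  # metrics for the last micro-batch, incl. numInputRows

# Returns None while the query is healthy; a StreamingQueryException if the
# query terminated with an error.
print(query.exception())
```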
