Hi, Option 2 should be avoided. The real decision is between Option 1 (simpler) and Option 3 (best practice). Why Option 2 is a no-go: This violates separation of concerns: it mixes the governance layer (catalog storage) with the data layer (bronze). Harder to m...
Hi, the extra rows could have several causes: extra files in the directory, empty or corrupt records, or non-JSON content being picked up on the first run. You could make sure that your input path contains only valid JSON files, or you could mod...
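As a rough way to check the input path before loading, here is a minimal sketch in plain Python. It assumes line-delimited JSON (one object per line, which is what Spark's JSON reader expects by default) and a directory reachable through the local file system. The function name `find_invalid_json_files` is hypothetical, not part of any library.

```python
import json
from pathlib import Path
from typing import Dict


def find_invalid_json_files(input_dir: str) -> Dict[str, str]:
    """Return {file name: reason} for files that could produce unexpected rows.

    Assumption: files are line-delimited JSON (one JSON value per line).
    """
    problems: Dict[str, str] = {}
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        # Anything without a .json extension is treated as an extra file.
        if path.suffix != ".json":
            problems[path.name] = "not a .json file"
            continue
        # Undecodable bytes are replaced so that binary content fails JSON parsing
        # instead of raising a decoding error.
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [line for line in text.splitlines() if line.strip()]
        if not lines:
            problems[path.name] = "empty file"
            continue
        # The reported number counts non-empty lines only.
        for number, line in enumerate(lines, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError:
                problems[path.name] = f"invalid JSON on non-empty line {number}"
                break
    return problems
```

An empty result means every file in the directory parsed cleanly. Anything reported is a candidate for the extra rows and can be moved out of the input path before rerunning the load.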
Remove the records using the DELETE operation in both the Bronze and Silver tables. After each delete step, you can run OPTIMIZE on the table, which rewrites the Parquet files for that table behind the scenes to improve the data layout (Read more about optimize h...
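The delete-then-optimize sequence could be sketched like this. It is only an illustration: the helper name `delete_and_optimize`, the table names, and the predicate in the usage comment are hypothetical, and `spark` is assumed to be the SparkSession that Databricks notebooks provide.

```python
def delete_and_optimize(spark, tables, predicate):
    """Run DELETE and then OPTIMIZE on each table, in the order given.

    `spark` is any object with a `sql(str)` method, e.g. a SparkSession.
    `tables` and `predicate` are interpolated straight into SQL, so they
    must come from trusted code, never from user input.
    Returns the list of statements that were executed.
    """
    executed = []
    for table in tables:
        for statement in (
            f"DELETE FROM {table} WHERE {predicate}",
            f"OPTIMIZE {table}",
        ):
            spark.sql(statement)
            executed.append(statement)
    return executed


# Hypothetical usage in a notebook (names are placeholders):
# delete_and_optimize(
#     spark,
#     ["my_catalog.bronze.events", "my_catalog.silver.events"],
#     "customer_id = 42",
# )
```

Running OPTIMIZE right after each DELETE keeps the order described above: the records are removed first, then the remaining data is compacted.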
Delta Lake creates a new table version, usually with new Parquet files, whenever an operation changes the data. For better performance, you can run OPTIMIZE on the table, which rewrites the Parquet files for that table behind the scenes to improve the data layo...
First, you can read the ZIP file in binary format with spark.read.format("binaryFile"), then use Python's zipfile module to extract all the files from the archive and store them in a Volume.
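A minimal sketch of those two steps follows. The helper names `extract_zip_bytes` and `unzip_to_volume` are hypothetical, and so is any Volume path you pass in. It also assumes the archives are small enough to collect to the driver, since `collect()` pulls the full binary content into driver memory.

```python
import io
import os
import zipfile
from typing import List


def extract_zip_bytes(content: bytes, target_dir: str) -> List[str]:
    """Extract an in-memory ZIP archive into target_dir.

    Returns the names of the extracted files (directory entries are skipped).
    """
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(target_dir)
        return [name for name in archive.namelist() if not name.endswith("/")]


def unzip_to_volume(spark, zip_path: str, volume_dir: str) -> List[str]:
    """Read ZIP files as binary rows with Spark, then unpack them on the driver.

    `spark` is assumed to be an active SparkSession; `volume_dir` is assumed
    to be a Volume path that the driver can write to like a local directory.
    """
    rows = (
        spark.read.format("binaryFile")
        .load(zip_path)
        .select("path", "content")
        .collect()
    )
    extracted: List[str] = []
    for row in rows:
        # The binaryFile source exposes the raw file bytes in the `content` column.
        extracted.extend(extract_zip_bytes(bytes(row["content"]), volume_dir))
    return extracted
```

The extraction logic is kept in its own function so it can be tried on any bytes object without a cluster. Only `unzip_to_volume` needs Spark.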