Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Ingesting 100 TB of raw CSV data into the Bronze layer as Parquet + Snappy

ManojkMohan
Contributor III

Problem I am trying to solve:

Bronze is the landing zone for immutable, raw data.

At this stage, I am trying to use a columnar format (Parquet or ORC) for good compression and efficient scans, and then apply lightweight compression (e.g., Snappy) to balance speed and size.

Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.

Question:

  1. Are there any sources, like Kaggle, where I can get a good-quality ~100 TB dataset (a combination of unstructured, semi-structured, and structured data)?
  2. Is reading the 100 TB of data as shown below the recommended best practice?

 

# Step 1: Read raw data (CSV/JSON/Avro; update the format as needed)
raw_df = (
    spark.read.format("csv")                 # Change to "json" / "avro" if the source differs
    .option("header", "true")                # Use the header row if CSV
    .option("inferSchema", "true")           # Infers the schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")     # Path to the raw 100 TB dataset
)

# Step 2: Write into the Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")         # Lightweight compression for Bronze
    .mode("overwrite")                       # Overwrite the Bronze zone on rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")  # Bronze layer storage path
)

11 REPLIES

TheOC
Contributor

hey @ManojkMohan 
Great questions!

On the first question around getting large example datasets, I'd recommend the following places:

  1. AWS Registry of Open data (https://registry.opendata.aws/)
  2. Google Cloud BigQuery Public Datasets (https://cloud.google.com/bigquery/public-data)

At least the last time I had a dig into these, it was possible to get very large datasets from them. Another option could be government open data portals.

On the second question, your direction is correct, but at 100 TB I'd really recommend not inferring the schema if possible (this is very computationally expensive on data of that size). You should also consider not trying to read the entire 100 TB at once, and instead break it into smaller incremental ingests. I would recommend taking a look at Auto Loader, which processes files incrementally and keeps track of what it has already ingested. This will be much more robust and reliable if it's possible to use in your use case.
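
For illustration, here is a rough sketch of what an Auto Loader ingest with an explicit schema might look like. The schema fields, paths, and checkpoint location are placeholder assumptions, not something from your original post:

# Sketch: incremental Bronze ingest with Auto Loader instead of a one-shot batch read.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# An explicit schema avoids the cost of inferSchema over 100 TB of CSV (fields are hypothetical)
raw_schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_time", TimestampType(), True),
    StructField("value", DoubleType(), True),
])

bronze_stream = (
    spark.readStream.format("cloudFiles")        # Auto Loader
    .option("cloudFiles.format", "csv")          # Source file format
    .option("header", "true")                    # CSV header row
    .schema(raw_schema)                          # Explicit schema, no inference
    .load("dbfs:/mnt/raw/huge_dataset/")
)

(
    bronze_stream.writeStream
    .format("parquet")                           # Or "delta", as discussed further down the thread
    .option("checkpointLocation", "dbfs:/mnt/bronze/_checkpoints/huge_dataset")  # Tracks ingested files
    .option("compression", "snappy")             # Snappy is the Spark default for parquet anyway
    .trigger(availableNow=True)                  # Work through the backlog in batches, then stop
    .start("dbfs:/mnt/bronze/huge_dataset/")
)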

Cheers,
TheOC

-werners-
Esteemed Contributor III

To add on this:
writing this data to parquet is not the issue here.
Just make sure the CSV files are stored on HDFS-compatible storage.
The hard part is making queries on this parquet data reasonably fast. So you will probably need some performance tuning, and besides bucketing/partitioning you are kinda limited in parquet.
Delta lake/Iceberg or Databricks managed tables with predictive optimization might be better choices here.
There is also parquet v2, which is not enabled by default and which you might wanna look into (it has better compression). The latter, however, I have not yet tested myself.
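
To illustrate the partitioning point, here is a minimal sketch of a partitioned Bronze write. The partition column is a hypothetical example, and the parquet v2 writer option is an untested assumption based on the parquet-mr property name, so verify it before relying on it:

# Minimal sketch: partitioned parquet write for the Bronze layer.
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")
    .option("parquet.writer.version", "v2")  # Untested assumption: request the parquet v2 writer
    .partitionBy("ingest_date")              # Hypothetical column; lets queries prune partitions
    .mode("overwrite")
    .save("dbfs:/mnt/bronze/huge_dataset/")
)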

szymon_dybczak
Esteemed Contributor III

Hi @ManojkMohan ,

To jump into the conversation, is there any particular reason why you don't want to load that CSV into Delta format? Delta has multiple advantages over regular parquet.
Things like file skipping and predicate pushdown filtering are much more performant on Delta. On Delta you can apply z-ordering, liquid clustering, etc., and Databricks can do some cool things for you, like predictive optimization, if you use the Delta format.
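
For comparison with the parquet snippet in the question, here is a minimal sketch of the same Bronze write targeting Delta instead (the table name is a hypothetical placeholder):

# Minimal sketch: land the raw data as a Delta table instead of plain parquet.
(
    raw_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.huge_dataset")   # Hypothetical schema.table name; queryable by name afterwards
)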

Coffee77
New Contributor II

My recommendation is to use Delta tables with liquid clustering and separate cloud storage for each of the bronze, silver, and gold layers in your medallion architecture. Besides that, schedule a periodic job to optimize the Delta tables and vacuum "obsolete" parquet files. From there, check how performance looks and try to fine-tune.
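
To illustrate that maintenance step, here is a rough sketch using SQL from PySpark; the table name, clustering column, and retention window are hypothetical examples:

# Rough sketch: liquid clustering plus periodic maintenance for a Bronze Delta table.
# Define the table with liquid clustering (instead of fixed partitioning)
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.huge_dataset (
        id STRING,
        event_time TIMESTAMP,
        value DOUBLE
    )
    USING DELTA
    CLUSTER BY (event_time)
""")

# Body of a periodic maintenance job (e.g. a scheduled Databricks workflow)
spark.sql("OPTIMIZE bronze.huge_dataset")                  # Recluster and compact small files
spark.sql("VACUUM bronze.huge_dataset RETAIN 168 HOURS")   # Drop obsolete files older than 7 days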

https://www.youtube.com/@CafeConData

BS_THE_ANALYST
Honored Contributor III

@Coffee77 I'm curious why you'd want separate cloud storage for each layer in the medallion architecture? I'd have thought that overcomplicates things. What exactly do you mean by "separate cloud storage"? I think I'm misunderstanding 🤔

All the best,
BS

It is a suggestion based on the chosen architecture in the main Databricks application I am currently working on. Let me explain in detail:

- One workspace per environment, all of them governed by Unity Catalog with two metastores.

- Each workspace contains three different catalogs to cover the "bronze", "silver" and "gold" layers, all of them using Delta tables.

- Discussing with our main cloud architect (a Microsoft MVP), we came to the conclusion that placing each catalog in a dedicated Azure Data Lake Storage account (ADLS Gen2) would be much better for performance than having only one ADLS Gen2 account with separate containers per environment or similar. In the latter case, throughput would be shared by all layers of the medallion architecture. So, with dedicated storage per layer, I/O performance is much better.

- So, each layer has a related catalog, and each catalog is physically placed in a different Azure Storage account (see the sketch below).
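
To make that concrete, here is a hedged sketch of how a layer-specific catalog can be bound to its own storage account in Unity Catalog. The catalog names and abfss URLs are hypothetical placeholders, and this assumes the matching external locations and storage credentials already exist:

# Sketch: one catalog per medallion layer, each backed by its own ADLS Gen2 account.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS bronze_dev
    MANAGED LOCATION 'abfss://bronze@bronzedevstorage.dfs.core.windows.net/managed'
""")

spark.sql("""
    CREATE CATALOG IF NOT EXISTS silver_dev
    MANAGED LOCATION 'abfss://silver@silverdevstorage.dfs.core.windows.net/managed'
""")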

I hope this helps.


BS_THE_ANALYST
Honored Contributor III

@Coffee77 From the docs, I thought it was 1 metastore per region and 1 Unity Catalog mapped to that: https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#meta... 

You then have as many workspaces as needed within that region. Each workspace contains the 3-level namespace, catalog->schema->table. You can have as many catalogs (the top-level namespace) as you require. You can host your bronze, silver, and gold within a single workspace, i.e. a workspace called "dev". Perhaps you guys already know about this region constraint?

I think Unity Catalog abstracts the storage part away from us. Essentially, it sits on top of a single ADLS Gen2, as you mention. Databricks nicely handles the storage under the hood for us, i.e. the containers.
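
As a small illustration of that three-level namespace (the catalog, schema, and table names below are hypothetical):

# Sketch: bronze/silver/gold as catalogs inside one workspace, with storage managed by Unity Catalog.
# Without a MANAGED LOCATION, managed tables land in the metastore's (or catalog's) default storage.
spark.sql("CREATE CATALOG IF NOT EXISTS bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze.raw")

# Three-level name: catalog.schema.table
df = spark.table("bronze.raw.huge_dataset")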

I think the time you'd need to consider additional ADLS accounts is when you exceed limits like this:

(screenshot: limits table from the linked Databricks documentation)

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#mana... 

That is a really interesting point and one I wouldn't have considered without you raising it. I do fear it's easy to overcomplicate a design, though. Thanks for bringing this type of thing to my attention @Coffee77 

I must say, I'm pleasantly surprised to find these finer details in the documentation provided by Databricks.

All the best,
BS

I work in a very large organization that has a special agreement with Databricks to be able to create multiple metastores per region. This is nice because we don't have to put all of our heterogeneous applications in the same metastore.

On the other hand, we have one metastore managing dev, qa, stage and prod. As said, each environment has three catalogs placed in dedicated storage accounts, but Unity Catalog manages everything (connections, etc.). The other metastore, in a different region, is for cross-region DR (Disaster Recovery) purposes, with an additional BCP (Business Continuity Plan) environment...

Security and performance requirements are very strict in our case, so we have to consider every possible technique!


BS_THE_ANALYST
Honored Contributor III

@Coffee77 that's awesome, I didn't even know that was possible. Every day is a school day 🤓. That setup looks exciting...😂 Thanks for sharing! 👏

All the best,
BS

Same for me, lifelong learner 🙂


ManojkMohan
Contributor III

Thanks to all for your suggestions. I'm trying out the optimal next steps based on these responses and will post an update here with screenshots soon.