Sunday
Problem I am trying to solve:
Bronze is the landing zone for immutable, raw data.
At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances speed and size.
Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less to store, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.
Question:
# Step 1: Read raw data (CSV/JSON/Avro - update format as needed)
raw_df = (
    spark.read.format("csv")                  # Change to "json" / "avro" if source differs
    .option("header", "true")                 # Use header if CSV
    .option("inferSchema", "true")            # Infers schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")      # Path to raw 100TB dataset
)

# Step 2: Write into Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")          # Lightweight compression for Bronze
    .mode("overwrite")                        # Overwrite Bronze zone if rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")   # Bronze layer storage path
)
Sunday
hey @ManojkMohan
Great questions!
On the first question around getting large example datasets, I'd recommend the following places:
At least the last time I had a dig into these, it was possible to get very large datasets from them. The other shout could be government open data portals.
On the second question, your direction is correct, but at 100TB I'd really recommend not inferring the schema if possible (this is very computationally expensive on data of that size). You should also consider not reading the entire 100TB at once, and instead break it into smaller incremental ingests. I would recommend taking a look at Auto Loader, which processes files incrementally and keeps track of what it has already ingested. This will be much more robust and reliable if it's possible to use in your case.
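For reference, a minimal Auto Loader sketch (the raw path comes from your snippet; the schema, checkpoint and target locations below are placeholders, so adjust them to your layout):

raw_stream = (
    spark.readStream.format("cloudFiles")                  # Auto Loader source
    .option("cloudFiles.format", "csv")                    # Source file format
    .option("header", "true")
    .option("cloudFiles.schemaLocation",
            "dbfs:/mnt/bronze/_schema/huge_dataset/")      # Persists the inferred schema instead of re-inferring each run
    .load("dbfs:/mnt/raw/huge_dataset/")
)

(
    raw_stream.writeStream.format("delta")                 # Or "parquet" if you want to keep Bronze as plain Parquet
    .option("checkpointLocation",
            "dbfs:/mnt/bronze/_checkpoint/huge_dataset/")  # Tracks which files have already been ingested
    .trigger(availableNow=True)                            # Process everything currently available, then stop
    .start("dbfs:/mnt/bronze/huge_dataset/")
)

Each run only picks up files that haven't been ingested yet, so the 100TB gets worked through incrementally rather than in one massive job.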
yesterday
To add to this:
writing this data to parquet is not the issue here.
Just make sure the CSV files are stored on HDFS-compatible storage.
The hard part is making queries on this Parquet data reasonably fast. So you will probably need some performance tuning, and besides bucketing/partitioning you are kinda limited with plain Parquet.
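For example, write-time partitioning is about the only pruning you get with plain Parquet; a minimal sketch, where "event_date" is a hypothetical column (pick a low-cardinality column your queries actually filter on):

(
    raw_df.write.format("parquet")
    .option("compression", "snappy")
    .partitionBy("event_date")                 # One directory per value; queries filtering on it skip the rest
    .mode("overwrite")
    .save("dbfs:/mnt/bronze/huge_dataset/")
)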
Delta lake/Iceberg or Databricks managed tables with predictive optimization might be better choices here.
There is also Parquet v2, which is not enabled by default; you might want to look into it (it has better compression). I have not yet tested it myself, though.
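If you want to experiment with it, the setting lives in the underlying parquet-mr writer and is passed through the Hadoop configuration. Treat this as an untested sketch (the output path is a placeholder) and verify the resulting files before relying on it:

# Ask the Parquet writer to use the v2 data page format (default is v1)
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

raw_df.write.format("parquet").option("compression", "snappy").save("dbfs:/mnt/bronze/huge_dataset_v2/")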
yesterday - last edited yesterday
Hi @ManojkMohan ,
To jump into the conversation, is there any particular reason why you don't want to load that CSV into Delta format? Delta has multiple advantages over regular Parquet.
Things like file skipping and predicate pushdown filtering are much more performant on Delta. With Delta you can apply Z-ordering, liquid clustering, etc., and Databricks can do some cool things for you, like predictive optimization, if you use the Delta format.
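For comparison, landing the same DataFrame as a Delta table is a small change; a minimal sketch, where "bronze.huge_dataset" is a placeholder table name:

(
    raw_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.huge_dataset")        # Registers a Delta table you can later OPTIMIZE, Z-ORDER or liquid-cluster
)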
yesterday
My recommendation is to use Delta tables with liquid clustering and separate cloud storage for each of the bronze, silver and gold layers in your medallion architecture. Besides that, schedule a periodic job to OPTIMIZE the Delta tables and VACUUM "obsolete" parquet files. From there, check how performance looks and fine-tune.
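A rough sketch of that setup (table name, columns and retention are placeholders; the maintenance statements would run from a scheduled job):

# One-off table definition with liquid clustering
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.huge_dataset (
        id BIGINT,
        event_date DATE,
        payload STRING
    )
    USING DELTA
    CLUSTER BY (event_date)
""")

# Periodic maintenance job
spark.sql("OPTIMIZE bronze.huge_dataset")                  # Compacts small files / reclusters
spark.sql("VACUUM bronze.huge_dataset RETAIN 168 HOURS")   # Removes obsolete parquet files (7-day retention)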
yesterday
@Coffee77 I'm curious why you'd want separate cloud storage for each layer in the medallion architecture? I'd have thought that overcomplicates things. What exactly do you mean by "separate cloud storage"? I think I'm misunderstanding 🤔
All the best,
BS
yesterday
It is a suggestion based on the chosen architecture in the main Databricks application I am currently working on. Let me explain in detail:
- One workspace per environment, all of them governed by Unity Catalog with two metastores.
- Each workspace contains three different catalogs, to cover "bronze", "silver" and "gold" layers. All of them using delta tables.
- Discussing with our main cloud architect (a Microsoft MVP), we came to the conclusion that placing each catalog in a dedicated Azure Data Lake Storage account (ADLS Gen2) would be much better for performance than having only one ADLS Gen2 account with separate containers per environment or similar. In the latter case, throughput would be shared by all layers of the medallion architecture. So, with dedicated accounts per layer, I/O performance is much better.
- So, each layer has a related catalog and each catalog is physically placed in a different Azure Storage account (rough sketch below).
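Roughly, the wiring per layer looks like this (external location, credential and storage account names are placeholders):

# Register the bronze storage account and bind a catalog to it
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS bronze_location
    URL 'abfss://bronze@mybronzestorage.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL my_credential)
""")

spark.sql("""
    CREATE CATALOG IF NOT EXISTS bronze
    MANAGED LOCATION 'abfss://bronze@mybronzestorage.dfs.core.windows.net/'
""")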
I hope this helps.
yesterday - last edited yesterday
@Coffee77 From the docs, I thought it was 1 metastore per region and 1 Unity Catalog mapped to that: https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#meta...
You then have as many workspaces as needed within that region. Each workspace uses the three-level namespace: catalog -> schema -> table. You can have as many catalogs (the top-level namespace) as you require. You can host your bronze, silver and gold within a single workspace, e.g. a workspace called "dev". Perhaps you guys already knew about this region constraint?
I think Unity Catalog abstracts the storage part away from us. Essentially, it sits on top of a single ADLS Gen2 account, as you mention, and Databricks nicely handles the storage (i.e. the containers) under the hood for us.
I think the time you'd need to consider additional ADLS accounts is when you exceed limits like these:
https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#mana...
That is a really interesting point and one I wouldn't have considered without you raising it. I do fear it's easy to overcomplicate a design, though. Thanks for bringing this type of thing to my attention @Coffee77
I must say, I'm pleasantly surprised to find these finer details in the documentation provided by Databricks.
All the best,
BS
yesterday
I work in a very large organization that has a special agreement with Databricks to be able to create multiple metastores per region. This is cool, as it means we don't have to put all of our heterogeneous applications in the same metastore.
On the other hand, we have one metastore managing dev, qa, stage and prod. As said, each environment has three catalogs placed in dedicated storage accounts, with Unity Catalog managing everything (connections, etc.). The other metastore, in a different region, is for DR (Disaster Recovery) cross-region purposes, with an additional BCP (Business Continuity Plan) environment...
Security and performance requirements are very strict in our case, so we need to consider every possible technique!
yesterday
@Coffee77 that's awesome, I didn't even know that was possible. Every day is a school day 🤓. That setup looks exciting...😂 Thanks for sharing! 👏.
All the best,
BS
yesterday
Same for me, lifelong learner 🙂
yesterday
Thanks to all for your suggestions. I am trying the optimal next steps based on these responses and will post an update here with screenshots soon.