Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Ingesting 100 TB of raw CSV data into the Bronze layer as Parquet + Snappy

ManojkMohan
Contributor III

Problem I am trying to solve:

Bronze is the landing zone for immutable, raw data.

At this stage, I am trying to use a columnar format (Parquet or ORC) for good compression and efficient scans, and then apply lightweight compression (e.g., Snappy) to balance speed and size.

Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.

Question:

  1. Are there any sources, like Kaggle, where I can get a good-quality ~100 TB dataset (a combination of unstructured, semi-structured, and structured data)?
  2. Is reading the 100 TB of data as shown below the recommended best practice?

 

# Step 1: Read raw data (CSV/JSON/Avro; update the format as needed)
raw_df = (
    spark.read.format("csv")                 # Change to "json" / "avro" if the source differs
    .option("header", "true")                # Use the header row if CSV
    .option("inferSchema", "true")           # Infers the schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")     # Path to the raw 100 TB dataset
)

# Step 2: Write into the Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")         # Lightweight compression for Bronze
    .mode("overwrite")                       # Overwrite the Bronze zone on rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")  # Bronze layer storage path
)

11 REPLIES

TheOC
Contributor

hey @ManojkMohan 
Great questions!

On the first question around getting large example datasets, I'd recommend the following places:

  1. AWS Registry of Open data (https://registry.opendata.aws/)
  2. Google Cloud BigQuery Public Datasets (https://cloud.google.com/bigquery/public-data)

At least the last time I had a dig into these, it was possible to get very large datasets from them. Another option could be government open data portals.

On the second question, your direction is correct, but at 100 TB I'd really recommend not inferring the schema if possible (this is very computationally expensive on data of that size). You should also consider not trying to read the entire 100 TB at once, and instead break it into smaller incremental ingests. I would recommend taking a look at Auto Loader, which processes files incrementally and keeps track of what it has already ingested. This will be much more robust and reliable if it's possible to use in your use case.
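
For illustration, here is a rough sketch of what an Auto Loader ingest with an explicit schema might look like. The schema fields, paths, and checkpoint location are placeholder assumptions, not something from your original post:

# Sketch: incremental Bronze ingest with Auto Loader instead of a one-shot batch read.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# An explicit schema avoids the cost of inferSchema over 100 TB of CSV (fields are hypothetical)
raw_schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_time", TimestampType(), True),
    StructField("value", DoubleType(), True),
])

bronze_stream = (
    spark.readStream.format("cloudFiles")        # Auto Loader
    .option("cloudFiles.format", "csv")          # Source file format
    .option("header", "true")                    # CSV header row
    .schema(raw_schema)                          # Explicit schema, no inference
    .load("dbfs:/mnt/raw/huge_dataset/")
)

(
    bronze_stream.writeStream
    .format("parquet")                           # Or "delta", as discussed further down the thread
    .option("checkpointLocation", "dbfs:/mnt/bronze/_checkpoints/huge_dataset")  # Tracks ingested files
    .option("compression", "snappy")             # Snappy is the Spark default for parquet anyway
    .trigger(availableNow=True)                  # Work through the backlog in batches, then stop
    .start("dbfs:/mnt/bronze/huge_dataset/")
)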

Cheers,
TheOC

-werners-
Esteemed Contributor III

To add on this:
writing this data to parquet is not the issue here.
Just make sure the CSV files are stored on HDFS-compatible storage.
The hard part is making queries on this parquet data reasonably fast. So you will probably need some performance tuning, and besides bucketing/partitioning you are kinda limited in parquet.
Delta lake/Iceberg or Databricks managed tables with predictive optimization might be better choices here.
There is also parquet v2, which is not enabled by default and which you might wanna look into (it has better compression). The latter, however, I have not yet tested myself.
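
To illustrate the partitioning point, here is a minimal sketch of a partitioned Bronze write. The partition column is a hypothetical example, and the parquet v2 writer option is an untested assumption based on the parquet-mr property name, so verify it before relying on it:

# Minimal sketch: partitioned parquet write for the Bronze layer.
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")
    .option("parquet.writer.version", "v2")  # Untested assumption: request the parquet v2 writer
    .partitionBy("ingest_date")              # Hypothetical column; lets queries prune partitions
    .mode("overwrite")
    .save("dbfs:/mnt/bronze/huge_dataset/")
)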

szymon_dybczak
Esteemed Contributor III

Hi @ManojkMohan ,

To jump into the conversation, is there any particular reason why you don't want to load that CSV into Delta format? Delta has multiple advantages over regular parquet.
Things like file skipping and predicate pushdown filtering are much more performant on Delta. On Delta you can apply z-ordering, liquid clustering, etc., and Databricks can do some cool things for you, like predictive optimization, if you use the Delta format.
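
For comparison with the parquet snippet in the question, here is a minimal sketch of the same Bronze write targeting Delta instead (the table name is a hypothetical placeholder):

# Minimal sketch: land the raw data as a Delta table instead of plain parquet.
(
    raw_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.huge_dataset")   # Hypothetical schema.table name; queryable by name afterwards
)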

Coffee77
New Contributor II

My recommendation is to use Delta tables with liquid clustering and separate cloud storage for each of the bronze, silver, and gold layers in your medallion architecture. Besides that, schedule a periodic job to optimize the Delta tables and vacuum "obsolete" parquet files. From there, check how performance looks and try to fine-tune.
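
To illustrate that maintenance step, here is a rough sketch using SQL from PySpark; the table name, clustering column, and retention window are hypothetical examples:

# Rough sketch: liquid clustering plus periodic maintenance for a Bronze Delta table.
# Define the table with liquid clustering (instead of fixed partitioning)
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.huge_dataset (
        id STRING,
        event_time TIMESTAMP,
        value DOUBLE
    )
    USING DELTA
    CLUSTER BY (event_time)
""")

# Body of a periodic maintenance job (e.g. a scheduled Databricks workflow)
spark.sql("OPTIMIZE bronze.huge_dataset")                  # Recluster and compact small files
spark.sql("VACUUM bronze.huge_dataset RETAIN 168 HOURS")   # Drop obsolete files older than 7 days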

https://www.youtube.com/@CafeConData

BS_THE_ANALYST
Honored Contributor III

@Coffee77 I'm curious why you'd want separate cloud storage for each layer in the medallion architecture? I'd have thought that overcomplicates things. What exactly do you mean by "separate cloud storage"? I think I'm misunderstanding 🤔

All the best,
BS

It is a suggestion based on the chosen architecture in the main Databricks application I am currently working on. Let me explain in detail:

- One workspace per environment, all of them governed by Unity Catalog with two metastores.

- Each workspace contains three different catalogs to cover the "bronze", "silver" and "gold" layers, all of them using Delta tables.

- Discussing with our main cloud architect (a Microsoft MVP), we came to the conclusion that placing each catalog in a dedicated Azure Data Lake Storage account (ADLS Gen2) would be much better for performance than having only one ADLS Gen2 account with separate containers per environment or similar. In the latter case, throughput would be shared by all layers of the medallion architecture. So, with dedicated storage per layer, I/O performance is much better.

- So, each layer has a related catalog, and each catalog is physically placed in a different Azure Storage account (see the sketch below).
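
To make that concrete, here is a hedged sketch of how a layer-specific catalog can be bound to its own storage account in Unity Catalog. The catalog names and abfss URLs are hypothetical placeholders, and this assumes the matching external locations and storage credentials already exist:

# Sketch: one catalog per medallion layer, each backed by its own ADLS Gen2 account.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS bronze_dev
    MANAGED LOCATION 'abfss://bronze@bronzedevstorage.dfs.core.windows.net/managed'
""")

spark.sql("""
    CREATE CATALOG IF NOT EXISTS silver_dev
    MANAGED LOCATION 'abfss://silver@silverdevstorage.dfs.core.windows.net/managed'
""")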

I hope this helps.


BS_THE_ANALYST
Honored Contributor III

@Coffee77 From the docs, I thought it was 1 metastore per region and 1 Unity Catalog mapped to that: https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#meta... 

You then have as many workspaces as needed within that region. Each workspace contains the 3-level namespace, catalog->schema->table. You can have as many catalogs (the top-level namespace) as you require. You can host your bronze, silver, and gold within a single workspace, i.e. a workspace called "dev". Perhaps you guys already know about this region constraint?

I think Unity Catalog abstracts the storage part away from us. Essentially, it sits on top of a single ADLS Gen2, as you mention. Databricks nicely handles the storage under the hood for us, i.e. the containers.
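
As a small illustration of that three-level namespace (the catalog, schema, and table names below are hypothetical):

# Sketch: bronze/silver/gold as catalogs inside one workspace, with storage managed by Unity Catalog.
# Without a MANAGED LOCATION, managed tables land in the metastore's (or catalog's) default storage.
spark.sql("CREATE CATALOG IF NOT EXISTS bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze.raw")

# Three-level name: catalog.schema.table
df = spark.table("bronze.raw.huge_dataset")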

I think the time you'd need to consider additional ADLS accounts is when you exceed limits like this:

(screenshot: limits table from the linked Databricks documentation)

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#mana... 

That is a really interesting point and one I wouldn't have considered without you raising it. I do fear it's easy to overcomplicate a design, though. Thanks for bringing this type of thing to my attention @Coffee77 

I must say, I'm pleasantly surprised to find these finer details in the documentation provided by Databricks.

All the best,
BS

I work in a very large organization that has a special agreement with Databricks to be able to create multiple metastores per region. This is nice because we don't have to put all of our heterogeneous applications in the same metastore.

On the other hand, we have one metastore managing dev, qa, stage and prod. As said, each environment has three catalogs placed in dedicated storage accounts, but Unity Catalog manages everything (connections, etc.). The other metastore, in a different region, is for cross-region DR (Disaster Recovery) purposes, with an additional BCP (Business Continuity Plan) environment...

Security and performance requirements are very strict in our case, so we have to consider every possible technique!


BS_THE_ANALYST
Honored Contributor III

@Coffee77 that's awesome, I didn't even know that was possible. Every day is a school day 🤓. That setup looks exciting...😂 Thanks for sharing! 👏

All the best,
BS

Same for me, lifelong learner 🙂


ManojkMohan
Contributor III

Thanks to all for your suggestions. I'm trying out the optimal next steps based on these responses and will post an update here with screenshots soon.