
Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy

ManojkMohan
Valued Contributor III

Problem I am trying to solve:

Bronze is the landing zone for immutable, raw data.

At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances speed and size.

Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.

Question:

  1. Like Kaggle, are there any sources where I can get a good-quality dataset of around 100 TB (a combination of unstructured, semi-structured, and structured data)?
  2. Is reading the 100 TB of data as shown below the recommended best practice?

 

# Step 1: Read raw data (CSV/JSON/Avro; update format as needed)
raw_df = (
    spark.read.format("csv")                 # Change to "json" / "avro" if source differs
    .option("header", "true")                # Use header if CSV
    .option("inferSchema", "true")           # Infers schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")     # Path to raw 100 TB dataset
)

# Step 2: Write into Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")         # Lightweight compression for Bronze
    .mode("overwrite")                       # Overwrite Bronze zone if rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")  # Bronze layer storage path
)
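
Since inferSchema forces an extra pass over 100 TB of CSV before the real read even starts, a minimal sketch of the alternative I am weighing, with an explicit schema (the column names and types here are placeholders, not the real dataset):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Placeholder schema; replace with the real column list of the raw files
raw_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("event_ts", TimestampType(), True),
    StructField("snr_db", DoubleType(), True),
])

raw_df = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(raw_schema)                      # Explicit schema: no inference pass over 100 TB
    .load("dbfs:/mnt/raw/huge_dataset/")
)

Another option is to let Spark infer the schema once from a small sample of files and then hard-code the result, which keeps the convenience without the full scan.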

15 REPLIES

ManojkMohan
Valued Contributor III

@szymon_dybczak @BS_THE_ANALYST @Coffee77 @TheOC, the use case summary is as below.

The use case: 

A telecom operator wants to minimize unnecessary truck rolls (sending technicians to customer sites), which cost $100–$200 per visit.

Data sources feeding into the data platform:

Network telemetry – SNMP traps, modem/router health (e.g., SNR, packet loss, outages).
IoT device data – ONT, set-top boxes, CPE logs.
CRM & Billing data – open tickets, service type, SLA tiers.
Geospatial/weather feeds – storm events, regional outages.
Technician logs – prior visit outcomes.
All this lands in the Bronze layer as unstructured JSON, CSV, log files, and streaming events.
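
For the file-drop and streaming feeds, a minimal sketch of one possible Bronze ingest using Auto Loader (just one option, not something settled in this thread; the paths and table name below are placeholders):

# Sketch: incremental Bronze ingest of raw JSON telemetry with Auto Loader
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")               # Raw telemetry arrives as JSON
    .option("cloudFiles.schemaLocation", "dbfs:/mnt/bronze/_schemas/telemetry/")
    .load("dbfs:/mnt/raw/telemetry/")                  # Placeholder landing path
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "dbfs:/mnt/bronze/_checkpoints/telemetry/")
    .trigger(availableNow=True)                        # Drain the backlog, then stop
    .toTable("bronze.telemetry_raw")                   # Placeholder Bronze table, minimal transformation
)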

Why Parquet in Silver Layer?

The Silver layer aggregates and cleans this into a customer/equipment-level service health dataset:

Customer ID, Service ID, Site ID
Last 24h modem health KPIs (uptime, SNR, packet loss)
Outage correlation (area-wide vs. local issue)
Historical technician visits and resolution codes
Predictive probability: "Can this issue be fixed remotely?"
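
A minimal sketch of the kind of Silver aggregation I mean, assuming placeholder Bronze table and column names (the real KPI logic would be richer):

from pyspark.sql import functions as F

# Sketch: last-24h modem health KPIs rolled up to customer/service/site level
silver_health = (
    spark.table("bronze.telemetry_raw")                # Placeholder Bronze table
    .where(F.col("event_ts") >= F.expr("current_timestamp() - INTERVAL 24 HOURS"))
    .groupBy("customer_id", "service_id", "site_id")
    .agg(
        F.avg("snr_db").alias("avg_snr_db"),
        F.avg("packet_loss_pct").alias("avg_packet_loss_pct"),
        F.avg("uptime_pct").alias("avg_uptime_pct"),
    )
)

(
    silver_health.write.format("parquet")
    .option("compression", "snappy")
    .mode("overwrite")
    .save("dbfs:/mnt/silver/service_health/")          # Placeholder Silver path
)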


Benefits of Parquet here that I am trying to achieve:

 

Efficient analytics – Technicians need KPIs by device or site; Parquet’s columnar format makes these queries 5–10× faster than scanning the raw CSV/JSON.
Compression – IoT + network telemetry is massive; Parquet reduces footprint dramatically.
Schema evolution – New device types (5G routers, IoT sensors) can be added without breaking downstream integrations.
Reusability – Same Parquet data powers ML models (predicting if a truck roll is necessary) and operational dashboards.
But you have a very valid suggestion that I am trying: on ingest, also write to a Delta table with minimal transformation. This becomes the query-friendly version of the raw data.
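
A minimal sketch of that suggestion, reusing raw_df from the snippet in my question (the table name and metadata columns are placeholders added only for traceability):

from pyspark.sql import functions as F

# Sketch: on ingest, also land the raw data in a Delta table with minimal transformation
(
    raw_df
    .withColumn("_ingest_ts", F.current_timestamp())   # When the row was ingested
    .withColumn("_source_file", F.input_file_name())   # Which raw file it came from
    .write.format("delta")
    .mode("append")                                     # Bronze as an append-only landing zone
    .saveAsTable("bronze.huge_dataset_raw")             # Placeholder, query-friendly raw table
)

Keeping this write append-only rather than overwrite preserves the immutable landing-zone property of Bronze described above.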