
Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy

ManojkMohan
Valued Contributor III

Problem I am trying to solve:

Bronze is the landing zone for immutable, raw data.

At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances speed and size.

Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.

Question:

  1. Like Kaggle, are there any sources where I can get a good-quality dataset of around 100 TB (a combination of unstructured, semi-structured, and structured data)?
  2. Is reading the 100 TB of data as shown below the recommended best practice?

 

# Step 1: Read raw data (CSV/JSON/Avro; update format as needed)
raw_df = (
    spark.read.format("csv")                 # Change to "json" / "avro" if source differs
    .option("header", "true")                # Use header if CSV
    .option("inferSchema", "true")           # Infers schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")     # Path to raw 100 TB dataset
)

# Step 2: Write into Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")         # Lightweight compression for Bronze
    .mode("overwrite")                       # Overwrite Bronze zone if rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")  # Bronze layer storage path
)
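
Since inferSchema forces an extra pass over 100 TB of CSV before the real read even starts, a minimal sketch of the alternative I am weighing, with an explicit schema (the column names and types here are placeholders, not the real dataset):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Placeholder schema; replace with the real column list of the raw files
raw_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("event_ts", TimestampType(), True),
    StructField("snr_db", DoubleType(), True),
])

raw_df = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(raw_schema)                      # Explicit schema: no inference pass over 100 TB
    .load("dbfs:/mnt/raw/huge_dataset/")
)

Another option is to let Spark infer the schema once from a small sample of files and then hard-code the result, which keeps the convenience without the full scan.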

15 REPLIES

ManojkMohan
Valued Contributor III

@szymon_dybczak @BS_THE_ANALYST @Coffee77 @TheOC, the use case summary is as below.

The use case: 

A telecom operator wants to minimize unnecessary truck rolls (sending technicians to customer sites), which cost $100–$200 per visit.

Data sources feeding into the data platform:

Network telemetry – SNMP traps, modem/router health (e.g., SNR, packet loss, outages).
IoT device data – ONT, set-top boxes, CPE logs.
CRM & Billing data – open tickets, service type, SLA tiers.
Geospatial/weather feeds – storm events, regional outages.
Technician logs – prior visit outcomes.
All this lands in the Bronze layer as unstructured JSON, CSV, log files, and streaming events.
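
For the file-drop and streaming feeds, a minimal sketch of one possible Bronze ingest using Auto Loader (just one option, not something settled in this thread; the paths and table name below are placeholders):

# Sketch: incremental Bronze ingest of raw JSON telemetry with Auto Loader
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")               # Raw telemetry arrives as JSON
    .option("cloudFiles.schemaLocation", "dbfs:/mnt/bronze/_schemas/telemetry/")
    .load("dbfs:/mnt/raw/telemetry/")                  # Placeholder landing path
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "dbfs:/mnt/bronze/_checkpoints/telemetry/")
    .trigger(availableNow=True)                        # Drain the backlog, then stop
    .toTable("bronze.telemetry_raw")                   # Placeholder Bronze table, minimal transformation
)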

Why Parquet in Silver Layer?

The Silver layer aggregates and cleans this into a customer/equipment-level service health dataset:

Customer ID, Service ID, Site ID
Last 24h modem health KPIs (uptime, SNR, packet loss)
Outage correlation (area-wide vs. local issue)
Historical technician visits and resolution codes
Predictive probability: "Can this issue be fixed remotely?"
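
A minimal sketch of the kind of Silver aggregation I mean, assuming placeholder Bronze table and column names (the real KPI logic would be richer):

from pyspark.sql import functions as F

# Sketch: last-24h modem health KPIs rolled up to customer/service/site level
silver_health = (
    spark.table("bronze.telemetry_raw")                # Placeholder Bronze table
    .where(F.col("event_ts") >= F.expr("current_timestamp() - INTERVAL 24 HOURS"))
    .groupBy("customer_id", "service_id", "site_id")
    .agg(
        F.avg("snr_db").alias("avg_snr_db"),
        F.avg("packet_loss_pct").alias("avg_packet_loss_pct"),
        F.avg("uptime_pct").alias("avg_uptime_pct"),
    )
)

(
    silver_health.write.format("parquet")
    .option("compression", "snappy")
    .mode("overwrite")
    .save("dbfs:/mnt/silver/service_health/")          # Placeholder Silver path
)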


Benefits of Parquet here that I am trying to achieve:

 

Efficient analytics – Technicians need KPIs by device or site; Parquet’s columnar format makes these queries 5–10× faster than scanning the raw CSV/JSON.
Compression – IoT + network telemetry is massive; Parquet reduces footprint dramatically.
Schema evolution – New device types (5G routers, IoT sensors) can be added without breaking downstream integrations.
Reusability – Same Parquet data powers ML models (predicting if a truck roll is necessary) and operational dashboards.
But you have a very valid suggestion that I am trying: on ingest, also write to a Delta table with minimal transformation. This becomes the query-friendly version of the raw data.
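
A minimal sketch of that suggestion, reusing raw_df from the snippet in my question (the table name and metadata columns are placeholders added only for traceability):

from pyspark.sql import functions as F

# Sketch: on ingest, also land the raw data in a Delta table with minimal transformation
(
    raw_df
    .withColumn("_ingest_ts", F.current_timestamp())   # When the row was ingested
    .withColumn("_source_file", F.input_file_name())   # Which raw file it came from
    .write.format("delta")
    .mode("append")                                     # Bronze as an append-only landing zone
    .saveAsTable("bronze.huge_dataset_raw")             # Placeholder, query-friendly raw table
)

Keeping this write append-only rather than overwrite preserves the immutable landing-zone property of Bronze described above.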