Problem I am trying to solve:
Bronze is the landing zone for immutable, raw data.
At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances speed and size.
Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less to store, responds far better to business queries, and lets organizations unlock value from vast volumes of raw data.
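For context, here is a minimal sketch of what I mean by "Parquet + Snappy at Bronze" (the dataset name and dbfs paths below are placeholders, not my real layout): Snappy is set once as the session-wide Parquet codec, and then every Bronze write picks it up.

from pyspark.sql import SparkSession

# Placeholder session; on Databricks, spark already exists and getOrCreate() reuses it
spark = SparkSession.builder.appName("bronze_snappy_sketch").getOrCreate()

# Snappy becomes the default codec for every Parquet write in this session
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Hypothetical raw source and Bronze target paths, for illustration only
sample_df = spark.read.json("dbfs:/mnt/raw/sample_dataset/")
sample_df.write.format("parquet").mode("append").save("dbfs:/mnt/bronze/sample_dataset/")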
Questions:
- Are there any sources like Kaggle where I can get a good-quality dataset of around 100 TB that combines unstructured, semi-structured, and structured data?
- Is reading the 100 TB dataset as shown below the recommended best practice?
# Step 1: Read raw data (CSV/JSON/Avro — update format as needed)
raw_df = (
    spark.read.format("csv")                 # Change to "json" / "avro" if source differs
    .option("header", "true")                # Use header if CSV
    .option("inferSchema", "true")           # Infers schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")     # Path to raw 100TB dataset
)
# Step 2: Write into Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")         # Lightweight compression for Bronze
    .mode("overwrite")                       # Overwrite Bronze zone if rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")  # Bronze layer storage path
)
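Or, since inferSchema over 100 TB looks expensive, should I declare the schema explicitly instead? A rough sketch of what I have in mind (the column names are made up for illustration):

from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# Hypothetical schema; the real dataset's columns would go here
raw_schema = StructType([
    StructField("event_id", LongType(), True),
    StructField("event_type", StringType(), True),
    StructField("event_ts", TimestampType(), True),
    StructField("payload", StringType(), True),
])

raw_df = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(raw_schema)                      # Explicit schema instead of inferSchema
    .load("dbfs:/mnt/raw/huge_dataset/")
)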