Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy

ManojkMohan
Contributor III

Problem I am trying to solve:

Bronze is the landing zone for immutable, raw data.

At this stage, I am trying to use a columnar format (Parquet or ORC) for good compression and efficient scans, and then apply lightweight compression (e.g., Snappy) to balance speed and size.

Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less to store, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.

Question:

  1. Like Kaggle, are there any sources where I can get a good-quality ~100 TB dataset (a mix of unstructured, semi-structured, and structured data)?
  2. Is reading the 100 TB of data as shown below the recommended best practice?

 

# Step 1: Read raw data (CSV/JSON/Avro, update format as needed)
raw_df = (
    spark.read.format("csv")                  # Change to "json" / "avro" if source differs
    .option("header", "true")                 # Use header if CSV
    .option("inferSchema", "true")            # Infers schema (can be expensive for huge datasets)
    .load("dbfs:/mnt/raw/huge_dataset/")      # Path to raw 100 TB dataset
)

# Step 2: Write into Bronze layer with Parquet + Snappy compression
(
    raw_df.write.format("parquet")
    .option("compression", "snappy")          # Lightweight compression for Bronze
    .mode("overwrite")                        # Overwrite Bronze zone if rerun
    .save("dbfs:/mnt/bronze/huge_dataset/")   # Bronze layer storage path
)

1 REPLY

TheOC
Contributor

Hey @ManojkMohan,
Great questions!

On the first question around getting large example datasets, I'd recommend the following places:

  1. AWS Registry of Open Data (https://registry.opendata.aws/)
  2. Google Cloud BigQuery Public Datasets (https://cloud.google.com/bigquery/public-data)

The last time I dug into these, it was possible to get very large datasets from them. Another good shout would be government open data portals.

On the second question, your direction is correct, but at 100 TB I'd really recommend not inferring the schema if possible (schema inference is very computationally expensive on data of that size); supply an explicit schema instead. You should also avoid reading the entire 100 TB at once and instead break it into smaller incremental ingests. I'd recommend taking a look at Auto Loader, which processes files incrementally and keeps track of what it has already ingested; a rough sketch is below. This will be much more robust and reliable if it's possible in your use case.
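
Here's a minimal sketch of what that could look like, assuming CSV sources under the same paths as in your snippet and a placeholder schema (the column names here are made up; replace them with your real ones):

from pyspark.sql import types as T

# Hypothetical explicit schema; replace with your actual columns
raw_schema = T.StructType([
    T.StructField("id", T.LongType(), True),
    T.StructField("event_ts", T.TimestampType(), True),
    T.StructField("payload", T.StringType(), True),
])

# Incremental ingest with Auto Loader instead of one giant batch read
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")       # Source file format
    .option("header", "true")                 # CSV header row
    .schema(raw_schema)                       # Explicit schema, no inference
    .load("dbfs:/mnt/raw/huge_dataset/")      # Raw landing path (assumed, same as above)
)

# Write into Bronze as Parquet + Snappy; the checkpoint tracks already-ingested files
(
    bronze_stream.writeStream
    .format("parquet")
    .option("compression", "snappy")
    .option("path", "dbfs:/mnt/bronze/huge_dataset/")
    .option("checkpointLocation", "dbfs:/mnt/bronze/_checkpoints/huge_dataset/")
    .trigger(availableNow=True)               # Work through the existing backlog, then stop
    .start()
)

The availableNow trigger lets you run this as a scheduled job that chews through the backlog in manageable batches and then stops, rather than running a continuous stream.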

Cheers,
TheOC
