<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130290#M48743</link>
    <description>&lt;P&gt;&lt;STRONG&gt;Problem I am trying to solve:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Bronze is the landing zone for immutable, raw data.&lt;/P&gt;&lt;P&gt;At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances speed and size.&lt;/P&gt;&lt;P&gt;Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Are there any sources, like Kaggle, where I can get good-quality 100 TB datasets (a combination of unstructured, semi-structured, and structured data)?&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;Is reading the 100 TB of data as shown below the recommended best practice?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;# Step 1: Read raw data (CSV/JSON/Avro; update format as needed)&lt;BR /&gt;raw_df = (&lt;BR /&gt;spark.read.format("csv") # Change to "json" / "avro" if source differs&lt;BR /&gt;.option("header", "true") # Use header if CSV&lt;BR /&gt;.option("inferSchema", "true") # Infers schema (can be expensive for huge datasets)&lt;BR /&gt;&lt;STRONG&gt;.load("dbfs:/mnt/raw/huge_dataset/") # Path to raw 100TB dataset&lt;/STRONG&gt;&lt;BR /&gt;)&lt;/P&gt;&lt;P&gt;# Step 2: Write into Bronze layer with Parquet + Snappy compression&lt;BR /&gt;(&lt;BR /&gt;raw_df.write.format("parquet")&lt;BR /&gt;.option("compression", "snappy") # Lightweight compression for Bronze&lt;BR /&gt;.mode("overwrite") # Overwrite Bronze zone if rerun&lt;BR /&gt;.save("dbfs:/mnt/bronze/huge_dataset/") # Bronze layer storage path&lt;BR /&gt;)&lt;/P&gt;</description>
    <pubDate>Sun, 31 Aug 2025 20:15:44 GMT</pubDate>
    <dc:creator>ManojkMohan</dc:creator>
    <dc:date>2025-08-31T20:15:44Z</dc:date>
    <item>
      <title>Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130290#M48743</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Problem I am trying to solve:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Bronze is the landing zone for immutable, raw data.&lt;/P&gt;&lt;P&gt;At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances speed and size.&lt;/P&gt;&lt;P&gt;Data stored in Parquet or ORC with lightweight compression at the Bronze layer costs much less, is far more responsive for business queries, and lets organizations unlock value from vast volumes of raw data.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Are there any sources, like Kaggle, where I can get good-quality 100 TB datasets (a combination of unstructured, semi-structured, and structured data)?&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;Is reading the 100 TB of data as shown below the recommended best practice?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;# Step 1: Read raw data (CSV/JSON/Avro; update format as needed)&lt;BR /&gt;raw_df = (&lt;BR /&gt;spark.read.format("csv") # Change to "json" / "avro" if source differs&lt;BR /&gt;.option("header", "true") # Use header if CSV&lt;BR /&gt;.option("inferSchema", "true") # Infers schema (can be expensive for huge datasets)&lt;BR /&gt;&lt;STRONG&gt;.load("dbfs:/mnt/raw/huge_dataset/") # Path to raw 100TB dataset&lt;/STRONG&gt;&lt;BR /&gt;)&lt;/P&gt;&lt;P&gt;# Step 2: Write into Bronze layer with Parquet + Snappy compression&lt;BR /&gt;(&lt;BR /&gt;raw_df.write.format("parquet")&lt;BR /&gt;.option("compression", "snappy") # Lightweight compression for Bronze&lt;BR /&gt;.mode("overwrite") # Overwrite Bronze zone if rerun&lt;BR /&gt;.save("dbfs:/mnt/bronze/huge_dataset/") # Bronze layer storage path&lt;BR /&gt;)&lt;/P&gt;</description>
      <pubDate>Sun, 31 Aug 2025 20:15:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130290#M48743</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-08-31T20:15:44Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130293#M48744</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Great questions!&lt;BR /&gt;&lt;BR /&gt;On the first question around getting large example datasets, I'd recommend the following places:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;AWS Registry of Open Data (&lt;A href="https://registry.opendata.aws/" target="_blank"&gt;https://registry.opendata.aws/&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;Google Cloud BigQuery Public Datasets (&lt;A href="https://cloud.google.com/bigquery/public-data" target="_blank"&gt;https://cloud.google.com/bigquery/public-data&lt;/A&gt;)&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;At least the last time I had a dig into these, it was possible to get very large datasets from them. The other shout could be government open data portals.&lt;/P&gt;&lt;P&gt;On the second question, your direction is correct, but at 100 TB I'd really recommend not inferring the schema if possible (this is very computationally expensive on data of that size). You should also consider not trying to read the entire 100 TB at once, and instead break it into smaller incremental ingests. I would recommend taking a look at &lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/" target="_self"&gt;Auto Loader&lt;/A&gt;,&amp;nbsp;which processes files incrementally and keeps track of what it has already ingested. This will be much more robust and reliable if it can be used in your use case.&lt;/P&gt;</description>
      <pubDate>Sun, 31 Aug 2025 21:58:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130293#M48744</guid>
      <dc:creator>TheOC</dc:creator>
      <dc:date>2025-08-31T21:58:33Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130361#M48770</link>
      <description>&lt;P&gt;To add to this:&lt;BR /&gt;Writing this data to Parquet is not the issue here.&lt;BR /&gt;Just make sure the CSV file is stored on HDFS-enabled storage.&lt;BR /&gt;The hard part is making queries on this Parquet data reasonably fast.&amp;nbsp; So you will probably need some performance tuning, and besides bucketing/partitioning you are kinda limited in Parquet.&lt;BR /&gt;Delta Lake/Iceberg or Databricks managed tables with predictive optimization might be better choices here.&lt;BR /&gt;There is also Parquet v2, which is not enabled by default, that you might wanna look into (it has better compression).&amp;nbsp; The latter, however, I have not yet tested myself.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 11:48:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130361#M48770</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-09-01T11:48:32Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130363#M48771</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;To jump into the conversation: is there any particular reason why you don't want to load that CSV into Delta format? Delta has multiple advantages over regular Parquet.&lt;BR /&gt;Things like file skipping and predicate pushdown filtering are much more performant on Delta. On Delta you can apply Z-ordering, liquid clustering, etc., and Databricks can do some cool things for you, like predictive optimization, if you use the Delta format.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 12:16:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130363#M48771</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-01T12:16:42Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130403#M48779</link>
      <description>&lt;P&gt;My recommendation is to use Delta tables with liquid clustering and a separate cloud storage for each bronze, silver or gold layer in your medallion architecture. In addition, schedule a periodic job to optimize the Delta tables and vacuum "obsolete" Parquet files. From there, check how performance looks and try to fine-tune.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 15:56:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130403#M48779</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-09-01T15:56:35Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130408#M48783</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt;&amp;nbsp;I'm curious why you'd want a separate cloud storage for each layer in the medallion architecture? I'd have thought that overcomplicates things. What exactly do you mean by "separate cloud storage", I think I'm misunderstanding &lt;span class="lia-unicode-emoji" title=":thinking_face:"&gt;🤔&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 16:30:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130408#M48783</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-01T16:30:29Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130414#M48784</link>
      <description>&lt;P&gt;It is a suggestion based on the architecture chosen for the main Databricks application I am currently working on. Let me explain in detail:&lt;/P&gt;&lt;P&gt;- One workspace per environment, all of them governed by Unity Catalog with two metastores.&lt;/P&gt;&lt;P&gt;- Each workspace contains three different catalogs, to cover the "bronze", "silver" and "gold" layers. All of them use Delta tables.&lt;/P&gt;&lt;P&gt;- After discussing with our main cloud architect (a Microsoft MVP), we came to the conclusion that placing each catalog in a dedicated Azure Data Lake Storage account (ADLS Gen2) would be much better for performance than having only one ADLS Gen2 account with separate containers per environment or similar. In the latter case, throughput would be shared by all layers in the medallion architecture. So, with dedicated storage per layer, performance is much better for I/O operations.&lt;/P&gt;&lt;P&gt;- So, each layer has a related catalog, and each catalog is physically placed in a different Azure Storage account.&lt;/P&gt;&lt;P&gt;I hope this helps.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 19:05:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130414#M48784</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-09-01T19:05:01Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130418#M48786</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt;&amp;nbsp; From the docs, I thought it was 1 metastore per region and 1 Unity Catalog mapped to that: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#metastores" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#metastores&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;You then have as many workspaces as needed within that region. Each workspace contains the 3-level namespace, catalog-&amp;gt;schema-&amp;gt;table. You can have as many catalogs (top-level namespace) as you require. You can host your bronze, silver and gold within a single workspace, i.e. a workspace called "dev".&amp;nbsp; Perhaps you guys already know about this region constraint?&lt;BR /&gt;&lt;BR /&gt;I think Unity Catalog abstracts the storage part away from us. Essentially, it sits on top of a single ADLS2, as you mention. 
Databricks nicely handles the storage under the hood for us, i.e., containers.&lt;BR /&gt;&lt;BR /&gt;I think the time you'd need to consider additional ADLS is when we exceed limits like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_0-1756754789804.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19552i0947E6B0FAC0FE66/image-size/medium?v=v2&amp;amp;px=400" role="button" title="BS_THE_ANALYST_0-1756754789804.png" alt="BS_THE_ANALYST_0-1756754789804.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#managed-storage" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#managed-storage&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;That is a really interesting point and one I wouldn't have considered without you raising it.&amp;nbsp;I do fear it's easy to overcomplicate a design, though. Thanks for bringing this type of thing to my attention&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I must say, I'm pleasantly surprised to find these finer details in the documentation provided by Databricks.&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 19:32:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130418#M48786</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-01T19:32:12Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130422#M48789</link>
      <description>&lt;P&gt;I work in a very large organization that has a special agreement with Databricks to be able to create multiple metastores per region. This is useful because we don't have to include all of our heterogeneous applications in the same metastore.&lt;/P&gt;&lt;P&gt;On the other hand, we have one metastore managing dev, qa, stage and prod. As I said, each environment has three catalogs placed in dedicated storage accounts, but we use Unity Catalog to manage everything (connections, etc.). The other metastore, in a different region, is for cross-region DR (Disaster Recovery) purposes, with an additional BCP (Business Continuity Plan) environment...&lt;/P&gt;&lt;P&gt;Security and performance requirements are very strict in our case, so we have to consider every possible technique!&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 19:51:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130422#M48789</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-09-01T19:51:39Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130423#M48790</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt;&amp;nbsp;that's awesome, I didn't even know that was possible. Every day is a school day &lt;span class="lia-unicode-emoji" title=":nerd_face:"&gt;🤓&lt;/span&gt;. That setup looks exciting...&lt;span class="lia-unicode-emoji" title=":face_with_tears_of_joy:"&gt;😂&lt;/span&gt; Thanks for sharing! &lt;span class="lia-unicode-emoji" title=":clapping_hands:"&gt;👏&lt;/span&gt;.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 20:00:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130423#M48790</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-01T20:00:16Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130424#M48791</link>
      <description>&lt;P&gt;Thanks, all, for your suggestions. I am trying out the optimal next steps based on these responses and will post an update here with screenshots soon.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 20:11:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130424#M48791</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-01T20:11:51Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130425#M48792</link>
      <description>&lt;P&gt;Same for me, lifelong learner &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 20:26:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130425#M48792</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-09-01T20:26:38Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130559#M48831</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/171339"&gt;@TheOC&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt; the latest on this&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Batch Ingestion (10 TB Synthetic Data)&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Data Generation&lt;/LI&gt;&lt;UL&gt;&lt;LI&gt;Spark generates 100 billion synthetic rows (~10 TB).&lt;/LI&gt;&lt;LI&gt;Columns: id, random_val, category, payload.&lt;/LI&gt;&lt;LI&gt;Partitioned across 10,000 partitions for parallelism.&lt;/LI&gt;&lt;/UL&gt;&lt;LI&gt;Data Storage&lt;/LI&gt;&lt;UL&gt;&lt;LI&gt;Uses Databricks Unity Catalog: Catalog = 10tb, Schema = bronze.&lt;/LI&gt;&lt;LI&gt;Data is written as a managed Delta table: bronze.synthetic_10tb.&lt;/LI&gt;&lt;LI&gt;This is batch ingestion — one-time write of massive dataset&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_0-1756846467272.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19587iF10D83013595E693/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_0-1756846467272.png" alt="ManojkMohan_0-1756846467272.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/OL&gt;&lt;P&gt;Streaming Ingestion (~500 MB/day)&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Data Generation&lt;/LI&gt;&lt;UL&gt;&lt;LI&gt;Spark structured streaming using rate source (~50 rows/sec → 500 MB/day).&lt;/LI&gt;&lt;LI&gt;Columns: id, random_val, category, payload.&lt;/LI&gt;&lt;/UL&gt;&lt;LI&gt;Checkpointing&lt;/LI&gt;&lt;UL&gt;&lt;LI&gt;Required for exactly-once 
guarantees.&lt;/LI&gt;&lt;LI&gt;Stored on S3 bucket: s3://streamingdataproto735/checkpoints/synthetic_500mb_continuous.&lt;/LI&gt;&lt;/UL&gt;&lt;LI&gt;Data Storage&lt;/LI&gt;&lt;UL&gt;&lt;LI&gt;Appends continuously to the same Bronze Delta table: 10tb.bronze.synthetic_10tb.&lt;/LI&gt;&lt;LI&gt;Supports indefinite streaming ingestion with micro-batches (availableNow or timed trigger).&amp;nbsp;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_1-1756846485273.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19588i30A80386465D3505/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_1-1756846485273.png" alt="ManojkMohan_1-1756846485273.png" /&gt;&lt;/span&gt;&lt;P&gt;I am currently resolving an AWS credentials configuration error. I'd appreciate your thoughts in parallel. I will summarise all the learnings and publish a knowledge article on our collective learnings.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 02 Sep 2025 20:55:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130559#M48831</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-02T20:55:31Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130629#M48856</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;You could just do the checkpointing inside a volume within Unity Catalog. What's the benefit to having this externally?&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I think resolving the AWS creds config is still good for learning but you can bypass that.&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Wed, 03 Sep 2025 09:50:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130629#M48856</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-03T09:50:26Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130895#M48935</link>
      <description>&lt;P&gt;The use case:&amp;nbsp;&lt;/P&gt;&lt;P&gt;A telecom operator wants to minimize unnecessary truck rolls (sending technicians to customer sites), which cost $100–$200 per visit.&lt;/P&gt;&lt;P&gt;Data sources feeding into the data platform:&lt;/P&gt;&lt;P&gt;Network telemetry&amp;nbsp;– SNMP traps, modem/router health (e.g., SNR, packet loss, outages).&lt;BR /&gt;IoT device data&amp;nbsp;– ONT, set-top boxes, CPE logs.&lt;BR /&gt;CRM &amp;amp; Billing data&amp;nbsp;– open tickets, service type, SLA tiers.&lt;BR /&gt;Geospatial/weather feeds&amp;nbsp;– storm events, regional outages.&lt;BR /&gt;Technician logs&amp;nbsp;– prior visit outcomes.&lt;BR /&gt;All this lands in the Bronze layer as unstructured JSON, CSV, log files, and streaming events.&lt;/P&gt;&lt;P&gt;Why Parquet in Silver Layer?&lt;/P&gt;&lt;P&gt;The Silver layer aggregates and cleans this into a customer/equipment-level service health dataset:&lt;/P&gt;&lt;P&gt;Customer ID, Service ID, Site ID&lt;BR /&gt;Last 24h modem health KPIs (uptime, SNR, packet loss)&lt;BR /&gt;Outage correlation (area-wide vs. 
local issue)&lt;BR /&gt;Historical technician visits and resolution codes&lt;BR /&gt;Predictive probability: "Can this issue be fixed remotely?"&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Benefits of Parquet here, which I am trying to achieve:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Efficient analytics&amp;nbsp;– Technicians need KPIs by device or site; Parquet’s columnar format makes queries 5–10× faster.&lt;BR /&gt;Compression&amp;nbsp;– IoT + network telemetry is massive; Parquet reduces footprint dramatically.&lt;BR /&gt;Schema evolution&amp;nbsp;– New device types (5G routers, IoT sensors) can be added without breaking downstream integrations.&lt;BR /&gt;Reusability&amp;nbsp;– Same Parquet data powers ML models (predicting if a truck roll is necessary) and operational dashboards.&lt;BR /&gt;But you have made a very valid suggestion that I am trying:&amp;nbsp;on ingest, also write to a Delta table with minimal transformation. This becomes the query-friendly version of raw.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 20:15:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130895#M48935</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-04T20:15:19Z</dc:date>
    </item>
    <item>
      <title>Re: Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy</title>
      <link>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130896#M48936</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/171339"&gt;@TheOC&lt;/a&gt;&amp;nbsp; the use case summary is as below:&amp;nbsp;&lt;/P&gt;&lt;P&gt;The use case:&amp;nbsp;&lt;/P&gt;&lt;P&gt;A telecom operator wants to minimize unnecessary truck rolls (sending technicians to customer sites), which cost $100–$200 per visit.&lt;/P&gt;&lt;P&gt;Data sources feeding into the data platform:&lt;/P&gt;&lt;P&gt;Network telemetry&amp;nbsp;– SNMP traps, modem/router health (e.g., SNR, packet loss, outages).&lt;BR /&gt;IoT device data&amp;nbsp;– ONT, set-top boxes, CPE logs.&lt;BR /&gt;CRM &amp;amp; Billing data&amp;nbsp;– open tickets, service type, SLA tiers.&lt;BR /&gt;Geospatial/weather feeds&amp;nbsp;– storm events, regional outages.&lt;BR /&gt;Technician logs&amp;nbsp;– prior visit outcomes.&lt;BR /&gt;All this lands in the Bronze layer as unstructured JSON, CSV, log files, and streaming events.&lt;/P&gt;&lt;P&gt;Why Parquet in Silver Layer?&lt;/P&gt;&lt;P&gt;The Silver layer aggregates and cleans this into a customer/equipment-level service health dataset:&lt;/P&gt;&lt;P&gt;Customer ID, Service ID, Site ID&lt;BR /&gt;Last 24h modem health KPIs (uptime, SNR, packet loss)&lt;BR /&gt;Outage correlation (area-wide vs. 
local issue)&lt;BR /&gt;Historical technician visits and resolution codes&lt;BR /&gt;Predictive probability: "Can this issue be fixed remotely?"&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Benefits of Parquet here, which I am trying to achieve:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Efficient analytics&amp;nbsp;– Technicians need KPIs by device or site; Parquet’s columnar format makes queries 5–10× faster.&lt;BR /&gt;Compression&amp;nbsp;– IoT + network telemetry is massive; Parquet reduces footprint dramatically.&lt;BR /&gt;Schema evolution&amp;nbsp;– New device types (5G routers, IoT sensors) can be added without breaking downstream integrations.&lt;BR /&gt;Reusability&amp;nbsp;– Same Parquet data powers ML models (predicting if a truck roll is necessary) and operational dashboards.&lt;BR /&gt;But you have made a very valid suggestion that I am trying:&amp;nbsp;on ingest, also write to a Delta table with minimal transformation. This becomes the query-friendly version of raw.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 20:16:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/ingesting-100-tb-raw-csv-data-into-the-bronze-layer-in-parquet/m-p/130896#M48936</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-04T20:16:47Z</dc:date>
    </item>
  </channel>
</rss>

