06-05-2023 02:57 PM
My source can only deliver CSV format (pipe delimited).
My source has the ability to generate multiple CSV files and transfer them to a single upload folder.
All rows must go to the same target bronze delta table.
I do not care about the order in which the rows are loaded.
The bronze target table columns can all be strings.
I am trying to find out the following:
A: Is uploading multiple CSV files and loading them with a single stream reader / stream writer statement the quickest way to load this data? In other words, are multiple input files the way to introduce parallelism into the process? And is it correct that a single CSV file is processed single-threaded?
B: Is there an optimal number of files and/or file size that the source data should be broken into to maximize ingestion speed? For example, should the number of files match the number of worker nodes, or a multiple thereof? And does file size matter, or is matching the file count to a multiple of the worker count all that counts?
C: Is there anything else I should be doing to improve load times?
06-05-2023 10:05 PM
@Michael Popp
In my opinion, the best approach is to split the file into partitions (you need to find the best-fit column to partition on) and ingest them using Auto Loader with trigger=AvailableNow (batch mode), writing to the same partitions the files are split by.
That gives you both parallelism and protection against data skew.
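For reference, a minimal sketch of that pattern might look like the following; the folder paths, checkpoint/schema locations, and table name are placeholders for this example, not values from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader over the pipe-delimited CSV upload folder; columns stay as strings
# because cloudFiles.inferColumnTypes is left off.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("sep", "|")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schema/my_feed")   # placeholder path
    .load("/mnt/landing/my_feed")                                         # placeholder path
)

# trigger(availableNow=True) processes everything currently in the folder as a batch
# and then stops, which suits a scheduled bronze ingestion job.
(
    df.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/my_feed")     # placeholder path
    .trigger(availableNow=True)
    .toTable("bronze.my_feed")                                            # placeholder table
)
```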
06-05-2023 11:12 PM
Have you tried loading the CSV file with the maxRowsInMemory option?
06-13-2023 07:36 AM
@Michael Popp :
To load a high volume of CSV rows in the fastest way possible into a Delta table in Databricks, you can follow these approaches and optimizations:
A. Uploading and loading multiple CSV files:
B. Optimal number of files and file sizes:
C. Other optimizations to improve load times:
By considering these approaches and optimizations, you can significantly improve the load times for your high-volume CSV data ingestion into a Delta table in Databricks.
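As an illustration of point A (Spark parallelizes a load across multiple input files), a plain batch read of the whole upload folder could look like the sketch below; the folder path and table name are assumptions for the example, not details from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading a folder of pipe-delimited CSV files: each file (and each large file's splits)
# becomes one or more tasks, so many files are loaded in parallel across the cluster.
raw = (
    spark.read.format("csv")
    .option("sep", "|")
    .option("header", "true")
    .option("inferSchema", "false")   # leave every column as string for the bronze layer
    .load("/mnt/landing/my_feed/*.csv")   # placeholder path
)

# Append everything into the same bronze Delta table; row order is not preserved,
# which matches the requirement that load order does not matter.
raw.write.format("delta").mode("append").saveAsTable("bronze.my_feed")   # placeholder table
```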
06-14-2023 12:07 AM
Hi @Michael Popp
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!