I want to load a high volume of CSV rows (in excess of 5 billion) in the fastest way possible. What is the best approach, in terms of speed, for loading into the bronze table?

Michael42
New Contributor III

My source can only deliver CSV format (pipe delimited).

My source has the ability to generate multiple CSV files and transfer them to a single upload folder.

All rows must go to the same target bronze delta table.

I do not care about the order in which the rows are loaded.

The bronze target table columns can all be strings.

I am trying to find out the following:

A: Is uploading multiple CSV files and loading them with a single stream reader / stream writer statement the quickest way to load this data? In other words, are multiple input files the way to introduce parallelism into the process? And if it is a single CSV file, is the load single-threaded?

B: Is there some optimal number of files and/or file size that the source data should be broken down into in order to achieve the best ingestion speed? For example, should the number of files equal the number of worker nodes, or a multiple thereof? And does file size matter, or is the number of input files matching a multiple of the worker nodes all that counts?

C: Is there anything else I should be doing to improve the load times?

2 ACCEPTED SOLUTIONS


daniel_sahal
Esteemed Contributor

@Michael Popp

In my opinion, the best way would be to split the data into partitions (you need to find the best-fit column), ingest it with Auto Loader using trigger=AvailableNow (batching), and write to the same partitions the files are split by.

That way you get parallelism and avoid data skew at the same time.
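
A minimal PySpark sketch of this approach, assuming a hypothetical upload folder /mnt/landing/csv, a hypothetical best-fit partition column called region, and a placeholder table name bronze.raw_events; adjust all of these to your environment:

```python
# Sketch only: paths, table name, and the "region" partition column are assumptions.
landing_path = "/mnt/landing/csv"                    # folder the source uploads into
checkpoint_path = "/mnt/checkpoints/bronze_raw_events"

df = (
    spark.readStream.format("cloudFiles")            # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("sep", "|")                              # pipe-delimited source
    .option("header", "true")                        # drop if the files have no header row
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("cloudFiles.inferColumnTypes", "false")  # keep every column as string
    .load(landing_path)
)

(
    df.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                      # process what is there, then stop
    .partitionBy("region")                           # hypothetical best-fit partition column
    .toTable("bronze.raw_events")
)
```

With availableNow the stream processes everything currently in the folder and then stops, so it can be scheduled as a job and rerun as new files arrive, with the checkpoint tracking which files were already ingested.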


Anonymous
Not applicable

@Michael Popp:

To load a high volume of CSV rows in the fastest way possible into a Delta table in Databricks, you can follow these approaches and optimizations:

A. Uploading and loading multiple CSV files:

  1. Yes, uploading multiple CSV files and loading them with a single stream reader/stream writer statement introduces parallelism: Spark schedules the files across tasks, so more files (up to the number of available cores) means more of the load runs concurrently (see the sketch after this list).
  2. A single CSV file is not necessarily single-threaded: an uncompressed file can be split across tasks, but a compressed (e.g., gzip) file is read by a single task, which is much slower than parallel reads over many files.
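
As a rough illustration of point A, a single batch reader pointed at the whole upload folder already fans the work out across tasks; the folder path and table name here are placeholders:

```python
# One read over many pipe-delimited files: Spark assigns the files (and splits
# of uncompressed files) to parallel tasks automatically.
raw = (
    spark.read.format("csv")
    .option("sep", "|")
    .option("header", "true")
    .option("inferSchema", "false")   # all columns stay strings, as required for bronze
    .load("/mnt/landing/csv")         # placeholder upload folder
)

raw.write.format("delta").mode("append").saveAsTable("bronze.raw_events")  # placeholder table
```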

B. Optimal number of files and file sizes:

  1. The optimal number of files depends on the number of worker cores available in your Databricks cluster. To keep every core busy, aim for a file count that is at least equal to, and ideally a small multiple of, the total number of worker cores.
  2. File sizes also matter. Reasonably sized files process efficiently (roughly 100 MB to 1 GB per file is a common rule of thumb); very large files can cause memory and performance issues, so split the data into smaller files where possible.
  3. Experiment with different file sizes and file counts to find the best balance between parallelism and efficient processing for your specific workload (see the sketch after this list).
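
One way to sanity-check point B is to compare the number of input partitions Spark creates for the folder with the number of cores it can schedule onto; this is only a quick diagnostic, and the path is a placeholder:

```python
# How many read tasks will this file layout produce, and how many can run at once?
raw = (
    spark.read.format("csv")
    .option("sep", "|")
    .option("header", "true")
    .load("/mnt/landing/csv")         # placeholder upload folder
)

input_partitions = raw.rdd.getNumPartitions()
total_cores = spark.sparkContext.defaultParallelism

print(f"input partitions: {input_partitions}, available cores: {total_cores}")
# If input_partitions is well below total_cores, the files are too few or too large
# to keep the cluster busy; if it is vastly higher, the files may be too small.
```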

C. Other optimizations to improve load times:

  1. Utilize cluster auto-scaling: Enable cluster auto-scaling to automatically add or remove worker nodes based on the workload. This helps handle increased load during data ingestion and speeds up the process.
  2. Partitioning and Z-ordering: If possible, partition the Delta table on a column that downstream queries filter on, and consider Z-ordering (OPTIMIZE ... ZORDER BY) to further improve data retrieval.
  3. Use proper compression: Delta's underlying Parquet files use snappy compression by default, which balances speed and size well. If the source CSVs must be compressed, keep in mind that gzip files are not splittable and will reduce read parallelism.
  4. Optimize the number of columns: If your target table has many columns, consider only selecting the required columns during the load process. This can reduce the data size and improve loading performance.
  5. Tune cluster settings: Adjust the cluster configuration based on the workload. Increase the number of cores, memory, and shuffle partitions to optimize the performance during data ingestion.
  6. Monitor and optimize the write performance: Keep an eye on the write performance of your Delta table and tune batch sizes and other parameters accordingly (see the sketch below).
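
For point 6, one way to watch the write side is the Delta table history, which records per-commit metrics such as the number of files and rows written; the table name is a placeholder:

```python
# Inspect the metrics Delta records for each write on the bronze table.
history = spark.sql("DESCRIBE HISTORY bronze.raw_events")   # placeholder table name

(
    history
    .select("version", "timestamp", "operation", "operationMetrics")
    .orderBy("version", ascending=False)
    .show(truncate=False)
)

# Optionally compact small files after the initial load.
spark.sql("OPTIMIZE bronze.raw_events")
```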

By considering these approaches and optimizations, you can significantly improve the load times for your high-volume CSV data ingestion into a Delta table in Databricks.


4 REPLIES


Shivanshu_
New Contributor III

Have you tried loading the CSV files with the maxRowsInMemory option?


Anonymous
Not applicable

Hi @Michael Popp

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
