Data Engineering

Performance issue with the Simba ODBC driver performing a simple INSERT into Delta Lake

Indra
New Contributor

Hi,

Our team is using the Simba ODBC driver to load data into Delta Lake. For a table with 3 columns, it took around 55 seconds to insert 15 records.

How can we improve transactional loading into Delta Lake? Is there an option in the Simba ODBC driver to leverage bulk loading into Delta Lake (very important for data migration)? Or is there a way to configure a Delta Lake table to perform better for transactional workloads (very important for daily data synchronization from the source system to Delta Lake)?

Thanks

1 REPLY

Anonymous
Not applicable

@Indra Limena:

There are several ways to improve transactional loading into Delta Lake:

  1. Write to Delta Lake through its native Spark APIs instead of a generic third-party ODBC driver such as Simba. The native write path is optimized for Delta Lake and performs bulk writes, which can significantly improve performance.
  2. Load data in batches instead of inserting one record at a time, so the whole batch lands in a single Delta commit. This can also significantly improve performance (see the first sketch after this list).
  3. Use Delta Lake's streaming support to load data as soon as it becomes available. This is useful when data must land in near real time (see the second sketch after this list).
  4. Partition your Delta Lake tables by a key column that you frequently filter on. This improves query performance by reducing the amount of data that needs to be scanned.
  5. Use Delta Lake's Z-Ordering feature to physically organize the data in the table by one or more columns. This lets Delta Lake skip entire files or partitions that don't contain the relevant data (items 4 and 5 are combined in the third sketch after this list).
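For item 2, a minimal PySpark sketch of batch loading, assuming a Spark runtime with Delta Lake available; the table and column names are illustrative. Most of the cost of row-by-row inserts is per-commit overhead, which a single batched append avoids:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Gather the incoming records into one DataFrame first...
rows = [(1, "alice", "2023-01-01"), (2, "bob", "2023-01-02")]
df = spark.createDataFrame(rows, ["id", "name", "load_date"])

# ...then append them in a single operation: one Delta commit for the
# whole batch, instead of one commit per inserted row.
df.write.format("delta").mode("append").saveAsTable("target_table")
```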
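For item 3, a sketch of a streaming load, assuming JSON files arrive in a hypothetical landing path; the checkpoint location is what makes the stream restartable with exactly-once delivery into the Delta table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("json")
    .schema("id INT, name STRING, load_date STRING")  # file streams need an explicit schema
    .load("/mnt/landing/source_system/")              # hypothetical landing path
)

(
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/target_table")
    .outputMode("append")
    .toTable("target_table")
)
```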
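For items 4 and 5, a sketch of a partitioned Delta table plus Z-ordering; the partition column (load_date) and Z-order column (id) are illustrative and should match your most common filters:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS target_table (
        id INT,
        name STRING,
        load_date DATE
    )
    USING DELTA
    PARTITIONED BY (load_date)
""")

# Z-ordering clusters the data files by the given column(s) so that
# queries filtering on them can skip files entirely.
spark.sql("OPTIMIZE target_table ZORDER BY (id)")
```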

As for the Simba ODBC driver, it's possible that there is an option to leverage bulk loading, but you would need to consult the documentation or contact the vendor to find out. However, even if there is an option to use bulk loading, it may not be as optimized as the native Delta Lake connector or the bulk insert API.
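If you have to stay on ODBC, one driver-agnostic improvement is to send the whole batch as a single multi-row INSERT, so Delta Lake performs one commit instead of one per row. A sketch using pyodbc, assuming an existing DSN named "Databricks" and an illustrative table; whether the driver additionally offers a true bulk or array-load mode is a question for its documentation:

```python
import pyodbc

conn = pyodbc.connect("DSN=Databricks", autocommit=True)
cursor = conn.cursor()

rows = [(1, "alice", "2023-01-01"), (2, "bob", "2023-01-02")]

# Build "INSERT ... VALUES (?, ?, ?), (?, ?, ?), ..." for the batch and
# bind all parameters in one call: a single statement, a single commit.
placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
params = [value for row in rows for value in row]
cursor.execute(
    f"INSERT INTO target_table (id, name, load_date) VALUES {placeholders}",
    params,
)

cursor.close()
conn.close()
```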

In general, if you're looking to perform bulk data migration or daily data synchronization from a source system to Delta Lake, it's recommended to use a tool that is optimized for that use case, such as Apache NiFi or Apache Airflow. These tools can handle large volumes of data and provide mechanisms for efficient and reliable data transfer.
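For the daily-synchronization case specifically, the usual Delta Lake pattern is to land the day's changed rows in a staging table and apply them with a single MERGE (upsert). A sketch, with hypothetical table names daily_changes and target_table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Upsert today's extract into the target in one transaction:
# matching keys are updated, new keys are inserted.
spark.sql("""
    MERGE INTO target_table AS t
    USING daily_changes AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```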
