cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Load multiple delta tables at once from Sql server

Data4
New Contributor II

Whatโ€™s the best way to efficiently move multiple sql tables in parallel into delta tables

1 ACCEPTED SOLUTION

Accepted Solutions

Tharun-Kumar
Honored Contributor II
Honored Contributor II

@Data4   

To enable parallel read and write operations, the ThreadPool functionality can be leveraged. This process involves specifying a list of tables that need to be read, creating a method for reading these tables from the JDBC source and saving them in DELTA format, and subsequently employing the ThreadPool for parallel processing.

  • Prepare the List of Tables for Reading: Before proceeding with parallel read and write operations, you must first assemble a list of tables to be read. The list should be defined as follows:

 iterationList = ["table1", "table2", "table3"]
  • In this step, a method named "transformation" will be implemented to handle the reading of tables from the JDBC source and saving them in DELTA format. The transformation method will be structured as follows:
 
def transformation(table): 
print(f"Thread for: {table} starts")
spark.read.format("jdbc").load(table).write.saveAsTable(table)
print(f"Thread for: {table} completed")
  • To achieve parallel processing, the ThreadPool with an appropriate number of threads can be initialized. In this example, we will create a ThreadPool with 3 threads. The "map" function of the ThreadPool is then employed to execute the "transformation" method in parallel for each table in the "iterationList".

from multiprocessing.pool 
import ThreadPool

# Initialize ThreadPool with 3 threads
pool = ThreadPool(3)

# Execute the transformation method in parallel for each table in iterationList
pool.map(transformation, iterationList)
 

View solution in original post

1 REPLY 1

Tharun-Kumar
Honored Contributor II
Honored Contributor II

@Data4   

To enable parallel read and write operations, the ThreadPool functionality can be leveraged. This process involves specifying a list of tables that need to be read, creating a method for reading these tables from the JDBC source and saving them in DELTA format, and subsequently employing the ThreadPool for parallel processing.

  • Prepare the List of Tables for Reading: Before proceeding with parallel read and write operations, you must first assemble a list of tables to be read. The list should be defined as follows:

 iterationList = ["table1", "table2", "table3"]
  • In this step, a method named "transformation" will be implemented to handle the reading of tables from the JDBC source and saving them in DELTA format. The transformation method will be structured as follows:
 
def transformation(table): 
print(f"Thread for: {table} starts")
spark.read.format("jdbc").load(table).write.saveAsTable(table)
print(f"Thread for: {table} completed")
  • To achieve parallel processing, the ThreadPool with an appropriate number of threads can be initialized. In this example, we will create a ThreadPool with 3 threads. The "map" function of the ThreadPool is then employed to execute the "transformation" method in parallel for each table in the "iterationList".

from multiprocessing.pool 
import ThreadPool

# Initialize ThreadPool with 3 threads
pool = ThreadPool(3)

# Execute the transformation method in parallel for each table in iterationList
pool.map(transformation, iterationList)
 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group