To enable parallel read and write operations, Python's ThreadPool can be leveraged. The process involves specifying the list of tables to be read, creating a method that reads each table from the JDBC source and saves it in Delta format, and then using the ThreadPool to run that method for all tables in parallel.
- Prepare the List of Tables for Reading: Before proceeding with parallel read and write operations, assemble the list of tables to be read. A hard-coded list looks like this (a sketch for deriving it dynamically follows below):
iterationList = ["table1", "table2", "table3"]
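If the table names are not known ahead of time, the list can also be built from the source database's metadata. The following is a minimal sketch, assuming the source exposes an information_schema.tables view and that jdbc_url holds your connection string (both are assumptions, not part of the original example):

# Hypothetical: derive the table list from the source's metadata.
# jdbc_url and the 'public' schema are placeholders; adjust for your source.
tables_df = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", "SELECT table_name FROM information_schema.tables "
                     "WHERE table_schema = 'public'")
    .load())
iterationList = [row.table_name for row in tables_df.collect()]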
- Define the Transformation Method: Next, a method named "transformation" is implemented to read a single table from the JDBC source and save it in Delta format. The JDBC connection string (jdbc_url below) is assumed to be defined elsewhere for your source. The method is structured as follows (a defensive variant follows it):
def transformation(table):
    print(f"Thread for: {table} starts")
    # jdbc_url is a placeholder for your JDBC connection string
    df = spark.read.format("jdbc").option("url", jdbc_url).option("dbtable", table).load()
    df.write.format("delta").saveAsTable(table)  # save as a Delta table with the same name
    print(f"Thread for: {table} completed")
- Initialize the ThreadPool and Run in Parallel: To achieve parallel processing, initialize a ThreadPool with an appropriate number of threads; this example uses 3, which caps how many tables are copied concurrently. The pool's "map" function then executes the "transformation" method in parallel for each table in "iterationList":
from multiprocessing.pool import ThreadPool

# Initialize a ThreadPool with 3 worker threads
pool = ThreadPool(3)

# Execute the transformation method in parallel for each table in iterationList;
# map blocks until every table has been processed
pool.map(transformation, iterationList)
pool.close()
pool.join()
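Alternatively, the standard library's concurrent.futures module provides an equivalent interface with per-table error reporting; a minimal sketch under the same assumptions:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit one task per table and report each as it completes
    futures = {executor.submit(transformation, t): t for t in iterationList}
    for future in as_completed(futures):
        table = futures[future]
        try:
            future.result()  # re-raises any exception from that worker
        except Exception as e:
            print(f"Table {table} failed: {e}")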