Databricks Community

Sudharsan24 · ‎07-23-2024

While ingesting data from Oracle to databricks(writing into ADLS) using jdbc I am getting connection reset by peer error when ingesting a large table which has millions of rows.I am using oracle sql developer and azure databricks.

I tried every way like using partition column (lower and upper bounds), predicates and also incremental loading none of them are working, please help me if anyone knows the solution
jdbc(url=conn, table= pQuery, properties={ "user": user, "password": pwd, "driver": driver, "autoReconnect": "true", #"numPartitions": "50", "numPartitions": "20", "partitionColumn": "PARTITION_KEY", "lowerBound": lbound, "upperBound": ubound, #"fetchSize": "90000000" "fetchSize": "900000" } )

Rishabh-Pandey · ‎07-23-2024

Try using this code .

import pyspark
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("OracleToDatabricks").getOrCreate()

# Oracle connection properties
conn = "jdbc:oracle:thin:@//<host>:<port>/<service_name>"
user = "<username>"
pwd = "<password>"
driver = "oracle.jdbc.driver.OracleDriver"
pQuery = "<table_name>"
lbound = 1
ubound = 1000000
batch_size = 10000

properties = {
    "user": user,
    "password": pwd,
    "driver": driver,
    "autoReconnect": "true",
    "numPartitions": "20",
    "partitionColumn": "PARTITION_KEY",
    "lowerBound": lbound,
    "upperBound": ubound,
    "fetchSize": "10000"
}

for i in range(lbound, ubound, batch_size):
    lower_bound = i
    upper_bound = min(i + batch_size - 1, ubound)
    query = f"(SELECT * FROM {pQuery} WHERE PARTITION_KEY >= {lower_bound} AND PARTITION_KEY <= {upper_bound}) AS TEMP"
    df = spark.read.jdbc(url=conn, table=query, properties=properties)
    # Process and write the data to ADLS
    df.write.mode("append").parquet("path/to/adls")

Rishabh Pandey