Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Java heap issue, GC allocation failure while writing data from MySQL to ADLS

sshukla
Visitor

Hi Team,

I am reading 60-80 million records from a MySQL server and writing them to ADLS in Parquet format, but I am getting a Java heap issue, GC allocation failure, and out-of-memory errors.

Below is my cluster configuration:

Driver - 56 GB RAM, 16 cores

Worker - 56 GB RAM, 16 cores

Autoscaling enabled, with a minimum of 4 workers and a maximum of 8 workers

Could you please help me resolve the issue?

After reading the data from the MySQL server, df.count() gives me a result, but df.write fails with the above-mentioned errors.

I have tried df.repartition() with values from 128 to 1024 with no luck, and I also tried salting, but it did not work for df.write.parquet.
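
For reference, a minimal sketch of the repartition-and-write pattern described above, assuming df is the DataFrame read from MySQL; the partition count and output path are only illustrative:

# Sketch of the attempted write; df is assumed to be the DataFrame read from MySQL,
# the repartition count (tried between 128 and 1024) and the path are placeholders.
(df
  .repartition(512)
  .write
  .mode("overwrite")
  .parquet("/mnt/<adls-mount>/output/2024/09/19")
)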

9 REPLIES

sshukla
Visitor

Team, please help me resolve the problem.

sshukla
Visitor

[Attachment: IMG_4903.jpeg]

I am seeing data behaviour like this every time.

Witold
Contributor III

How do you read and write the records? Which cluster size do you use?

Hi,

Below is the code:

driver = "com.mysql.cj.jdbc.Driver"
database_host = "ip address"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"
url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (spark.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("query", "select * from database.table where deleted = 0")
  .option("user", user)
  .option("password", password)
  .option("maxRowsInMemory", 5000000)
  .load()
)

remote_table1.write.format("parquet").partitionBy("name").save("/mnt/sm/process/replica/datalake/myuday/datalake/2024/09/19")

 

Cluster config:

Worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max 12

Driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)

Witold
Contributor III

Multiple things:

  1. First, a very obvious thing: "5000000" - it's not surprising that you run OOM when loading such a huge number of records into memory at once.
  2. DBR has a built-in MySQL driver; use it instead.
  3. Use fetchSize to control how many records are fetched per round trip. Start with, e.g., 1000, measure the performance, and adapt it if needed.
  4. If the source table has a partition column, use it as described here (see the sketch after this list).
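
For illustration, a minimal sketch of a partitioned JDBC read along those lines; the fetchSize value, partition column, bounds, and number of partitions are placeholder assumptions that need to be tuned against the actual source table:

# Illustrative sketch only: connection details, the partition column, and the
# bounds are placeholders to be adapted to the real table.
remote_table = (spark.read
  .format("jdbc")
  .option("url", url)                  # e.g. jdbc:mysql://host:3306/db
  .option("dbtable", "db.table")
  .option("user", user)
  .option("password", password)
  .option("fetchSize", "1000")         # rows fetched per network round trip
  .option("partitionColumn", "id")     # numeric column with a known value range
  .option("lowerBound", "1")           # smallest value of that column
  .option("upperBound", "80000000")    # largest value of that column
  .option("numPartitions", 8)          # number of parallel read tasks
  .load()
)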

 

Thanks for the reply.

Let me try the approaches you mentioned and see the performance. I will update you shortly.

Hi @Witold 
Now I am able to read the data, but one issue I am seeing is that out of 8 executors, 3 succeed in just 2-3 seconds while the remaining 5 keep running. Why this behavior?

below is the code

remote_table1 = (spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable","(select * from db.table) temp")
  .option("user", user)
  .option("password", password)
  .option("fetchSize", "50000")
  .option("lowerBound", "1")
  .option("upperBound", "12")
  .option("partitionColumn", "month")
  .option("numPartitions", 8)
  .load()
)
Also, please find a snapshot of the execution attached: [Attachment: Capture2.PNG]

Hi @Witold 

After trying the following:
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)

the job is taking a lot of time and has not even read 10 million records.
[Attachment: Capture.PNG]

shaza606
Visitor

Hello good man 
