topic Re: Java heap issue, GC allocation failure while writing data from mysql to adls in Data Engineering

Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Wed, 18 Sep 2024 20:52:23 GMT

Hi Team,

I am reading 60 million -80million data from mysql server and writing into ADLS in parquet format but i am getting java heap issue, GC allocation failure and out of memory issue.

below are my cluster configuration

Driver - 56GB Ram, 16 core

Worker - 56GB Ram, 16 core

autos-calling enabled with min 4 worker to max 8 worker

could you please help to resolve the issue ?

after reading the data from mysql server

df.count() is giving me the result but df.write is failing with above mentioned error

i have tried with df.repartition() from 128 to 1024 but no luck also tried salting but dit now work for df.write.parquet

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Wed, 18 Sep 2024 20:54:58 GMT

team help me to resolve the problem

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Wed, 18 Sep 2024 21:33:55 GMT

every time I am seeing data behaviour like this

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

Witold — Thu, 19 Sep 2024 08:17:28 GMT

How do you read and write the records? Which cluster size do you use?

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Thu, 19 Sep 2024 09:18:46 GMT

Hi,

below is the code

driver=“com.mysql.cj. jdbc.Driver'
database_host =“ip address"
database_port=“3306"
database_name =“database"
table =“table"
user ="user”
password =“0password"
url = f"jdbc:mysq]://(database_host): (database_port)/(database_name)?zeroDateTimeBehavior=CONVERT_TO_NULL”
remote_ tablel=
spark.read
format ("idb")
.option ("driver", driver)
.option ("url", url)
.option ("query", "select * from database.table where deleted =0")
.option ("user", user)
.option ("password", password)
.option ("maxRows InMemory",
5000000)

remote_table1.write.format ("parquet"). partitionBy("name") .save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19”)

cluster config-

worker type - Standard_DS13_v2(56 ram 8 cores) min worker 8 max 12

driver type - Standard_DS13_v2(56 ram 8 cores) min worker 8 max 12

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

Witold — Thu, 19 Sep 2024 09:36:47 GMT

Multiple things.

First very obvious thing: "5000000" - It's not surprising that you run OOM when loading such a huge amount of records.
The DBR has a built-in MySQL driver, use it instead
Use fetchSize to control the number of records. Start e.g. with 1000 and measure the performance. Adapt it if needed
If the source table has a partition column, use it as described here.

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Thu, 19 Sep 2024 09:59:19 GMT

thanks for the reply

let me try the approaches which you mentioned and see the performance. will update you shortly.

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Thu, 19 Sep 2024 10:53:55 GMT

Hi @Witold

After trying
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)

job is taking lots of time and not even reading 10 million records.

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

sshukla — Thu, 19 Sep 2024 14:03:37 GMT

Hi @Witold
Now i am able to read the data but one issue i am seeing is that out of 8 executor 3 are getting success in just 2-3 sec and rest 5 are running why this behavior.

below is the code

remote_table1 = (spark.read

.format("jdbc")

.option("url", url)

.option("dbtable","(select * from db.table) temp")

.option("user", user)

.option("password", password)

.option("fetchSize", "50000")

.option("lowerBound", "1")

.option("upperBound", "12")

.option("partitionColumn", "month")

.option("numPartitions", 8)

.load()

)

also please find snapshot of execution

Re: Java heap issue, GC allocation failure while writing data from mysql to adls

shaza606 — Thu, 19 Sep 2024 14:18:06 GMT

Hello good man