Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Java heap issue, GC allocation failure while writing data from MySQL to ADLS

sshukla
Visitor

Hi Team,

I am reading 60-80 million records from a MySQL server and writing them to ADLS in Parquet format, but I am getting a Java heap issue, GC allocation failure, and out-of-memory errors.

Below is my cluster configuration:

Driver - 56 GB RAM, 16 cores

Worker - 56 GB RAM, 16 cores

Autoscaling enabled, with a minimum of 4 workers and a maximum of 8 workers

Could you please help me resolve the issue?

After reading the data from the MySQL server, df.count() gives me a result, but df.write fails with the above-mentioned errors.

I have tried df.repartition() with values from 128 to 1024 with no luck, and I also tried salting, but it did not work for df.write.parquet.
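
For reference, a minimal sketch of the repartition-and-write pattern described above, assuming df is the DataFrame read from MySQL; the partition count and output path are only illustrative:

# Sketch of the attempted write; df is assumed to be the DataFrame read from MySQL,
# the repartition count (tried between 128 and 1024) and the path are placeholders.
(df
  .repartition(512)
  .write
  .mode("overwrite")
  .parquet("/mnt/<adls-mount>/output/2024/09/19")
)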

9 REPLIES

sshukla
Visitor

Team, please help me resolve the problem.

sshukla
Visitor

[Attachment: IMG_4903.jpeg]

I am seeing data behaviour like this every time.

Witold
Contributor III

How do you read and write the records? Which cluster size do you use?

Hi,

Below is the code:

driver = "com.mysql.cj.jdbc.Driver"
database_host = "ip address"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"
url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (spark.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("query", "select * from database.table where deleted = 0")
  .option("user", user)
  .option("password", password)
  .option("maxRowsInMemory", 5000000)
  .load()
)

remote_table1.write.format("parquet").partitionBy("name").save("/mnt/sm/process/replica/datalake/myuday/datalake/2024/09/19")

 

Cluster config:

Worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max 12

Driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)

Witold
Contributor III

Multiple things:

  1. First, a very obvious thing: "5000000" - it's not surprising that you run OOM when loading such a huge number of records into memory at once.
  2. DBR has a built-in MySQL driver; use it instead.
  3. Use fetchSize to control how many records are fetched per round trip. Start with, e.g., 1000, measure the performance, and adapt it if needed.
  4. If the source table has a partition column, use it as described here (see the sketch after this list).
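
For illustration, a minimal sketch of a partitioned JDBC read along those lines; the fetchSize value, partition column, bounds, and number of partitions are placeholder assumptions that need to be tuned against the actual source table:

# Illustrative sketch only: connection details, the partition column, and the
# bounds are placeholders to be adapted to the real table.
remote_table = (spark.read
  .format("jdbc")
  .option("url", url)                  # e.g. jdbc:mysql://host:3306/db
  .option("dbtable", "db.table")
  .option("user", user)
  .option("password", password)
  .option("fetchSize", "1000")         # rows fetched per network round trip
  .option("partitionColumn", "id")     # numeric column with a known value range
  .option("lowerBound", "1")           # smallest value of that column
  .option("upperBound", "80000000")    # largest value of that column
  .option("numPartitions", 8)          # number of parallel read tasks
  .load()
)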

 

Thanks for the reply.

Let me try the approaches you mentioned and see the performance. I will update you shortly.

Hi @Witold 
Now I am able to read the data, but one issue I am seeing is that out of 8 executors, 3 succeed in just 2-3 seconds while the remaining 5 keep running. Why this behavior?

below is the code

remote_table1 = (spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable","(select * from db.table) temp")
  .option("user", user)
  .option("password", password)
  .option("fetchSize", "50000")
  .option("lowerBound", "1")
  .option("upperBound", "12")
  .option("partitionColumn", "month")
  .option("numPartitions", 8)
  .load()
)
Also, please find a snapshot of the execution attached: [Attachment: Capture2.PNG]

Hi @Witold 

After trying the following:
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)

the job is taking a lot of time and has not even read 10 million records.
[Attachment: Capture.PNG]

shaza606
Visitor

Hello good man 
