Java heap issue, GC allocation failure while writing data from MySQL to ADLS
09-18-2024 01:52 PM
Hi Team,
I am reading 60-80 million rows from a MySQL server and writing them to ADLS in Parquet format, but I am getting a Java heap issue, GC allocation failure, and out-of-memory errors.
Below is my cluster configuration:
Driver - 56 GB RAM, 16 cores
Worker - 56 GB RAM, 16 cores
Autoscaling enabled with a minimum of 4 workers and a maximum of 8 workers
Could you please help me resolve the issue?
After reading the data from the MySQL server, df.count() returns a result, but df.write fails with the errors mentioned above.
I have tried df.repartition() with values from 128 to 1024 with no luck, and I also tried salting, but that did not work for df.write.parquet either (a sketch of the attempt is below).
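For reference, a minimal sketch of the failing attempt, assuming df is the DataFrame already read from MySQL; the partition count and output path are placeholders, not the values actually used:

# Hypothetical reconstruction of the attempt described above; not the fix.
(df.repartition(512)          # tried values between 128 and 1024
   .write
   .mode("overwrite")
   .format("parquet")
   .save("/mnt/<adls-mount>/<target-path>"))   # placeholder output path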
09-18-2024 01:54 PM
Team, please help me resolve this problem.
09-18-2024 02:33 PM
Every time, I am seeing data behaviour like this.
09-19-2024 01:17 AM - edited 09-19-2024 01:17 AM
How do you read and write the records? Which cluster size do you use?
09-19-2024 02:18 AM
Hi,
Below is the code:

driver = "com.mysql.cj.jdbc.Driver"
database_host = "ip address"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"

url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (spark.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("query", "select * from database.table where deleted = 0")
    .option("user", user)
    .option("password", password)
    .option("maxRowsInMemory", 5000000)
    .load())

remote_table1.write.format("parquet").partitionBy("name").save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19")

Cluster config:
worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max workers 12
driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)
09-19-2024 02:36 AM
Multiple things:
- First, the very obvious one: "5000000". It's not surprising that you run out of memory when loading such a huge number of records into memory at once.
- DBR has a built-in MySQL driver; use it instead.
- Use fetchSize to control the number of records fetched per round trip. Start with e.g. 1000 and measure the performance; adapt it if needed.
- If the source table has a partition column, use it as described here; see the sketch after this list.
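A minimal sketch of such a partitioned JDBC read, reusing the url, user, and password variables and the notebook's spark session from the earlier snippet. The column name id, the bounds query, the partition count, and the target path are illustrative assumptions, not values from this thread:

# Look up the split bounds first. lowerBound/upperBound only control how the
# partitions are split; they do not filter any rows.
bounds = (spark.read
    .format("jdbc")
    .option("url", url)
    .option("user", user)
    .option("password", password)
    .option("query", "select min(id) as min_id, max(id) as max_id from database.table where deleted = 0")
    .load()
    .first())

df = (spark.read
    .format("jdbc")
    .option("url", url)
    .option("user", user)
    .option("password", password)
    # "query" cannot be combined with partitionColumn, so wrap the filter in a subquery via dbtable
    .option("dbtable", "(select * from database.table where deleted = 0) src")
    .option("partitionColumn", "id")              # assumed numeric, roughly evenly distributed column
    .option("lowerBound", str(bounds["min_id"]))
    .option("upperBound", str(bounds["max_id"]))
    .option("numPartitions", 64)                  # parallel JDBC reads; tune to the cluster
    .option("fetchSize", "10000")                 # rows per round trip, as suggested above
    .load())

df.write.format("parquet").mode("overwrite").save("/mnt/<target-path>")  # placeholder path

With partitionColumn and numPartitions, each task reads only its own id range instead of pulling everything through a single connection, which should also avoid the single-task memory pressure during the write.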
09-19-2024 02:59 AM
Thanks for the reply.
Let me try the approaches you mentioned and check the performance. I will update you shortly.
09-19-2024 07:03 AM
Hi @Witold
Now I am able to read the data, but one issue I am seeing is that out of 8 executors, 3 succeed in just 2-3 seconds while the other 5 keep running. Why this behaviour?
Below is the code.
09-19-2024 03:53 AM
Hi @Witold
After trying
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)
the job is taking a lot of time and has not even read 10 million records.
09-19-2024 07:18 AM
Hello good man

