09-18-2024 01:52 PM
Hi Team,
I am reading 60-80 million records from a MySQL server and writing them to ADLS in Parquet format, but I am hitting Java heap space errors, GC allocation failures, and out-of-memory errors.
Below is my cluster configuration:
Driver - 56 GB RAM, 16 cores
Worker - 56 GB RAM, 16 cores
Autoscaling enabled with a minimum of 4 workers and a maximum of 8 workers
Could you please help resolve the issue?
After reading the data from the MySQL server, df.count() returns a result, but df.write fails with the errors mentioned above. I have tried df.repartition() with values from 128 to 1024 with no luck, and I also tried salting, but it did not work for df.write.parquet.
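[Editor's note: a count() that succeeds while write fails often points to the whole table being pulled through a single JDBC partition, so repartitioning after the read does not help the read itself. Spark's JDBC source can split the read across executors if given a numeric partition column and bounds. A minimal sketch; the column name `id`, the bounds, and the partition counts are illustrative assumptions, not values from this thread:]

```python
# Options for a parallel JDBC read. "id", the bounds, numPartitions and
# fetchsize are illustrative assumptions, not values from the original post.
jdbc_read_options = {
    "url": "jdbc:mysql://<host>:3306/<database>?zeroDateTimeBehavior=CONVERT_TO_NULL",
    "dbtable": "(select * from database.table where deleted = 0) t",
    "user": "<user>",
    "password": "<password>",
    "partitionColumn": "id",   # numeric, ideally indexed column
    "lowerBound": "1",
    "upperBound": "80000000",
    "numPartitions": "64",     # 64 concurrent range queries
    "fetchsize": "10000",      # rows per network round trip, keeps heap small
}
# df = spark.read.format("jdbc").options(**jdbc_read_options).load()
```

With a partitioned read, each executor streams only its own id range, so no single JVM has to hold the full result set in memory.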
09-18-2024 01:54 PM
Team, please help me resolve this problem.
09-18-2024 02:33 PM
I am seeing this data behaviour every time.
09-19-2024 01:17 AM - edited 09-19-2024 01:17 AM
How do you read and write the records? Which cluster size do you use?
09-19-2024 02:18 AM
Hi,
Below is the code:
driver = "com.mysql.cj.jdbc.Driver"
database_host = "ip address"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"

url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (
    spark.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("query", "select * from database.table where deleted = 0")
    .option("user", user)
    .option("password", password)
    .option("maxRowsInMemory", 5000000)
    .load()
)

remote_table1.write.format("parquet").partitionBy("name").save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19")
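[Editor's note: the write step is worth checking too. partitionBy("name") can explode into a huge number of small files when the column has high cardinality, because each input partition may emit one file per distinct value it holds. A back-of-the-envelope sketch; both numbers below are made up, not from the post:]

```python
# Worst-case output file count for df.write.partitionBy(col):
# roughly (input partitions) x (distinct partition values per input partition).
# Both figures are illustrative assumptions, not values from the thread.
input_partitions = 1024      # e.g. after df.repartition(1024)
distinct_names = 5_000       # assumed cardinality of the "name" column
worst_case_files = input_partitions * distinct_names
print(worst_case_files)      # 5120000
```

Calling df.repartition("name") immediately before write.partitionBy("name") collapses this toward roughly one file per key, which also eases memory pressure on the writers.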
cluster config -
worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max workers 12
driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)
09-19-2024 02:36 AM
Multiple things.
09-19-2024 02:59 AM
Thanks for the reply.
Let me try the approaches you mentioned and see how the performance changes. I will update you shortly.
09-19-2024 07:03 AM
Hi @Witold
Now I am able to read the data, but one issue I am seeing: out of 8 executors, 3 succeed in just 2-3 seconds while the remaining 5 keep running. Why this behaviour?
Below is the code:
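[Editor's note: on the 3-fast/5-slow pattern: with a range-partitioned JDBC read, executors whose id range contains no rows finish almost instantly. A toy simulation with invented numbers (a table whose ids actually start at 30M while the declared bounds start at 0) shows how 3 of 8 partitions can end up empty:]

```python
def stride_bounds(lower, upper, n):
    """Simplified sketch of how Spark splits [lower, upper) into n JDBC strides.
    Illustrative only, not Spark's exact internal algorithm."""
    stride = (upper - lower) // n
    return [
        (lower + i * stride, upper if i == n - 1 else lower + (i + 1) * stride)
        for i in range(n)
    ]

# Invented example: ids present run 30M..80M, but bounds were declared 0..80M.
ids_lo, ids_hi = 30_000_000, 80_000_000
overlaps = [
    max(0, min(hi, ids_hi) - max(lo, ids_lo))
    for lo, hi in stride_bounds(0, 80_000_000, 8)
]
print(overlaps.count(0))  # 3 partitions contain no rows -> 3 tasks finish in seconds
```

If this is the cause, setting lowerBound/upperBound from the table's actual min(id)/max(id) rebalances the strides.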
09-19-2024 03:53 AM
Hi @Witold
After trying
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
 .option("fetchSize", "1000")  # tried values from 1000 up to 50000
.load()
)
The job is taking a long time and has not even read 10 million records.
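[Editor's note: a likely explanation is that fetchSize only controls how many rows come back per network round trip; without partitionColumn/numPartitions the whole query still runs over a single connection in one task. The round-trip arithmetic alone (80M rows is the figure from this thread; the fetch sizes are the ones tried above) shows why small fetch sizes crawl:]

```python
# Round trips needed to stream 80M rows over one JDBC connection.
# 80M and the fetch sizes come from the thread; the single-connection
# assumption holds when no partitionColumn is set.
rows = 80_000_000
for fetch_size in (1_000, 50_000):
    print(fetch_size, rows // fetch_size)
# 1000 -> 80000 round trips; 50000 -> 1600 round trips
```

Note also that, as far as I know, MySQL Connector/J may ignore fetchSize unless useCursorFetch=true is added to the JDBC URL, so it is worth verifying the setting actually takes effect. Combining a cursor fetch with a partitioned read (partitionColumn/lowerBound/upperBound/numPartitions) addresses both throughput and heap pressure.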
09-19-2024 07:18 AM
Hello good man