09-18-2024 01:52 PM
Hi Team,
I am reading 60-80 million rows from a MySQL server and writing them to ADLS in Parquet format, but I am getting a Java heap issue, GC allocation failures, and out-of-memory errors.
Below is my cluster configuration:
Driver - 56 GB RAM, 16 cores
Worker - 56 GB RAM, 16 cores
Autoscaling enabled, min 4 workers to max 8 workers
Could you please help resolve the issue?
After reading the data from the MySQL server,
df.count() gives me a result, but df.write fails with the errors mentioned above.
I have tried df.repartition() with values from 128 to 1024 but no luck; I also tried salting, but it did not work for df.write.parquet.
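For context: a JDBC read with only a "query" option pulls every row through a single connection and a single Spark task, which is usually what exhausts the heap at 60-80 million rows. Below is a minimal sketch of a partitioned read; the column "id" and the bounds are placeholders, not values from this thread, and should be replaced with a real numeric, roughly evenly distributed key and its actual min/max.

df = (spark.read
      .format("jdbc")
      .option("url", url)                                # same JDBC URL as in the code later in the thread
      .option("dbtable", "(select * from database.table where deleted = 0) t")  # subquery must be aliased
      .option("user", user)
      .option("password", password)
      .option("partitionColumn", "id")                   # assumed numeric key column
      .option("lowerBound", "1")                         # assumed min(id)
      .option("upperBound", "80000000")                  # assumed max(id)
      .option("numPartitions", "64")                     # 64 parallel JDBC reads instead of one
      .option("fetchsize", "10000")                      # rows fetched per round trip
      .load())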
09-18-2024 01:54 PM
Team, please help me resolve the problem.
09-18-2024 02:33 PM
Every time I see data behaviour like this.
09-19-2024 01:17 AM - edited 09-19-2024 01:17 AM
How do you read and write the records? Which cluster size do you use?
09-19-2024 02:18 AM
Hi,
below is the code
driver = "com.mysql.cj.jdbc.Driver"
database_host = "ip address"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"
url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (spark.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("query", "select * from database.table where deleted = 0")
    .option("user", user)
    .option("password", password)
    .option("maxRowsInMemory", 5000000)
    .load()
)

remote_table1.write.format("parquet").partitionBy("name").save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19")
cluster config -
worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max workers 12
driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)
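On the write side, a hedged sketch assuming the remote_table1 DataFrame and target path above: repartitioning on the same column used in partitionBy means each task writes to fewer partition directories at once (fewer concurrently open Parquet writers), and maxRecordsPerFile keeps individual output files from growing huge.

# assumption: roughly 1 million rows per output file is acceptable
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)

(remote_table1
    .repartition("name")          # co-locate rows of each "name" value before the partitioned write
    .write
    .format("parquet")
    .mode("overwrite")            # assumption: overwriting the target path is acceptable
    .partitionBy("name")
    .save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19"))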
09-19-2024 02:36 AM
Multiple things.
09-19-2024 02:59 AM
Thanks for the reply.
Let me try the approaches you mentioned and see the performance. I will update you shortly.
09-19-2024 07:03 AM
Hi @Witold
Now I am able to read the data, but one issue I am seeing is that out of 8 executors, 3 succeed in just 2-3 seconds while the remaining 5 keep running. Why this behaviour?
Below is the code.
09-19-2024 03:53 AM
Hi @Witold
After trying
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)
The job is taking a lot of time and has not read even 10 million records.
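Note that fetchSize only controls how many rows come back per round trip on each JDBC connection; with just "dbtable" and no partitionColumn/numPartitions, the whole table is still read by a single task, so raising fetchSize alone will not parallelize the read. A small diagnostic sketch (assuming the DataFrame is named table, as in the snippet above) to see how rows are spread across partitions and confirm the skew behind the idle executors:

from pyspark.sql.functions import spark_partition_id

# Count rows per Spark partition; a heavily skewed distribution explains why a
# few executors finish in seconds while the rest keep running.
(table
    .groupBy(spark_partition_id().alias("partition_id"))
    .count()
    .orderBy("partition_id")
    .show(100, truncate=False))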
09-19-2024 07:18 AM
Hello good man