09-18-2024 01:52 PM
Hi Team,
I am reading 60-80 million rows from a MySQL server and writing them to ADLS in Parquet format, but I am getting Java heap space errors, GC allocation failures, and out-of-memory errors.
Below is my cluster configuration:
Driver - 56 GB RAM, 16 cores
Worker - 56 GB RAM, 16 cores
Autoscaling enabled with min 4 workers, max 8 workers
Could you please help me resolve the issue?
After reading the data from the MySQL server, df.count() gives me a result, but df.write fails with the errors mentioned above.
I have tried df.repartition() with values from 128 to 1024 with no luck, and I also tried salting, but that did not work for df.write.parquet either.
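For reference, the write is shaped roughly like the sketch below; the repartition count, partition column, and output path are placeholders, not my exact values.

# Rough shape of the failing write; repartition values between 128 and 1024
# were tried, and the column/path below are placeholders.
(
    df.repartition(512)
      .write
      .format("parquet")
      .partitionBy("name")
      .save("/mnt/<output-path>")
)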
09-18-2024 01:54 PM
Team, please help me resolve this problem.
09-18-2024 02:33 PM
Every time I am seeing the data behave like this.
09-19-2024 01:17 AM - edited 09-19-2024 01:17 AM
How do you read and write the records? Which cluster size do you use?
09-19-2024 02:18 AM
Hi,
Below is the code:
driver = "com.mysql.cj.jdbc.Driver"
database_host = "ip address"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"

url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (
    spark.read
      .format("jdbc")
      .option("driver", driver)
      .option("url", url)
      .option("query", "select * from database.table where deleted = 0")
      .option("user", user)
      .option("password", password)
      .option("maxRowsInMemory", 5000000)
      .load()
)

remote_table1.write.format("parquet").partitionBy("name").save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19")
Cluster config:
Worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max workers 12
Driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)
09-19-2024 02:36 AM
Multiple things.
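One common approach, for example, is to partition the JDBC read so the rows are pulled over several parallel connections instead of a single one. A rough sketch only; the partition column, bounds, and partition count below are placeholders you would adjust to your table:

# Sketch of a partitioned JDBC read; "id", the bounds, and numPartitions are
# placeholders for a reasonably uniform numeric key in the table.
df = (
    spark.read
      .format("jdbc")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("url", url)
      .option("dbtable", "(select * from database.table where deleted = 0) t")
      .option("user", user)
      .option("password", password)
      .option("partitionColumn", "id")    # numeric or date column to split on
      .option("lowerBound", "1")
      .option("upperBound", "80000000")
      .option("numPartitions", "64")      # number of parallel read tasks/connections
      .option("fetchsize", "10000")       # rows fetched per round trip
      .load()
)

Each of the numPartitions tasks then issues its own range query, so no single executor has to hold the whole table in memory.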
09-19-2024 02:59 AM
Thanks for the reply.
Let me try the approaches you mentioned and see how the performance looks. I will update you shortly.
09-19-2024 07:03 AM
Hi @Witold
Now I am able to read the data, but one issue I am seeing is that out of 8 executors, 3 succeed in just 2-3 seconds and the remaining 5 keep running. Why this behaviour?
Below is the code:
09-19-2024 03:53 AM
Hi @Witold
After trying:
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)
the job is taking a lot of time and has not even read 10 million records.
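A variant I still need to try is to look up the key range first and then split the read across partitions, something like the untested sketch below (the numeric id column is an assumption and may not match this table):

# Untested sketch: derive the id bounds first, then read in parallel ranges.
# The "id" column name, numPartitions, and fetchSize values are assumptions.
bounds = (
    spark.read.format("jdbc")
      .option("url", "<jdbc-url>")
      .option("query", "select min(id) as lo, max(id) as hi from <table-name>")
      .option("user", "<username>")
      .option("password", "<password>")
      .load()
      .collect()[0]
)

table = (
    spark.read.format("jdbc")
      .option("url", "<jdbc-url>")
      .option("dbtable", "<table-name>")
      .option("user", "<username>")
      .option("password", "<password>")
      .option("partitionColumn", "id")
      .option("lowerBound", str(bounds["lo"]))
      .option("upperBound", str(bounds["hi"]))
      .option("numPartitions", "64")
      .option("fetchSize", "10000")
      .load()
)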