4 weeks ago
Hi Team,
I am reading 60-80 million records from a MySQL server and writing them to ADLS in Parquet format, but I am getting Java heap issues, GC allocation failures, and out-of-memory errors.
Below is my cluster configuration:
Driver - 56 GB RAM, 16 cores
Worker - 56 GB RAM, 16 cores
Autoscaling enabled, with a minimum of 4 workers and a maximum of 8 workers
Could you please help resolve the issue?
After reading the data from the MySQL server, df.count() gives me a result, but df.write fails with the above-mentioned errors.
I have tried df.repartition() with values from 128 to 1024 but had no luck, and I also tried salting, but it did not work for df.write.parquet.
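One pattern that often eases write-side memory pressure is to repartition on the same column that the write is partitioned by, so each task only keeps writers open for a handful of output partitions. This is only a hedged sketch; the column name "name", the maxRecordsPerFile value, and the output path are placeholders rather than details from this post:

# Repartition on the partitionBy column so each task writes to few output
# partitions instead of opening a Parquet writer per partition value.
(df.repartition("name")
   .write
   .format("parquet")
   .partitionBy("name")
   .option("maxRecordsPerFile", 1000000)  # cap rows per output file
   .save("abfss://<container>@<account>.dfs.core.windows.net/<path>"))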
4 weeks ago
Team, please help me resolve the problem.
4 weeks ago
Every time I am seeing data behaviour like this.
4 weeks ago - last edited 4 weeks ago
How do you read and write the records? Which cluster size do you use?
4 weeks ago
Hi,
Below is the code:
driver = "com.mysql.cj.jdbc.Driver"
database_host = "<ip address>"
database_port = "3306"
database_name = "database"
table = "table"
user = "user"
password = "0password"
url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}?zeroDateTimeBehavior=CONVERT_TO_NULL"

remote_table1 = (
    spark.read
        .format("jdbc")
        .option("driver", driver)
        .option("url", url)
        .option("query", "select * from database.table where deleted = 0")
        .option("user", user)
        .option("password", password)
        .option("maxRowsInMemory", 5000000)  # note: not a documented Spark JDBC option
        .load()
)

remote_table1.write.format("parquet").partitionBy("name").save("/mt/sm/process/replica/datalake/myuday/datalake/2024/09/19")
Cluster config:
Worker type - Standard_DS13_v2 (56 GB RAM, 8 cores), min workers 8, max workers 12
Driver type - Standard_DS13_v2 (56 GB RAM, 8 cores)
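For reference, a single-query JDBC read like the one above comes through one connection and one Spark task, which by itself can explain heap and GC pressure at this scale. Below is a minimal sketch of a parallel read using Spark's built-in JDBC partitioning options; note that partitionColumn requires dbtable rather than query, so the filter moves into a subquery, and the column id, the bounds, and the partition count are assumed values, not details from the actual table:

# Sketch of a partitioned JDBC read: Spark opens numPartitions connections,
# each reading one slice of the id range, so no single task holds the full table.
# "id", the bounds, and numPartitions are assumptions for illustration.
remote_table1 = (
    spark.read
        .format("jdbc")
        .option("driver", driver)
        .option("url", url)
        .option("dbtable", "(select * from database.table where deleted = 0) AS t")
        .option("user", user)
        .option("password", password)
        .option("partitionColumn", "id")   # must be a numeric, date, or timestamp column
        .option("lowerBound", "1")
        .option("upperBound", "80000000")
        .option("numPartitions", "64")
        .option("fetchsize", "10000")      # rows fetched per round trip on each connection
        .load()
)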
4 weeks ago
Multiple things.
4 weeks ago
Thanks for the reply.
Let me try the approaches you mentioned and see how the performance looks. I will update you shortly.
4 weeks ago
Hi @Witold
Now I am able to read the data, but one issue I am seeing is that out of 8 executors, 3 succeed in just 2-3 seconds and the remaining 5 keep running. Why this behaviour?
Below is the code.
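That pattern, where a few tasks finish in seconds while the rest keep running, usually means the partitions are uneven (for example, when lowerBound/upperBound don't match the real range of the partition column). A quick generic way to check is to count rows per partition; this is a minimal diagnostic sketch rather than the code from the post above, and it assumes the DataFrame is called df:

from pyspark.sql.functions import spark_partition_id

# Rows per Spark partition: a heavily skewed distribution here explains why
# a few tasks finish immediately while the others do all the work.
df.groupBy(spark_partition_id().alias("partition")).count().orderBy("partition").show(100)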
4 weeks ago
Hi @Witold
After trying
table = (spark.read
.format("jdbc")
.option("url", "<jdbc-url>")
.option("dbtable", "<table-name>")
.option("user", "<username>")
.option("password", "<password>")
.option("fetchSize", "1000") -- to 50000
.load()
)
The job is taking a long time and has not even read 10 million records.
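Worth noting: fetchSize only controls how many rows each round trip pulls over a single JDBC connection; on its own it does not parallelize the read. It usually needs to be combined with the partitioning options, roughly like the sketch below, where the partition column and bounds are assumed placeholders:

# Parallel JDBC read combined with fetchsize; placeholder values in <> are assumptions.
table = (spark.read
    .format("jdbc")
    .option("url", "<jdbc-url>")
    .option("dbtable", "<table-name>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("partitionColumn", "<numeric-column>")  # e.g. a numeric primary key
    .option("lowerBound", "<min-value>")
    .option("upperBound", "<max-value>")
    .option("numPartitions", "64")
    .option("fetchsize", "10000")
    .load()
)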