cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks job stalls without error, unable to pin-point error, all compute metrics seem ok

a-sky
New Contributor

I have a job that gets stuck on "Determining DBIO File fragment" and I have not been able to figure out why this job keeps getting stuck. I monitor the job cluster metrics through out the job and it doesnt seem like its hitting any bottlenecks with memory or CPU. stderr log doesnt have anything other than below. stdout doesnt seem to have anything other than normal GC messages. Log4j doesnt seem to show anything useful either other than some repeating messages.

As far as the job ontent goes I am making sure to cache data using "CACHE TABLE" command to cache to memory all tables I am using frequently. I also have io caching enabled in cluster spark config.

Cluster config:

asky_0-1721405223209.png

Spark config:

 

 

spark.databricks.delta.schema.autoMerge.enabled true
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.compression.enabled false
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 5g
spark.sql.maxPlanStringLength 0
spark.rpc.message.maxSize 2047

 

 

I've tried various different config options as well as removing all or some of the configs above and it has not helped.

stderr output:

 

 

ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
Thu Jul 18 19:38:02 2024 Connection to spark from PID  1553
Thu Jul 18 19:38:02 2024 Initialized gateway on port 38455
Thu Jul 18 19:38:02 2024 Connected to spark.
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3

 

 

stdout before the job gets stuck:

 

 

2024-07-18T20:04:46.874+0000: [GC (Allocation Failure) [PSYoungGen: 29845498K->2861268K(31418880K)] 37713964K->10814195K(104044032K), 0.4936411 secs] [Times: user=3.23 sys=0.04, real=0.49 secs] 
2024-07-18T20:05:27.814+0000: [GC (Allocation Failure) [PSYoungGen: 29386964K->3123398K(31568384K)] 37339891K->11182996K(104193536K), 0.4979119 secs] [Times: user=3.71 sys=0.06, real=0.50 secs] 
2024-07-18T20:06:59.330+0000: [GC (System.gc()) [PSYoungGen: 6139623K->1638527K(31544320K)] 14199220K->9832312K(104169472K), 0.3589462 secs] [Times: user=3.95 sys=0.09, real=0.36 secs] 
2024-07-18T20:06:59.689+0000: [Full GC (System.gc()) [PSYoungGen: 1638527K->0K(31544320K)] [ParOldGen: 8193785K->5458479K(72625152K)] 9832312K->5458479K(104169472K), [Metaspace: 440360K->439660K(473088K)], 3.7700433 secs] [Times: user=42.07 sys=0.81, real=3.77 secs] 

 

 

 The following messaged start repeating in Log4j when the job gets stuck:

 

 

24/07/18 20:16:57 INFO DataSourceFactory$: DataSource Jdbc URL: jdbc:mariadb://REDACTED?useSSL=true&sslMode=VERIFY_CA&disableSslHostnameVerification=true&trustServerCertificate=false&serverSslCert=/databricks/common/mysql-ssl-ca-cert.crt
24/07/18 20:16:57 INFO HikariDataSource: metastore-monitor - Starting...
24/07/18 20:16:57 INFO HikariDataSource: metastore-monitor - Start completed.
24/07/18 20:16:58 INFO HikariDataSource: metastore-monitor - Shutdown initiated...
24/07/18 20:16:58 INFO HikariDataSource: metastore-monitor - Shutdown completed.
24/07/18 20:16:58 INFO MetastoreMonitor: Metastore healthcheck successful (connection duration = 490 milliseconds)
24/07/18 20:17:13 WARN DynamicSparkConfContextImpl: Ignored update because id 1721331200825 < 1721331200825; source: ConfigFile
24/07/18 20:17:13 INFO DriverCorral: Received SAFEr configs with version 1721331200825
24/07/18 20:17:13 INFO DriverCorral: DBFS health check ok
24/07/18 20:17:13 INFO HiveMetaStore: 23: get_database: default
24/07/18 20:17:13 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_database: default	
24/07/18 20:17:13 INFO DriverCorral: Metastore health check ok
24/07/18 20:18:07 INFO ProgressReporter$: Reporting partial results for running commands: 9078102453619510407_5076078769577508878_job-493768487279814-run-211419086049755-action-505228037042290 [429 occurrences]
24/07/18 20:18:13 WARN DynamicSparkConfContextImpl: Ignored update because id 1721331200825 < 1721331200825; source: ConfigFile
24/07/18 20:18:13 INFO DriverCorral: Received SAFEr configs with version 1721331200825
24/07/18 20:18:28 INFO RocksDBExternalMap: [Native-1] [/db_impl/db_impl.cc:1106] ------- DUMPING STATS -------
24/07/18 20:18:28 INFO RocksDBExternalMap: [Native-1] [/db_impl/db_impl.cc:1107] 
** DB Stats **
Uptime(secs): 2400.0 total, 600.0 interval
Cumulative writes: 8859 writes, 8859 keys, 8859 commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sum      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Blob file count: 0, total size: 0.0 GB, garbage size: 0.0 GB, space amp: 0.0

Uptime(secs): 2400.0 total, 600.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x7f882cb0b3b0#803 capacity: 8.00 MB usage: 0.09 KB table_size: 256 occupancy: 88 collections: 5 last_copies: 0 last_secs: 5.6e-05 secs_since: 0
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)

** File Read Latency Histogram By Level [default] **

 

 

 

Compute metrics

asky_1-1721404695718.pngasky_2-1721404734997.pngasky_3-1721404753865.pngasky_4-1721404764486.png

Please let me know if any additional information is needed.

1 REPLY 1

Kaniz_Fatma
Community Manager
Community Manager

Hi @a-sky, This message indicates that Databricks is figuring out which file fragments are cached, which can be slow, especially with frequent cluster scaling. To address this, you can try disabling delta caching with `spark.conf.set("spark.databricks.io.cache.enabled", "false")`, optimizing your tables with `OPTIMIZE [table_name]`, upgrading to a newer Databricks Runtime (DBR) version, stabilizing your cluster scaling, and reviewing your job configuration. Let me know if you need any more help or if there’s anything else you’d like to discuss!

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group