Hi,
Here is a list of the likely causes and some steps to remediate them.
1. Table-Specific Data and File Layout Issues
- Small file problem: If the two problematic tables consist of many very small underlying Parquet files, Databricks may spend most of the execution time opening and closing files, leading to latency that is not observable in preview mode.
- Non-Delta/Parquet format: Unity Catalog documentation confirms that non-Delta tables often suffer from slow partition discovery; simple queries like count(*) take significantly longer due to metastore and partition lookups.
- Partitioning scheme: If these tables are heavily or inefficiently partitioned (or use partition columns with high cardinality or many small partitions), scan times increase dramatically. A quick way to check all three points is sketched below.
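As a first diagnostic (a minimal sketch; the three-level table name below is a placeholder for your own), DESCRIBE DETAIL reports the table format, file layout, and partition columns in one row:

```sql
-- Sketch: inspect format and file layout of a suspect table.
-- 'main.sales.slow_table' is a placeholder name.
-- For Delta tables the result includes numFiles, sizeInBytes, and
-- partitionColumns; a small-file problem shows up as a large numFiles
-- with sizeInBytes / numFiles well below ~16 MB. For non-Delta tables,
-- the format column alone confirms the second point above.
DESCRIBE DETAIL main.sales.slow_table;
```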
2. Connector/Networking Misconfiguration
- Dataflow tools often rely on optimized connectors and correct network permissions. Errors like "connection reset" or "connection timeout", or firewall rules blocking VNET access, can cause timeouts at runtime but may not affect the lightweight preview fetch.
3. Access Controls and Storage Account Settings
- If underlying ADLS Gen2 (Azure Data Lake Storage Gen2) permissions, hierarchical namespace, or managed identity setups differ between tables/schemas, data access retries may cause long delays. The preview typically samples small chunks and doesn't trigger full scans, so permission or path issues may only appear on larger reads.
4. Databricks Runtime or Cluster Configuration
- Outdated Databricks Runtime versions, insufficient compute, or missing Photon acceleration can impede processing, especially for queries that touch many files or require shuffling data.
5. Table Statistics and Query Plan Optimization
- If statistics on the two tables are missing or stale, the Databricks optimizer may not generate an efficient query plan. Missing statistics particularly hurt read performance on larger tables; a quick way to check is sketched below.
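To verify whether statistics have been computed (a sketch; the table name is a placeholder), the extended description includes a Statistics row once ANALYZE TABLE has been run:

```sql
-- Sketch: check for existing table statistics.
-- 'main.sales.slow_table' is a placeholder name.
-- Look for a "Statistics" row (size and row-count estimates) in the
-- detailed table information; its absence suggests ANALYZE has never run.
DESCRIBE TABLE EXTENDED main.sales.slow_table;
```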
6. Metadata/Partition Discovery Overhead
- When using non-Delta tables or tables with legacy formats, Unity Catalog incurs significant overhead fetching partition information and metadata, which can cripple runtime performance on partitioned tables.
Targeted Solutions and Recommendations
A. Convert Tables to Delta Format and Optimize File Sizes
- Convert the two tables to Delta format if they're not already; Delta offers substantial query speed-ups by limiting metadata scans and supporting auto-compaction.
- Run the OPTIMIZE command regularly on Delta tables to merge small files, and use ZORDER BY on frequently filtered or joined columns. A sketch of both steps follows.
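A minimal sketch of the conversion and compaction steps, assuming the slow table is currently a Parquet table registered in the catalog (all table and column names are placeholders):

```sql
-- Sketch; names are placeholders. Path-based conversions of
-- partitioned Parquet data also need a PARTITIONED BY clause.
CONVERT TO DELTA main.sales.slow_table;

-- Compact small files and co-locate rows on commonly filtered columns.
OPTIMIZE main.sales.slow_table
ZORDER BY (customer_id, event_date);
```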
B. Review Partitioning and File Layout
- Check the number, size, and distribution of files underlying these tables; ideally, files should be between 16 MB and 1 GB. A sketch for inspecting and fixing the layout follows this list.
- Review partition columns: use low-cardinality columns, and avoid partitioning tables smaller than roughly 1 TB, per Databricks guidance.
- If partition count is very high, consider repartitioning with more appropriate column(s).
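A sketch, with placeholder table and column names, for enumerating partitions and rebuilding an over-partitioned table with a coarser scheme:

```sql
-- Sketch; names are placeholders.
-- A very long partition list points to over-partitioning or a
-- high-cardinality partition column.
SHOW PARTITIONS main.sales.slow_table;

-- Rebuild under a coarser partitioning scheme (or none at all for
-- tables well under 1 TB).
CREATE OR REPLACE TABLE main.sales.slow_table_repartitioned
PARTITIONED BY (event_date)
AS SELECT * FROM main.sales.slow_table;
```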
C. Update Table Statistics
- Run ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS after any large table update to aid query planning and reduce scan times (full syntax sketched below).
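For example (a sketch; table and column names are placeholders):

```sql
-- Sketch; names are placeholders.
ANALYZE TABLE main.sales.slow_table COMPUTE STATISTICS FOR ALL COLUMNS;

-- On very wide tables, restrict to the columns used in filters and joins.
ANALYZE TABLE main.sales.slow_table
COMPUTE STATISTICS FOR COLUMNS customer_id, event_date;
```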
D. Check Dataflow and Connector Configuration
- Ensure your Dataflow Gen2 uses the most efficient connector for Databricks (preferably with native Delta support).
- Confirm networking, firewall, and VNET configurations allow for rapid access to all tables (including paths, storage credentials, and managed identity assignments). Preview requests may not exercise all network paths or permissions.
E. Upgrade Databricks Runtime and Use Photon
- Use the latest Databricks Runtime for all jobs/clusters. Photon acceleration significantly boosts scan performance for SQL read and join operations.
F. Compare Schema, Partitioning, and File Layout with "Working" Tables
- Investigate the six working tables versus the two slow ones. Differences in file counts, partition schemes, table format (Delta vs. Parquet), or statistics can reveal root causes; a comparison sketch follows.
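A simple way to compare (a sketch; both table names are placeholders) is to run the same detail query against a fast table and a slow one and diff the key fields:

```sql
-- Sketch; names are placeholders. Compare format, numFiles,
-- sizeInBytes, and partitionColumns between the two results.
DESCRIBE DETAIL main.sales.fast_table;
DESCRIBE DETAIL main.sales.slow_table;
```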