Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Dataflow Gen2 Timeout When Loading Databricks Tables

viniciusmartins
Visitor

 

I created a Dataflow Gen2 to get data from Databricks. I can see the preview data very quickly (around 5 seconds). But when I run the dataflow, it takes 8 hours and then cancels with a timeout. I’m trying to get 8 tables with the same schema. Six of them work fine with no problems, but with two of them I’m experiencing the issue I just described. The table sizes are around 50 MB.

What can I do to solve this issue?

4 REPLIES

emma_s
Databricks Employee

Hi,

Here is a list of the likely causes and some steps to remediate them.

1. Table-Specific Data and File Layout Issues

  • Small file problem: If the two problematic tables consist of many very small underlying Parquet files, Databricks may spend most of the execution time opening and closing files, leading to latency that is not observable in preview mode.
  • Non-Delta/Parquet format: Unity Catalog documentation confirms non-Delta tables often suffer from slow partition discovery—simple queries like count(*) take significantly longer due to metastore and partition lookups.
  • Partitioning scheme: If these tables are heavily or inefficiently partitioned—or use partition columns with high cardinality or many small partitions—scan times increase dramatically.

2. Connector/Networking Misconfiguration

  • Dataflow tools often rely on optimized connectors and correct network permissions. Errors like "connection reset," "connection timeout," or firewall rules blocking VNET access can cause timeouts at runtime but may not impact the lightweight preview fetch.

3. Access Controls and Storage Account Settings

  • If underlying ADLS Gen2 (Azure Data Lake Storage Gen2) permissions, hierarchical namespace, or managed identity setups differ between tables/schemas, data access retries may cause long delays. The preview typically samples small chunks and doesn’t trigger full scans, so permission or path issues may only appear on larger reads.

4. Databricks Runtime or Cluster Configuration

  • Outdated Databricks Runtime versions, insufficient compute, or missing Photon acceleration can impede processing, especially for queries involving many files or requiring shuffling data.
     

5. Table Statistics and Query Plan Optimization

  • If statistics on the two tables are missing or stale, the Databricks optimizer may not generate an efficient query plan. Lack of statistics especially impacts read speed for larger tables.

6. Metadata/Partition Discovery Overhead

  • When using non-Delta tables or tables with legacy formats, Unity Catalog incurs significant overhead fetching partition information and metadata, which can cripple runtime performance on partitioned tables.

Targeted Solutions and Recommendations

A. Convert Tables to Delta Format and Optimize File Sizes

  • Convert the two tables to Delta format if they're not already; Delta offers substantial query speed-ups by limiting metadata scanning and supporting auto-compaction.
  • Run the OPTIMIZE command regularly on Delta tables to merge small files, and use ZORDER BY on frequently filtered or joined columns (see the sketch after this list).
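
A minimal sketch of those two steps, assuming a hypothetical slow table named my_catalog.my_schema.defects_slow and a frequently filtered column named station (both names are placeholders for your own):

-- Convert an existing Parquet table to Delta (skip if it is already Delta)
CONVERT TO DELTA my_catalog.my_schema.defects_slow;

-- Compact small files and co-locate rows on a commonly filtered column
OPTIMIZE my_catalog.my_schema.defects_slow
ZORDER BY (station);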

B. Review Partitioning and File Layout

  • Check the number, size, and distribution of files underlying these tables (see the queries after this list). Ideally, files should be between 16 MB and 1 GB.
  • Review partition columns: use low-cardinality columns, and avoid partitioning tables smaller than 1 TB, per Databricks guidance.
  • If the partition count is very high, consider repartitioning on more appropriate column(s).
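
Two quick metadata queries that can surface this, again with a placeholder table name:

-- Reports numFiles, sizeInBytes, partitionColumns, and format for the table
DESCRIBE DETAIL my_catalog.my_schema.defects_slow;

-- For a partitioned table, list the partitions to gauge their count and skew
-- (this statement errors if the table is not partitioned)
SHOW PARTITIONS my_catalog.my_schema.defects_slow;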

C. Update Table Statistics

  • Run ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS after any large table update to aid query planning and reduce scan times.
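
For example, with a placeholder table name:

-- Refresh column-level statistics so the optimizer can build an efficient plan
ANALYZE TABLE my_catalog.my_schema.defects_slow COMPUTE STATISTICS FOR ALL COLUMNS;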

D. Check Dataflow and Connector Configuration

  • Ensure your Dataflow Gen2 uses the most efficient connector for Databricks (preferably with native Delta support).
  • Confirm networking, firewall, and VNET configurations allow for rapid access to all tables (including paths, storage credentials, and managed identity assignments). Preview requests may not exercise all network paths or permissions.

E. Upgrade Databricks Runtime and Use Photon

  • Use the latest Databricks Runtime for all jobs/clusters. Photon acceleration significantly boosts scan performance for SQL read and join operations.

F. Compare Schema, Partitioning, and File Layout with ā€œWorkingā€ Tables

  • Investigate the six working tables versus the two slow ones. Differences in file counts, partition schemes, table format (Delta vs Parquet), or statistics can reveal root causes.
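
One way to do this comparison is to run the same metadata queries against a table that loads fine and against one of the slow ones, then diff the output (table names below are placeholders):

-- File count, size, format, and partition columns for a table that loads fine
DESCRIBE DETAIL my_catalog.my_schema.defects_fast;

-- The same details for a slow table; compare numFiles, format, and partitionColumns
DESCRIBE DETAIL my_catalog.my_schema.defects_slow;

-- Provider (delta vs parquet), table properties, and statistics for the slow table
DESCRIBE EXTENDED my_catalog.my_schema.defects_slow;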

viniciusmartins
Visitor

Our data engineering team has already worked on these actions. It worked when I filtered the tables in Microsoft Fabric using the Power Query expression below:

Table.SelectRows(
    Table.SelectColumns(
        #"Navigation 2"{[Name = "defects", Kind = "Table"]}[Data],
        {"column1", "column2", "column3", "column4", "column5", "column6", "column7", "column8", "column9", "column10", "column11", "column12", "column13", "column14", "column15", "column16", "column17"}
    ),
    each [station] = "North" or [station] = "South"
)

Do you know why I can't get the whole table?

emma_s
Databricks Employee
Databricks Employee

My suspicion is that it's timing out because the data is not well optimized or is too large to retrieve in one pass. When you filter it down, there is less data to read, which makes the query easier to complete.

viniciusmartins
Visitor

Do you think the Databricks cluster that Microsoft Fabric is connected to needs more capacity?