Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Inconsistency on Dataframe queried from External Data Source

panganibana
New Contributor II

We have a catalog pointing to an external data source (Google BigQuery).
1) In a notebook, create a cell that runs a query to populate a DataFrame, and display the results.
2) Create another cell below and display the same DataFrame.
3) I get different results! Why? Code below. This does not happen when querying Databricks tables.

##-- First cell --##
from pyspark.sql import functions as F

df_sample_ids = (spark.table('`catalog_external`.my_schema.my_table')
                 .filter(F.col('date_created').between('2025-01-17', '2025-01-18')))

display(df_sample_ids)

##-- Second cell --##
display(df_sample_ids)

 

1 ACCEPTED SOLUTION


crystal548
New Contributor II

The issue likely stems from Spark's lazy evaluation combined with how external data sources like BigQuery are read. A DataFrame is a query plan, not a result set, so every display() re-executes the query against the source; if the BigQuery table changes between cells, or the scan returns rows nondeterministically, the two cells can show different results. Databricks (Delta) tables don't exhibit this because each query reads a consistent table snapshot. To ensure consistent results:

  • Pin a snapshot: call cache() on the DataFrame and materialize it (e.g. with count()) so subsequent cells read the cached data instead of re-querying BigQuery; call unpersist() when you want fresh data again.
  • Avoid cache dependency: if data freshness is critical, accept that each cell re-executes the query, and don't assume two actions on the same DataFrame will return identical rows.
  • Check connection stability: monitor the connection between your Spark cluster and BigQuery for intermittent failures.

By implementing these measures and accounting for lazy re-execution, you can get consistent results when querying external data sources in your Spark notebooks.


