3 weeks ago
We have a Catalog pointing to an External Data Source (Google BigQuery).
1) In a notebook, create a cell that runs a query to populate a DataFrame, then display the results.
2) Create another cell below it and display the same DataFrame.
3) I get different results! Why? Code below. This does not happen when querying Databricks tables.
##-- First cell --##
from pyspark.sql import functions as F  # needed for F.col

df_sample_ids = (
    spark.table('`catalog_external`.my_schema.my_table')
    .filter(F.col('date_created').between('2025-01-17', '2025-01-18'))
)
display(df_sample_ids)

##-- 2nd cell --##
display(df_sample_ids)
Accepted Solutions
3 weeks ago
The issue likely stems from how Spark evaluates queries against external data sources like BigQuery. A DataFrame is lazy: each action (each `display()`) re-executes the plan, and for a federated table that means a fresh query is pushed down to BigQuery every time. If the underlying data changes between cells, or BigQuery returns rows in a non-deterministic order, the two cells can show different results. Databricks (Delta) tables don't exhibit this because reads come from a consistent table snapshot. To ensure consistent results:
- Materialize once: call cache() on the DataFrame and trigger an action (e.g. count()) in the first cell, so later cells reuse the cached data instead of re-querying BigQuery.
- Avoid cache dependency: if data freshness is critical, re-execute the query in each cell and accept that results reflect the source at that moment.
- Check connection stability: monitor the connection between your Spark cluster and BigQuery for intermittent failures.
By implementing these measures and accounting for lazy re-execution, you can get consistent results when querying external data sources in your Spark notebooks.