Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Inconsistency on Dataframe queried from External Data Source

panganibana
New Contributor II

We have a catalog pointing to an external data source (Google BigQuery).
1) In a notebook, create a cell that runs a query to populate a DataFrame, and display the results.
2) Create another cell below and display the same DataFrame.
3) I get different results! Why? Code below. This does not happen when querying Databricks tables.

##-- First cell --##
from pyspark.sql import functions as F

df_sample_ids = (spark.table('`catalog_external`.my_schema.my_table')
                 .filter(F.col('date_created').between('2025-01-17', '2025-01-18')))

display(df_sample_ids)

##-- Second cell --##
display(df_sample_ids)

 

1 ACCEPTED SOLUTION


crystal548
New Contributor II

The issue likely stems from Spark's lazy evaluation combined with how external data sources like BigQuery are read. A DataFrame is a query plan, not a result set, so every display() re-executes the query against the source; if the BigQuery table changes between cells, or the scan returns rows nondeterministically, the two cells can show different results. Databricks (Delta) tables don't exhibit this because each query reads a consistent table snapshot. To ensure consistent results:

  • Pin a snapshot: call cache() on the DataFrame and materialize it (e.g. with count()) so subsequent cells read the cached data instead of re-querying BigQuery; call unpersist() when you want fresh data again.
  • Avoid cache dependency: if data freshness is critical, accept that each cell re-executes the query, and don't assume two actions on the same DataFrame will return identical rows.
  • Check connection stability: monitor the connection between your Spark cluster and BigQuery for intermittent failures.

By implementing these measures and accounting for lazy re-execution, you can get consistent results when querying external data sources in your Spark notebooks.


