
Differences between Spark SQL and Databricks

dollyb
Contributor

Hello,

I'm using a local Docker Spark 3.5 runtime to test my Databricks Connect code. However, I've come across a couple of cases where my code works in one environment but not the other.

A concrete example: I'm reading data from BigQuery via spark.read.format("bigquery") with the BigQuery connector 0.36.1 in my local environment. I can't find out which library Databricks is using.

When I fetch a table, the resulting dataset has a subtly different schema in each environment, which I don't understand since the table is the same.

Spark:

 |-- event_params: map (nullable = false)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- string_value: string (nullable = true)
 |    |    |-- int_value: long (nullable = true)
 |    |    |-- float_value: double (nullable = true)
 |    |    |-- double_value: double (nullable = true)

Databricks:

 |-- event_params: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: struct (nullable = true)
 |    |    |    |-- string_value: string (nullable = true)
 |    |    |    |-- int_value: long (nullable = true)
 |    |    |    |-- float_value: double (nullable = true)
 |    |    |    |-- double_value: double (nullable = true)

So in Databricks, event_params comes back as an array of key/value structs instead of a map, which makes no sense to me.

Which library is Databricks using? And how should I handle these differences between environments?
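One way to bridge the two schemas, independent of which connector produced them, is to normalize the column after reading. The following is only a sketch under assumptions: the helper names are hypothetical, it relies on the Spark SQL function map_from_entries to turn an array of key/value structs into a map, and it detects the array shape from the type strings in DataFrame.dtypes. With default Spark settings, map_from_entries may raise an error if the same key appears twice in one row, so it is worth checking the data first.

```python
def normalize_map_column_exprs(dtypes, column):
    """Build selectExpr() expressions that turn `column` into a map.

    `dtypes` is the list of (name, type_string) pairs from DataFrame.dtypes.
    Columns other than `column` are passed through unchanged.
    """
    exprs = []
    for name, type_string in dtypes:
        if name == column and type_string.startswith("array<struct<"):
            # array<struct<key, value>> -> map<key, value>
            exprs.append(f"map_from_entries(`{name}`) AS `{name}`")
        else:
            # Already a map (or an unrelated column): keep as-is.
            exprs.append(f"`{name}`")
    return exprs


def normalize_event_params(df, column="event_params"):
    """Return `df` with `column` as a map in both environments (hypothetical helper)."""
    return df.selectExpr(*normalize_map_column_exprs(df.dtypes, column))
```

Calling normalize_event_params(df) right after the read would leave the local (map) result untouched and convert the Databricks (array) result, so downstream code only has to deal with one shape.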

When I add the connector I use locally as a dependency on Databricks, I get this error:

Multiple sources found for bigquery (com.google.cloud.spark.bigquery.BigQueryRelationProvider, com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider)

Accepted Solution

daniel_sahal
Esteemed Contributor

@dollyb 
That's because once you add another BigQuery dependency on Databricks, two providers are registered under the name "bigquery" and Spark can't tell which one it should use. By default, Databricks uses the built-in com.google.cloud.spark.bigquery.BigQueryRelationProvider.

What you can do is pass the fully qualified class name to format(), e.g.:

spark.read.format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider")
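As a slightly fuller sketch of the same idea, the provider class can be wrapped in a small helper so that both environments go through one code path. The two class names are copied from the error message in the question; the function name, the `spark` argument (an existing SparkSession or Databricks Connect session) and the table name are assumptions for illustration only.

```python
# Provider class names as they appear in the "Multiple sources found" error.
BUILT_IN_PROVIDER = "com.google.cloud.spark.bigquery.BigQueryRelationProvider"
SPARK35_PROVIDER = "com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider"


def read_bigquery_table(spark, table, provider=SPARK35_PROVIDER):
    """Read a BigQuery table through an explicitly named connector class.

    `table` is a placeholder such as "project.dataset.table" (hypothetical).
    """
    # Naming the class instead of the short alias "bigquery" sidesteps the
    # ambiguity between the built-in and the user-supplied connector.
    return spark.read.format(provider).option("table", table).load()
```

For example, df = read_bigquery_table(spark, "my_project.my_dataset.events") would read through the Spark 3.5 provider, while passing provider=BUILT_IN_PROVIDER would select the connector that ships with Databricks.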

 


Thanks, using the FQN (fully qualified name) works. I've now added a cluster init script that removes the (outdated) version provided by Databricks.

 
