
Differences between Spark SQL and Databricks

dollyb
Contributor

Hello,

I'm using a local Docker Spark 3.5 runtime to test my Databricks Connect code. However, I've come across a couple of cases where my code works in one environment but not the other.

A concrete example: in my local environment I'm reading data from BigQuery via spark.read.format("bigquery") with the BigQuery connector 0.36.1. I can't find out which connector library Databricks itself is using.

When I fetch the same table in both environments, the resulting dataset has a subtly different schema, which I don't understand since the underlying table is identical.

Spark:

 |-- event_params: map (nullable = false)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- string_value: string (nullable = true)
 |    |    |-- int_value: long (nullable = true)
 |    |    |-- float_value: double (nullable = true)
 |    |    |-- double_value: double (nullable = true)
 

Databricks:

 |-- event_params: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: struct (nullable = true)
 |    |    |    |-- string_value: string (nullable = true)
 |    |    |    |-- int_value: long (nullable = true)
 |    |    |    |-- float_value: double (nullable = true)
 |    |    |    |-- double_value: double (nullable = true)
So in Databricks, event_params comes back as an array of key/value structs rather than a map, which makes no sense to me given that it's the same table.

What library is Databricks using? How should I handle these differences between environments?
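
One way to handle the schema drift, whichever connector wins, is to normalize the shape right after the read. The sketch below assumes PySpark; normalize_event_params is a hypothetical helper, not part of either connector:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

def normalize_event_params(df):
    # Hypothetical helper: if event_params arrived as an array of
    # (key, value) structs (the shape from Databricks' built-in
    # connector), convert it to a map so downstream code sees the
    # same schema in both environments.
    if isinstance(df.schema["event_params"].dataType, ArrayType):
        df = df.withColumn(
            "event_params",
            F.map_from_entries(F.col("event_params")),
        )
    return df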

When I add my local connector dependency on Databricks, I get this error:

Multiple sources found for bigquery (com.google.cloud.spark.bigquery.BigQueryRelationProvider, com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider)

1 ACCEPTED SOLUTION

daniel_sahal
Esteemed Contributor

@dollyb 
That's because when you add another BigQuery dependency on Databricks, it doesn't really know which source it should use. By default it uses the built-in com.google.cloud.spark.bigquery.BigQueryRelationProvider.

What you can do is pass the fully qualified provider class name to format(), e.g.

spark.read.format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider")
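For completeness, a full read through the explicitly selected source might look like the sketch below; the table id is a placeholder, not from the thread:

df = (
    spark.read
    .format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider")
    .load("my-project.my_dataset.events")  # placeholder table id
)
df.printSchema()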


dollyb
Contributor

Thanks, using the FQN works. I've now added a cluster init-script that removes the (outdated) version provided by Databricks.
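
A quick sanity check after that cleanup (a sketch; the table id is again a placeholder) is to assert that Databricks now returns the map-typed schema seen locally:

from pyspark.sql.types import MapType

df = (
    spark.read
    .format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider")
    .load("my-project.my_dataset.events")  # placeholder table id
)
# With the bundled jar removed and the FQN pinned, event_params should
# resolve to the map shape shown in the local schema above.
assert isinstance(df.schema["event_params"].dataType, MapType)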

 
