<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Differences between Spark SQL and Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/66656#M33182</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100535"&gt;@dollyb&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;That's because when you add another dependency on Databricks, Spark doesn't know which data source it should use. By default it uses the built-in&amp;nbsp;&lt;SPAN&gt;com.google.cloud.spark.bigquery.BigQueryRelationProvider.&lt;BR /&gt;&lt;BR /&gt;You can pass the fully qualified provider class name to format() instead, e.g.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;spark.read.format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider")&lt;/LI-CODE&gt;</description>
    <pubDate>Fri, 19 Apr 2024 05:41:26 GMT</pubDate>
    <dc:creator>daniel_sahal</dc:creator>
    <dc:date>2024-04-19T05:41:26Z</dc:date>
    <item>
      <title>Differences between Spark SQL and Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/66287#M33074</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I'm using a local Docker Spark 3.5 runtime to test my Databricks Connect code. However, I've come across a couple of cases where my code works in one environment but not the other.&lt;/P&gt;&lt;P&gt;As a concrete example, in my local environment I read data from BigQuery via spark.read.format("bigquery") with the BigQuery connector 0.36.1. I can't find out which library Databricks is using.&lt;/P&gt;&lt;P&gt;When I fetch a table, the dataset has a subtly different schema, which I don't understand since the table itself is the same.&lt;/P&gt;&lt;P&gt;Spark:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt; |-- event_params: map (nullable = false)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- string_value: string (nullable = true)
 |    |    |-- int_value: long (nullable = true)
 |    |    |-- float_value: double (nullable = true)
 |    |    |-- double_value: double (nullable = true)&lt;/LI-CODE&gt;&lt;P&gt;Databricks:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;|-- event_params: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: struct (nullable = true)
 |    |    |    |-- string_value: string (nullable = true)
 |    |    |    |-- int_value: long (nullable = true)
 |    |    |    |-- float_value: double (nullable = true)
 |    |    |    |-- double_value: double (nullable = true)&lt;/LI-CODE&gt;&lt;P&gt;So in Databricks the map is wrapped in an extra array, which makes no sense to me.&lt;/P&gt;&lt;P&gt;Which library is Databricks using? How should I handle these differences between environments?&lt;/P&gt;&lt;P&gt;When adding my local dependency, I get this:&lt;/P&gt;&lt;P&gt;Multiple sources found for bigquery (com.google.cloud.spark.bigquery.BigQueryRelationProvider, com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider)&lt;/P&gt;</description>
      <pubDate>Mon, 15 Apr 2024 15:57:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/66287#M33074</guid>
      <dc:creator>dollyb</dc:creator>
      <dc:date>2024-04-15T15:57:42Z</dc:date>
    </item>
    <item>
      <title>Re: Differences between Spark SQL and Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/66656#M33182</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100535"&gt;@dollyb&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;That's because when you add another dependency on Databricks, Spark doesn't know which data source it should use. By default it uses the built-in&amp;nbsp;&lt;SPAN&gt;com.google.cloud.spark.bigquery.BigQueryRelationProvider.&lt;BR /&gt;&lt;BR /&gt;You can pass the fully qualified provider class name to format() instead, e.g.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;spark.read.format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider")&lt;/LI-CODE&gt;</description>
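A minimal sketch, in plain Python rather than Spark's actual DataSource registry, of why the short name "bigquery" is ambiguous while a fully qualified class name is not: two provider classes register the same short name, so a lookup by short name matches both. The registry dict below is illustrative only.

```python
# Two providers (built-in and user-supplied) both register the short name
# "bigquery"; a fully qualified class name selects exactly one of them.
PROVIDERS = {
    "com.google.cloud.spark.bigquery.BigQueryRelationProvider": "bigquery",
    "com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider": "bigquery",
}

def resolve(source_name):
    # A fully qualified name bypasses the short-name lookup entirely.
    if source_name in PROVIDERS:
        return [source_name]
    # A short name may match several registered providers, which is
    # what triggers Spark's "Multiple sources found" error.
    return [fqn for fqn, short in PROVIDERS.items() if short == source_name]

print(len(resolve("bigquery")))  # ambiguous: matches both providers
```

In real PySpark the same idea is simply `spark.read.format("com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider").load(...)`, which names the provider class directly instead of the ambiguous short name.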
      <pubDate>Fri, 19 Apr 2024 05:41:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/66656#M33182</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2024-04-19T05:41:26Z</dc:date>
    </item>
    <item>
      <title>Re: Differences between Spark SQL and Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/75367#M34951</link>
      <description>&lt;P&gt;Thanks, using the fully qualified name works. I've now added a cluster init script that removes the outdated connector version bundled with Databricks.&lt;/P&gt;</description>
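A hedged sketch of what such an init script might look like, assuming the bundled connector jar lives under /databricks/jars; the path and the jar-name pattern are assumptions to verify against your runtime before use.

```shell
#!/bin/bash
# Remove the runtime-bundled BigQuery connector so the user-supplied,
# newer version is the only one registered for the "bigquery" source.
# Path and glob are assumptions; inspect your cluster's jar directory first.
rm -f /databricks/jars/*spark-bigquery*.jar
```

Note that deleting bundled jars is unsupported surgery: a safer alternative, as the accepted answer shows, is to keep both jars and select the desired provider explicitly via its fully qualified class name in format().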
      <pubDate>Fri, 21 Jun 2024 17:54:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-spark-sql-and-databricks/m-p/75367#M34951</guid>
      <dc:creator>dollyb</dc:creator>
      <dc:date>2024-06-21T17:54:15Z</dc:date>
    </item>
  </channel>
</rss>

