databricks-connect throws an exception when showing a DataFrame with JSON content
08-12-2022 08:07 AM
I'm facing an issue when I try to show a DataFrame with JSON content. It happens when the script runs through databricks-connect from VS Code.
I would appreciate any help or guidance to get this running as it should.
Thanks in advance.
This is how the cluster is configured:
Cluster: Azure Databricks Runtime 10.4
Workers: 2-8 × Standard_DS3_v2 (14 GB memory, 4 cores)
Driver: Standard_DS3_v2 (14 GB memory, 4 cores)
Spark config:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
spark.hadoop.datanucleus.connectionPoolingType hikari
spark.databricks.delta.preview.enabled true
On my local computer:
I installed databricks-connect with pip install databricks-connect==10.4.*
It is configured as the documentation indicates (Azure databricks-connect setup).
When I run databricks-connect test, it passes without any failure.
The code I'm trying to run is this:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# One JSON string with a nested "address" object
nested_row = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']

# Explicit schema matching the nested structure
nested_struct = StructType([
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True)
    ]), True),
    StructField("name", StringType(), True)
])

nested_rdd = sc.parallelize(nested_row)
df_json = spark.read.json(nested_rdd, nested_struct)
df_json.printSchema()
df_json.show()
Everything runs fine up to printSchema(), but calling show() on the DataFrame throws an exception:
Traceback (most recent call last):
File "c:\Data\projects\vcode\gbrx-dbconnect\dbc1.py", line 76, in <module>
df_json.show()
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\pyspark\sql\dataframe.py", line 502, in show
print(self._jdf.showString(n, 20, vertical))
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
return f(*a, **kw)
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString.
: java.lang.ClassCastException: cannot assign instance of java.lang.String to field org.apache.spark.sql.catalyst.json.JSONOptions.lineSeparatorInRead of type scala.Option in instance of org.apache.spark.sql.catalyst.json.JSONOptions
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2411)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.sql.util.ProtoSerializer.$anonfun$deserializeObject$1(ProtoSerializer.scala:7055)
Labels: Azure Databricks, Databricks Connect
08-12-2022 10:13 AM
Your code is correct. Please execute it directly on Databricks.
"Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect." The main limitation is that your code is not executed directly on Databricks the way notebook code is, plus the fact that Databricks Connect is approaching end of life.
Spark Connect will be available soon, which will make our lives easier.
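In the meantime, if you need this to run over databricks-connect, one thing worth trying is to avoid the RDD-based spark.read.json() path, which is what appears to fail to deserialize JSONOptions in the traceback above, and parse the JSON with from_json() on a DataFrame column instead. A minimal sketch, untested over databricks-connect, reusing the schema from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

nested_struct = StructType([
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True)
    ]), True),
    StructField("name", StringType(), True)
])

# Put the raw JSON strings in a one-column DataFrame instead of an RDD,
# then parse them with from_json -- this stays on the DataFrame API and
# never touches the RDD-based JSON reader.
raw = spark.createDataFrame(
    [('{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}',)],
    ["json_str"],
)
df_json = (raw
    .select(from_json(col("json_str"), nested_struct).alias("parsed"))
    .select("parsed.*"))
df_json.printSchema()
df_json.show()

This keeps everything on the DataFrame API, so nothing from the RDD-based JSON reader has to be serialized between the client and the cluster.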
08-12-2022 11:41 AM
The code works fine on a Databricks cluster, but it is part of a unit test that runs in the local environment before being pushed to a branch, opened as a PR, and merged into the master branch.
Thanks for the advice on using dbx. I'll give it another try, even though I've tried it before.
I'll keep an eye out for Spark Connect.
Thank you.
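One option for the unit-test scenario is to run the tests against a plain local PySpark session rather than databricks-connect, so they don't depend on a cluster at all. A minimal pytest sketch under that assumption (the fixture and test names are just illustrative):

import pytest
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

@pytest.fixture(scope="session")
def spark():
    # Plain local Spark -- no cluster and no databricks-connect involved
    session = (SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate())
    yield session
    session.stop()

def test_nested_json_parses(spark):
    schema = StructType([
        StructField("address", StructType([
            StructField("city", StringType(), True),
            StructField("state", StringType(), True)
        ]), True),
        StructField("name", StringType(), True)
    ])
    rdd = spark.sparkContext.parallelize(
        ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}'])
    df = spark.read.json(rdd, schema)
    row = df.collect()[0]
    assert row["name"] == "Yin"
    assert row["address"]["city"] == "Columbus"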