I'm facing an issue when I try to show a DataFrame built from JSON content.
This happens when the script runs via databricks-connect from VS Code.
Any help or guidance to get this running as it should would be appreciated.
Thanks in advance.
This is how the cluster is configured:

Cluster: Azure Databricks runtime 10.4
Workers: 2-8 Standard_DS3_v2 (14 GB memory, 4 cores each)
Driver: Standard_DS3_v2 (14 GB memory, 4 cores)

Spark config:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
spark.hadoop.datanucleus.connectionPoolingType hikari
spark.databricks.delta.preview.enabled true
On my local computer:

I installed databricks-connect with pip install "databricks-connect==10.4.*", matching the cluster runtime.
It is configured as the documentation indicates (Azure databricks-connect setup).
When I run databricks-connect test, it passes without any failure.
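For reference, databricks-connect configure wrote a .databricks-connect file in my home directory shaped like this (values redacted, placeholders mine; note the port matches spark.databricks.service.port above):

{
  "host": "https://adb-XXXX.azuredatabricks.net",
  "token": "<personal-access-token>",
  "cluster_id": "<cluster-id>",
  "org_id": "<org-id>",
  "port": "8787"
}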
The code I'm trying to run is this:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Get the databricks-connect session and its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# One JSON document with a nested address struct
nested_row = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']

# Explicit schema matching the JSON above
nested_struct = StructType([
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True)
    ]), True),
    StructField("name", StringType(), True)
])

nested_rdd = sc.parallelize(nested_row)
df_json = spark.read.json(nested_rdd, nested_struct)
df_json.printSchema()
df_json.show()
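For reference, printSchema() prints the structure I expect:

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- name: string (nullable = true)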
Everything runs fine up to printSchema(); the exception is thrown only when I call show(), which is the first action that actually executes the plan on the cluster:
Traceback (most recent call last):
File "c:\Data\projects\vcode\gbrx-dbconnect\dbc1.py", line 76, in <module>
df_json.show()
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\pyspark\sql\dataframe.py", line 502, in show
print(self._jdf.showString(n, 20, vertical))
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
return f(*a, **kw)
File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString.
: java.lang.ClassCastException: cannot assign instance of java.lang.String to field org.apache.spark.sql.catalyst.json.JSONOptions.lineSeparatorInRead of type scala.Option in instance of org.apache.spark.sql.catalyst.json.JSONOptions
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2411)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.sql.util.ProtoSerializer.$anonfun$deserializeObject$1(ProtoSerializer.scala:7055)
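The ClassCastException seems to happen while the cluster deserializes the serialized plan (JSONOptions.lineSeparatorInRead), so I wonder whether the RDD-based spark.read.json path is the trigger. Below is a sketch of an equivalent formulation I could try instead, parsing the string with from_json on a one-column DataFrame (untested on my setup; that it avoids the failing deserialization is only an assumption on my part):

from pyspark.sql.functions import col, from_json

# Same JSON document, carried as a DataFrame column instead of an RDD,
# so parsing goes through from_json rather than the RDD JSON reader.
# NOTE: assumption - this path may not hit the JSONOptions deserialization that fails above.
df_raw = spark.createDataFrame([(nested_row[0],)], ["raw"])
df_alt = df_raw.select(from_json(col("raw"), nested_struct).alias("parsed")).select("parsed.*")
df_alt.show()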