
databricks-connect throws an exception when showing a dataframe with json content

KarimSegura
New Contributor III

I'm facing an issue when trying to show a dataframe with JSON content.

This happens when the script runs through databricks-connect from VS Code.

I would appreciate any help or guidance on getting this to run as it should.

Thanks in advance.

This is how the cluster is configured:

Cluster: Azure Databricks Runtime 10.4

Workers: 2-8 x Standard_DS3_v2 (14 GB memory, 4 cores)

Driver: Standard_DS3_v2 (14 GB memory, 4 cores)

Spark config:

spark.databricks.service.server.enabled true

spark.databricks.service.port 8787

spark.hadoop.datanucleus.connectionPoolingType hikari

spark.databricks.delta.preview.enabled true

On my local computer:

I installed databricks-connect with pip install "databricks-connect==10.4.*"

It is configured as the documentation indicates (Azure databricks-connect setup).

When I run databricks-connect test, it passes without any failure.
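
Beyond databricks-connect test, a quick end-to-end check from Python confirms the session actually reaches the cluster; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A trivial round trip to the cluster: if this prints 10, the
# databricks-connect session is working.
print(spark.range(10).count())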

The code I'm trying to run is this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# databricks-connect scripts must build the session explicitly;
# `spark` and `sc` are only predefined inside notebooks.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nested_row = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']

nested_struct = StructType([
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True)
    ]), True),
    StructField("name", StringType(), True)
])

nested_rdd = sc.parallelize(nested_row)

# Read the JSON strings from the RDD with an explicit schema.
df_json = spark.read.json(nested_rdd, nested_struct)

df_json.printSchema()
df_json.show()

Everything runs fine up to printSchema(), but when I try to show the dataframe, it throws this exception:

Traceback (most recent call last):
  File "c:\Data\projects\vcode\gbrx-dbconnect\dbc1.py", line 76, in <module>
    df_json.show()
  File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\pyspark\sql\dataframe.py", line 502, in show
    print(self._jdf.showString(n, 20, vertical))
  File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
    return f(*a, **kw)
  File "c:\Data\projects\vcode\gbrx-dbconnect\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString.
: java.lang.ClassCastException: cannot assign instance of java.lang.String to field org.apache.spark.sql.catalyst.json.JSONOptions.lineSeparatorInRead of type scala.Option in instance of org.apache.spark.sql.catalyst.json.JSONOptions
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
        at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2411)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.sql.util.ProtoSerializer.$anonfun$deserializeObject$1(ProtoSerializer.scala:7055)
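
The ClassCastException is thrown while deserializing org.apache.spark.sql.catalyst.json.JSONOptions on the server side, which usually points to a version skew between the databricks-connect client and the cluster runtime, so it is worth confirming the installed client exactly matches the 10.4 runtime. Independently of that, a possible workaround is to keep the JSON parsing on the DataFrame API with from_json instead of spark.read.json over an RDD; a minimal sketch, untested against this exact setup:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

nested_row = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']

nested_struct = StructType([
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True)
    ]), True),
    StructField("name", StringType(), True)
])

# Put the raw JSON strings in a one-column DataFrame, then parse them
# with from_json; this avoids spark.read.json(rdd), whose JSONOptions
# is the object that fails to deserialize in the traceback above.
df_raw = spark.createDataFrame([(s,) for s in nested_row], ["json_str"])
df_json = (
    df_raw
    .select(from_json(col("json_str"), nested_struct).alias("parsed"))
    .select("parsed.*")
)

df_json.printSchema()
df_json.show(truncate=False)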

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

Your code is correct. Please execute it directly on Databricks.

"Databricks recommends that you use DBX by Databricks Labs for local development instead of Databricks Connect." The main limitation of Databricks Connect is that code is executed directly on clusters rather than through Databricks itself, plus the fact that it is EOL.

Soon Spark Connect will be available, which will make our lives easier.
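
For reference, the Spark Connect client API (shipping with Apache Spark 3.4+) lets a local script target a remote Spark server; a rough sketch, with a placeholder endpoint:

from pyspark.sql import SparkSession

# Spark Connect (Spark 3.4+): point the builder at a remote endpoint
# instead of a local master; "sc://localhost:15002" is a placeholder.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.range(5).show()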

KarimSegura
New Contributor III

The code works fine on a Databricks cluster, but it is part of a unit test that runs in a local environment and is then submitted to a branch -> PR -> merged into the master branch (see the local-SparkSession sketch below).

Thanks for the advice on DBX. I'll give it another try, even though I've tried it before.

I'll keep an eye out for Spark Connect.

Thank you.
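
On the local unit-test setup mentioned above: a common pattern is to run such tests against a plain local SparkSession through a pytest fixture, which sidesteps databricks-connect entirely. A minimal sketch, assuming pytest and a local pyspark install (the fixture and test names are illustrative, not from this thread):

# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local-mode session: no cluster or databricks-connect needed.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("local-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

# test_nested_json.py
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

def test_parses_nested_json(spark):
    schema = StructType([StructField("name", StringType(), True)])
    df = spark.createDataFrame([('{"name":"Yin"}',)], ["json_str"])
    parsed = df.select(from_json(col("json_str"), schema).alias("p")).select("p.*")
    assert parsed.first()["name"] == "Yin"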
