I am trying to read data from Elasticsearch (ES version 8.5.2) using PySpark on Databricks (runtime 13.0, which includes Apache Spark 3.4.0 and Scala 2.12). The whole ecosystem runs on AWS.
I am able to run a curl command from the Databricks notebook against the ES ip:port and fetch data, which tells me network access is available. However, I am unable to read from the same ES cluster through PySpark.
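For reference, the verification call looked roughly like this (run from a %sh notebook cell; the scheme, port, endpoint, and credentials here are placeholders, not the real values):
------------------
%sh
# Hypothetical sketch of the connectivity check; replace placeholders with real values
curl -u "$ES_USER:$ES_PASS" "http://es01-nonprod.office.io:9200/_cat/indices?v"
------------------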
Below is the code.
Jars installed on the cluster:
org.elasticsearch:elasticsearch-spark-30_2.12:8.5.2
org.elasticsearch:elasticsearch-hadoop:8.5.2
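Both coordinates are attached to the cluster as Maven libraries. Equivalently (a sketch of one alternative way to attach the connector, not necessarily what the cluster does today), the package can be pulled in through the cluster's Spark config:
------------------
spark.jars.packages org.elasticsearch:elasticsearch-spark-30_2.12:8.5.2
------------------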
------------------
df = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("spark.es.nodes.wan.only", "true")
    .option("spark.es.nodes", "es01-nonprod.office.io")
    # .option("es.net.ssl", "true")
    .option("spark.es.net.http.auth.user", username)
    .option("spark.es.net.http.auth.pass", password)
    .option("spark.es.port", port)
    # .option("es.net.ssl.protocol", "https")
    .option("spark.es.nodes.discovery", "false")
    # .option("es.nodes.client.only", "false")
    # .option("spark.es.scheme", "https")
    # .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # .option("spark.es.http.timeout", "10m")
    # .option("es.net.ssl.keystore.type", "CRT")
    # .option("es.net.ssl.truststore.location", "/etc/ssl/certs/ca-certificates.crt")
    .load(index)
)
display(df)
----------------
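For comparison, the elasticsearch-hadoop documentation writes these reader options without the spark. prefix (the prefixed form is meant for values set in the Spark configuration itself). A minimal sketch of the same read in that form, reusing the username, password, port, and index variables from above:
----------------
df = (spark.read
    .format("org.elasticsearch.spark.sql")
    # Connector option names as shown in the elasticsearch-hadoop docs (es.* keys)
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", "es01-nonprod.office.io")
    .option("es.port", port)
    .option("es.net.http.auth.user", username)
    .option("es.net.http.auth.pass", password)
    .option("es.nodes.discovery", "false")
    .load(index)
)
----------------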
Error screenshot attached; as noted above, the curl command works just fine.
I've tried:
- adding all the spark configurations during cluster creation (see the sketch after this list);
- changing the jars to org.elasticsearch:elasticsearch-hadoop:8.5.2 alone.
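For the cluster-level attempt, the Spark config box looked roughly like this (a sketch; 9200 stands in for the actual port):
----------------
spark.es.nodes es01-nonprod.office.io
spark.es.port 9200
spark.es.nodes.wan.only true
spark.es.nodes.discovery false
----------------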
Any help resolving this would be appreciated.