PySpark on JupyterHub K8s || Unable to query data || Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

vivek_sinha
Contributor

PySpark Version: 2.4.5
Hive Version: 1.2
Hadoop Version: 2.7
AWS-SDK Jar: 1.7.4
Hadoop-AWS Jar: 2.7.3

When I try to show the data I get "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found", even though I am passing all the required configuration.

I tried all three of the following values for the fs.s3a.aws.credentials.provider config, but none of them worked:

  1. org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider
  2. com.amazonaws.auth.InstanceProfileCredentialsProvider
  3. com.amazonaws.auth.EnvironmentVariableCredentialsProvider

If a table has no data, the count comes back as 0, but any table that does contain data fails with the errors below.

Metadata operations work fine, e.g. printSchema() and SHOW TABLES, but anything that actually reads the data fails: .show(), .toPandas(), .toJSON().collect(), and even saving to CSV.

Sample Code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Map the s3a:// scheme to the S3A filesystem implementation.
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Note: the S3A credentials key is fs.s3a.aws.credentials.provider (not fs.s3.*).
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")

val = spark.sql("select * from customer.100_rating limit 5")
val.show()

Error 1: with .show() / .toPandas() / .toJSON()

Py4JJavaError: An error occurred while calling o1132.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 26.0 failed 4 times, most recent failure: Lost task 1.3 in stage 26.0 (TID 498, 10.101.36.145, executor 2): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Error 2: while saving data to CSV

Py4JJavaError: An error occurred while calling o531.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

Error 3: while trying to count a specific column

Py4JJavaError: An error occurred while calling o99.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`testing`' given input columns: [dataplatform.testing.id, dataplatform.testing.name]; line 1 pos 13;
'Aggregate [name#247], [unresolvedalias('count('testing[name]), None)]
+- SubqueryAlias `dataplatform`.`testing`
   +- HiveTableRelation `dataplatform`.`testing`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#246, name#247]
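The trace in Error 1 shows the task failing on an executor rather than on the driver, which fits the symptom that metadata operations succeed while data reads fail. As a minimal probe (a sketch, assuming a live SparkSession named spark), you can check whether the driver JVM itself can load the class; if this succeeds but .show() still fails, the jar is missing on the executor side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask the driver JVM, via the Py4J gateway, to load the S3A filesystem class.
try:
    spark._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    print("driver classpath: S3AFileSystem found")
except Exception as exc:
    print("driver classpath is missing hadoop-aws:", exc)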

Please help me fix this issue; it has been pending for a long time.

1 ACCEPTED SOLUTION


vivek_sinha
Contributor

Hi @Arvind Ravish,

Thanks for the response; the issue is fixed now.

The image I was using to launch the Spark executors didn't have the AWS jars. After making the necessary changes it started working.

Many thanks again for your response.
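In case it helps others, here is a minimal sketch of the shape of that fix (the image name and tag below are placeholders, not from the original post): rebuild the executor image so it bundles the hadoop-aws and aws-java-sdk jars, then point Spark-on-K8s at the new image when building the session:

from pyspark.sql import SparkSession

# Placeholder image, assumed to bundle hadoop-aws-2.7.3.jar and
# aws-java-sdk-1.7.4.jar on the executor classpath.
spark = (
    SparkSession.builder
    .config("spark.kubernetes.container.image", "myrepo/spark-py:2.4.5-s3a")
    .getOrCreate()
)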


REPLIES

User16764241763
Honored Contributor

Hello @vivek,

Could you confirm if you are running this code on the Databricks platform?

Try adding the spark.jars config, which includes all the dependent jars, when you are initializing the Spark session:

SparkSession.builder \
    .config("spark.jars", "x.jar,y.jar") \
    .getOrCreate()

spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
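A sketch of this suggestion using the versions from this thread; the jar paths are placeholders, and spark.jars.packages (shown commented out) is an alternative that resolves the same artifacts from Maven instead of the local filesystem:

from pyspark.sql import SparkSession

# Placeholder paths: adjust to wherever the jars live in your image.
spark = (
    SparkSession.builder
    .config("spark.jars",
            "/opt/jars/hadoop-aws-2.7.3.jar,/opt/jars/aws-java-sdk-1.7.4.jar")
    # Alternative: have Spark resolve the artifacts from Maven.
    # .config("spark.jars.packages",
    #         "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")
    .getOrCreate()
)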

Anonymous
Not applicable

Based on the Hadoop version, Kubernetes, and the AWS SDK, this was clearly not running on Databricks.


Kaniz
Community Manager

Hi @Vivek Sinha, I'm glad you've fixed the issue. Would you mind selecting the best answer, as it would be helpful for the community?
