PySpark on JupyterHub K8s || Unable to query data || Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

vivek_sinha
Contributor

PySpark Version: 2.4.5
Hive Version: 1.2
Hadoop Version: 2.7
AWS-SDK Jar: 1.7.4
Hadoop-AWS Jar: 2.7.3

When I try to show the data I get "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found", even though I am passing all the required configuration.

I tried all three of the following values for the fs.s3a.aws.credentials.provider config, but none of them worked:

  1. org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider
  2. com.amazonaws.auth.InstanceProfileCredentialsProvider
  3. com.amazonaws.auth.EnvironmentVariableCredentialsProvider

If a table has no data, the count comes back as 0, but any table that does contain data fails with the errors below.

Metadata operations work fine, e.g. printSchema() and SHOW TABLES, but anything that actually reads the data fails: .show(), .toPandas(), .toJSON().collect(), and even saving to CSV.

Sample Code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Map the s3a:// scheme to the S3A filesystem implementation.
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Note: the S3A credentials key is fs.s3a.aws.credentials.provider (not fs.s3.*).
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")

val = spark.sql("select * from customer.100_rating limit 5")
val.show()

Error 1: with .show() / .toPandas() / .toJSON()

Py4JJavaError: An error occurred while calling o1132.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 26.0 failed 4 times, most recent failure: Lost task 1.3 in stage 26.0 (TID 498, 10.101.36.145, executor 2): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Error 2: while saving data to CSV

Py4JJavaError: An error occurred while calling o531.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

Error 3: while trying to count a specific column

Py4JJavaError: An error occurred while calling o99.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`testing`' given input columns: [dataplatform.testing.id, dataplatform.testing.name]; line 1 pos 13;
'Aggregate [name#247], [unresolvedalias('count('testing[name]), None)]
+- SubqueryAlias `dataplatform`.`testing`
   +- HiveTableRelation `dataplatform`.`testing`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#246, name#247]
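The trace in Error 1 shows the task failing on an executor rather than on the driver, which fits the symptom that metadata operations succeed while data reads fail. As a minimal probe (a sketch, assuming a live SparkSession named spark), you can check whether the driver JVM itself can load the class; if this succeeds but .show() still fails, the jar is missing on the executor side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask the driver JVM, via the Py4J gateway, to load the S3A filesystem class.
try:
    spark._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    print("driver classpath: S3AFileSystem found")
except Exception as exc:
    print("driver classpath is missing hadoop-aws:", exc)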

Please help me fix this issue; it has been pending for a long time.

1 ACCEPTED SOLUTION


vivek_sinha
Contributor

Hi @Arvind Ravish,

Thanks for the response; the issue is fixed now.

The image I was using to launch the Spark executors didn't have the AWS jars. After making the necessary changes it started working.

Many thanks again for your response.
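In case it helps others, here is a minimal sketch of the shape of that fix (the image name and tag below are placeholders, not from the original post): rebuild the executor image so it bundles the hadoop-aws and aws-java-sdk jars, then point Spark-on-K8s at the new image when building the session:

from pyspark.sql import SparkSession

# Placeholder image, assumed to bundle hadoop-aws-2.7.3.jar and
# aws-java-sdk-1.7.4.jar on the executor classpath.
spark = (
    SparkSession.builder
    .config("spark.kubernetes.container.image", "myrepo/spark-py:2.4.5-s3a")
    .getOrCreate()
)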


REPLIES

User16764241763
Honored Contributor

Hello @vivek,

Could you confirm if you are running this code on the Databricks platform?

Try adding the spark.jars config, which includes all the dependent jars, when you are initializing the Spark session:

SparkSession.builder \
    .config("spark.jars", "x.jar,y.jar") \
    .getOrCreate()

spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
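A sketch of this suggestion using the versions from this thread; the jar paths are placeholders, and spark.jars.packages (shown commented out) is an alternative that resolves the same artifacts from Maven instead of the local filesystem:

from pyspark.sql import SparkSession

# Placeholder paths: adjust to wherever the jars live in your image.
spark = (
    SparkSession.builder
    .config("spark.jars",
            "/opt/jars/hadoop-aws-2.7.3.jar,/opt/jars/aws-java-sdk-1.7.4.jar")
    # Alternative: have Spark resolve the artifacts from Maven.
    # .config("spark.jars.packages",
    #         "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")
    .getOrCreate()
)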

Anonymous
Not applicable

Based on the Hadoop version, Kubernetes, and the AWS SDK, this was clearly not running on Databricks.


Kaniz
Community Manager

Hi @Vivek Sinha, I'm glad you've fixed the issue. Would you mind selecting the best answer, as it would be helpful for the community?
