06-10-2022 05:39 PM
Pyspark Version:
Hive Version: 1.2
Hadoop Version: 2.7
AWS-SDK Jar: 1.7.4
Hadoop-AWS: 2.7.3
When I try to show data, I get "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found", even though I am passing all the required configuration.
I tried all three values for the fs.s3.aws.credentials.provider config, but none of them worked.
If a table has no data, the count comes back as 0, but for a table that contains data the query fails with the errors below.
Everything else works fine, such as printSchema() and show tables, but any attempt to read the data fails:
.show(), .toPandas(), .toJSON().collect(), and even saving to CSV do not work.
Sample Code:
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sc._jsc.hadoopConfiguration().set('fs.s3.aws.credentials.provider', 'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')
val = sc.sql("select * from customer.100_rating limit 5")
Error 1: With .show()/ .toPandas()/ .toJSON()
Py4JJavaError: An error occurred while calling o1132.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 26.0 failed 4 times, most recent failure: Lost task 1.3 in stage 26.0 (TID 498,, executor 2): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Error 2: while saving data to csv:
Py4JJavaError: An error occurred while calling o531.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
Error 3: While trying to count a specific column
Py4JJavaError: An error occurred while calling o99.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`testing`' given input columns: [dataplatform.testing.id, dataplatform.testing.name]; line 1 pos 13;
'Aggregate [name#247], [unresolvedalias('count('testing[name]), None)]
+- SubqueryAlias `dataplatform`.`testing`
+- HiveTableRelation `dataplatform`.`testing`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#246, name#247]
Please help me fix this issue; it has been pending for a long time.
06-12-2022 12:49 AM
Hi @Arvind Ravish
Thanks for the response. I have now fixed the issue.
The image I was using to launch the Spark executors didn't have the AWS jars. After making the necessary changes, it started working.
Many thanks again for your response.
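For anyone hitting the same error, here is a minimal sketch of how one might check whether an image's Spark jars directory contains the AWS jars. The directory location (SPARK_HOME/jars, falling back to /opt/spark/jars), the jar name prefixes, and the helper name missing_aws_jars are assumptions for illustration, not something taken from my actual setup:

```python
import os

# Assumed jar name prefixes, based on the Hadoop-AWS and AWS-SDK jars
# listed in the question. Adjust them to match your own build.
REQUIRED_PREFIXES = ("hadoop-aws-", "aws-java-sdk-")


def missing_aws_jars(jars_dir, required_prefixes=REQUIRED_PREFIXES):
    """Return the required prefixes that no .jar file in jars_dir starts with."""
    try:
        names = os.listdir(jars_dir)
    except FileNotFoundError:
        # If the directory does not exist, every required jar is missing.
        return list(required_prefixes)
    jar_names = [n for n in names if n.endswith(".jar")]
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in jar_names)
    ]


if __name__ == "__main__":
    # Hypothetical default location; many Spark images keep jars here.
    jars_dir = os.path.join(os.environ.get("SPARK_HOME", "/opt/spark"), "jars")
    print("Missing jar prefixes:", missing_aws_jars(jars_dir))
```

Running this inside the executor image (not only on the driver) is the point, since the ClassNotFoundException above was raised in a task on an executor.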
06-11-2022 11:48 PM
Hello @vivek,
Could you confirm whether you are running this code on the Databricks platform?
Try adding the spark.jars config, listing all the dependent jars, when you initialize the Spark session:
.config("spark.jars", "x.jar,y.jar")\
Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
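As a rough sketch of how that could look end to end: the jar paths below are hypothetical placeholders, and build_spark_jars and create_session are illustrative helper names, not part of PySpark. Only SparkSession.builder, appName, config, and getOrCreate are actual PySpark calls.

```python
def build_spark_jars(jar_paths):
    """Join jar paths into the comma-separated string that spark.jars expects."""
    cleaned = [p.strip() for p in jar_paths if p and p.strip()]
    return ",".join(cleaned)


def create_session(jar_paths, app_name="s3a-example"):
    """Create a SparkSession with spark.jars set at build time."""
    # Imported lazily so build_spark_jars can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", build_spark_jars(jar_paths))
        .getOrCreate()
    )


# Hypothetical paths; point these at wherever the jars live in your environment.
EXAMPLE_JARS = [
    "/opt/jars/hadoop-aws-2.7.3.jar",
    "/opt/jars/aws-java-sdk-1.7.4.jar",
]
```

Calling create_session(EXAMPLE_JARS) would then start the session with both jars. Note that spark.jars is read when the session is first created, so setting it after getOrCreate() has already returned an existing session is unlikely to have any effect.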
06-12-2022 04:32 AM
Based on the Hadoop version, Kubernetes, and the AWS SDK, the poster was clearly not using Databricks.
06-13-2022 12:16 AM
Hi @Vivek Sinha, I'm glad you've fixed the issue. Would you mind selecting the best answer as it would be helpful for the community?
