PySpark version: 2.4.5
Hive version: 1.2
Hadoop version: 2.7
AWS SDK jar: 1.7.4
hadoop-aws jar: 2.7.3
When I try to show data, I get `Class org.apache.hadoop.fs.s3a.S3AFileSystem not found`, even though I am passing all the required configuration.
I tried all three of the following values for the `fs.s3.aws.credentials.provider` config, but none of them worked:
- org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider
- com.amazonaws.auth.InstanceProfileCredentialsProvider
- com.amazonaws.auth.EnvironmentVariableCredentialsProvider
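From what I have read in the hadoop-aws documentation, the credentials key is spelled `fs.s3a.aws.credentials.provider` (with `s3a`), and I believe it was only introduced around Hadoop 2.8, so on Hadoop 2.7.x the static key/secret properties may be the only option. A sketch of both variants (the access/secret values are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark._jsc.hadoopConfiguration()

# Note the "s3a" in the key; "fs.s3.aws.credentials.provider" is a
# different (unrecognized) property and would be silently ignored.
conf.set("fs.s3a.aws.credentials.provider",
         "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")

# On Hadoop 2.7.x the provider key may not be supported yet, so the
# static keys are the safer fallback (values are placeholders):
conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
```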
If the table has no data, the count comes back as 0, but for a table that has data it fails with the errors below.
Metadata operations such as `printSchema()` and `show tables` work fine, but anything that reads the actual data fails: `.show()`, `.toPandas()`, `.toJSON().collect()`, and even saving to CSV.
Sample Code:
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3.aws.credentials.provider', 'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')
val = sc.sql("select * from customer.100_rating limit 5")
val.show()
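From what I understand, a `ClassNotFoundException` thrown on the executors (as in Error 1 below) usually means the hadoop-aws and aws-java-sdk jars are not on the executor classpath at all, rather than a credentials problem. A sketch of pulling them in via `spark.jars.packages`, which must be set before the session is created (Maven coordinates assumed from the versions listed above):

```python
from pyspark.sql import SparkSession

# spark.jars.packages only takes effect at session-creation time;
# hadoop-aws 2.7.3 was built against aws-java-sdk 1.7.4, so the two
# versions below are assumed to match each other.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.3,"
            "com.amazonaws:aws-java-sdk:1.7.4")
    .config("spark.hadoop.fs.s3a.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .enableHiveSupport()
    .getOrCreate()
)
```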
Error 1: with .show() / .toPandas() / .toJSON()
Py4JJavaError: An error occurred while calling o1132.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 26.0 failed 4 times, most recent failure: Lost task 1.3 in stage 26.0 (TID 498, 10.101.36.145, executor 2): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Error 2: while saving data to CSV:
Py4JJavaError: An error occurred while calling o531.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
Error 3: while trying to count a specific column:
Py4JJavaError: An error occurred while calling o99.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`testing`' given input columns: [dataplatform.testing.id, dataplatform.testing.name]; line 1 pos 13;
'Aggregate [name#247], [unresolvedalias('count('testing[name]), None)]
+- SubqueryAlias `dataplatform`.`testing`
+- HiveTableRelation `dataplatform`.`testing`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#246, name#247]
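Error 3 looks like a separate SQL resolution issue rather than the classpath problem: the plan shows Spark treating `testing` as a column name instead of the table. A hypothetical rewrite that qualifies the column through a table alias (the original query shape is my guess from the plan, which shows `count('testing[name])`):

```python
# Hypothetical fix: alias the table and reference the column through
# the alias, instead of writing count(testing.name) / testing['name'].
val = sc.sql(
    "select t.name, count(*) as cnt "
    "from dataplatform.testing t "
    "group by t.name"
)
val.show()
```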
Please help me fix this issue; it has been pending for a long time.