Re: I am running simple count and I am getting an ...

miklos · 10-26-2015

Could you send a screenshot of what you see in the Spark UI?

You should see this text: "Failed Jobs (1)"

Click the link in the "Description" field, then click it again on the next page, to see the number of times this executor has run.

I only see count() and take(1) being called on the dataset, and neither of them validates the data against the schema you provided. count() just counts the number of records, and take(1) just returns a single row.

This is the error:

org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:250)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$anonfun$run$1$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$anonfun$run$1$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Trying to write more fields than contained in row (554 > 246)
    at org.apache.spark.sql.execution.datasources.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:261)
    at org.apache.spark.sql.execution.datasources.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:257)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.writeInternal(ParquetRelation.scala:99)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:242)
    ... 8 more

I just added this code to your notebook to show you that the records in the dataset do not all have the same number of elements:

count = Data.map(lambda x: len(x)).distinct().collect()
print(count)

[1, 554, 555, 560, 309, 246, 89, 221]

Note that 554 and 246, the two numbers in the exception, both appear in this list. Regarding error handling, it is up to you to decide whether these are bad records that you want to set aside and recover later, or whether they come from a parsing error in the code above. That requires more understanding of your use case, but look over the information provided to work out where this could be happening in your code.
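If it helps, here is a rough sketch of one way to separate the records by width before writing. It is plain Python on a made-up sample so it runs anywhere. The names length_histogram, split_by_width and EXPECTED_FIELDS are hypothetical, and 554 is only my assumption about your schema width, based on the exception message.

```python
from collections import Counter

# Assumption: the schema being written has 554 fields, as the exception suggests.
EXPECTED_FIELDS = 554


def length_histogram(rows):
    """Count how many rows have each field count."""
    return Counter(len(row) for row in rows)


def split_by_width(rows, expected=EXPECTED_FIELDS):
    """Separate rows with the expected field count from all other rows."""
    good, bad = [], []
    for row in rows:
        if len(row) == expected:
            good.append(row)
        else:
            bad.append(row)
    return good, bad


# Made-up sample standing in for the parsed records in the notebook.
sample = [["a"] * 554, ["b"] * 246, ["c"] * 554, ["d"]]

hist = length_histogram(sample)
good, bad = split_by_width(sample)

print(sorted(hist.items()))
print("%d good, %d bad" % (len(good), len(bad)))
```

On the RDD itself, the same idea would look roughly like Data.map(len).countByValue() for the histogram and Data.filter(lambda x: len(x) == 554) for the rows to keep, with the opposite filter for the rows to inspect. Whether the odd-width rows are truly bad or just parsed incorrectly is still something only you can tell from the data.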
