Re: I am running simple count and I am getting an ...

miklos · ‎10-26-2015

Looking at the executor logs and failed tasks on your cluster, the issue is with how you're attempting to write the Parquet files out. A failed task writes out a partial file, and when that task is re-run it throws an IOException because the file already exists.

To find the underlying error, go to the Spark UI -> Failed Tasks -> View Details and look at the first executor task that failed.

The job defines a very long schema that doesn't match the input data, and that mismatch is what causes the failure. You would have to either clean up the data or add error handling when converting it to a DataFrame, before writing out to Parquet.
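
As a rough sketch of the error-handling option, something like the following might work. I don't know your actual input format, so the three-column layout, the comma delimiter, and the names `EXPECTED_FIELDS`, `parse_line`, and `clean_and_write` are all made up for illustration. The idea is simply to drop rows that don't fit the schema before the DataFrame is created, so the write tasks don't fail partway through.

```python
# Hypothetical three-column layout; the real job's schema is much longer.
EXPECTED_FIELDS = [("id", int), ("name", str), ("score", float)]


def parse_line(line, delimiter=","):
    """Return a tuple matching EXPECTED_FIELDS, or None if the line does not fit."""
    parts = line.split(delimiter)
    if len(parts) != len(EXPECTED_FIELDS):
        return None  # wrong number of columns
    try:
        return tuple(
            cast(value.strip())
            for (_, cast), value in zip(EXPECTED_FIELDS, parts)
        )
    except ValueError:
        return None  # a value could not be converted to its expected type


def clean_and_write(spark, input_path, output_path):
    """Assumes `spark` is an existing SparkSession and the input is delimited text."""
    parsed = spark.sparkContext.textFile(input_path).map(parse_line)
    good = parsed.filter(lambda row: row is not None)  # keep only rows that parsed
    columns = [name for name, _ in EXPECTED_FIELDS]
    spark.createDataFrame(good, columns).write.parquet(output_path)
```

You could also count the rows where `parse_line` returns None before filtering them out, to get a sense of how much of the input is malformed.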