Community Discussions
Not loading csv files with ".c000.csv" in the name

jenshumrich
New Contributor III

Yesterday I created a ton of CSV files via

joined_df.write.partitionBy("PartitionColumn").mode("overwrite").csv(
    output_path, header=True
)

Today, when working with them, I realized that they were not loaded. Upon investigation I saw that the PartitionColumn folder contains only a "_started_123" file and a "par-00123-tic-123[.....].c000.csv" file, but no "_SUCCESS".
After renaming the csv files, they are loaded correctly.
Now my question: What the heck is going on here? Was the writing process broken, and if so, why was this not logged? Why do the files have a ".c000.csv" ending? Why are they not loaded?

4 REPLIES

MichTalebzadeh
Contributor

This looks like an incomplete Spark write operation rather than a problem with your file naming or partitioning. Your partitioned Spark job wrote part files with the ".c000.csv" suffix, and the missing "_SUCCESS" file suggests the write did not finish successfully. The data may well be sitting in those part files, but readers that rely on the "_SUCCESS" marker will treat the directory as incomplete and skip it.
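As an aside, the ".c000" suffix itself is part of Spark's normal output-file naming, not a sign of corruption. As a rough illustration (the exact layout varies by Spark version and output committer, so the pattern and the sample file name below are assumptions, not a specification), a part-file name can be picked apart like this:

```python
import re

# Rough anatomy of a Spark part-file name (illustrative, not a spec):
# part-<task-id>-<uuid>[-c<file-counter>].csv
PART_FILE = re.compile(
    r"part-(?P<task>\d{5})-"       # zero-padded task/partition id
    r"(?P<uuid>[0-9a-fA-F-]+?)"    # unique job/task identifier
    r"(?:[.-]c(?P<counter>\d+))?"  # optional per-task file counter (c000, c001, ...)
    r"\.csv$"
)

def describe_part_file(name: str) -> dict:
    """Split a part-file name into its components."""
    m = PART_FILE.search(name)
    if not m:
        raise ValueError(f"not a recognised part-file name: {name}")
    return {
        "task": int(m.group("task")),
        "uuid": m.group("uuid"),
        "counter": int(m.group("counter") or 0),
    }

# Made-up example name in the typical shape
info = describe_part_file(
    "part-00123-4d2f9b1a-aaaa-bbbb-cccc-123456789abc-c000.csv"
)
print(info)  # task=123, counter=0
```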

IHTH

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one thousand expert opinions" (Wernher von Braun).

Let us try to simulate this error

from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Attempt the partitioned write (wrapped so any failure is surfaced)
try:
    df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
    print("Write failed:", e)

# Check for the "_SUCCESS" marker. Note that os.path only sees the driver's
# local file system; on Databricks the write above goes to DBFS, so this
# check can report the marker as missing even after a successful write.
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")

and the output

_SUCCESS file missing (indicates incomplete write)

jenshumrich
New Contributor III

Thanks Mich, you are partially right and it helped a lot!
Using your code, I was able to see that it also wrote files ending in ".c000.csv". https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files say...
The check whether the file is available must use dbutils instead (os checks the local file system of the driver node, no?). And since dbutils.fs.ls raises an exception when the path does not exist, it needs a try/except:

try:
    dbutils.fs.ls(success_file)
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
except Exception:
    print("_SUCCESS file missing (indicates incomplete write)")
And even though the files end with ".c000.csv" , I was able to read them in:
test_df = spark.read.option("basePath", "/tmp/output").csv("/tmp/output/", header=True)
test_df.show()

[screenshot: jenshumrich_0-1710490864667.png]
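The _SUCCESS-dependent loading pattern discussed above can be simulated with plain Python on the local file system (no Spark needed; the directory layout and file names below are made up for illustration, and real part files carry a UUID in the name):

```python
import os
import tempfile

# Fake Spark output directory with a single part file
out = tempfile.mkdtemp()
with open(os.path.join(out, "part-00000-c000.csv"), "w") as f:
    f.write("col1,col2\nA,1\n")

def load_if_complete(path):
    """Return the csv part files only if the _SUCCESS marker exists."""
    if not os.path.exists(os.path.join(path, "_SUCCESS")):
        return []  # no marker: treat the write as incomplete
    return [name for name in os.listdir(path) if name.endswith(".csv")]

print(load_if_complete(out))  # no marker yet -> []
open(os.path.join(out, "_SUCCESS"), "w").close()
print(load_if_complete(out))  # marker present -> ['part-00000-c000.csv']
```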

 

jenshumrich
New Contributor III

Then removing the "_committed_" file stops Spark from reading in the other files.

[screenshot: jenshumrich_1-1710491115337.png]
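For context: Databricks' transactional writer records "_started_&lt;id&gt;" and "_committed_&lt;id&gt;" markers next to the data files, and its reader skips files whose transaction never committed, which is why deleting the "_committed_" marker hides the part files. That behaviour can be sketched locally (plain Python; the marker names mirror the screenshots, but the filtering logic is a simplified assumption, not the real DBIO implementation):

```python
import glob
import os
import tempfile

# Fake Databricks-style output directory with a made-up transaction id
out = tempfile.mkdtemp()
txn = "1234567890123456789"
for name in (f"_started_{txn}", f"_committed_{txn}", "part-00000-c000.csv"):
    open(os.path.join(out, name), "w").close()

def readable_csvs(path):
    """List csv part files, but only while a commit marker is present."""
    if not glob.glob(os.path.join(path, "_committed_*")):
        return []  # no committed transaction: data stays invisible
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(path, "*.csv")))

print(readable_csvs(out))  # -> ['part-00000-c000.csv']
os.remove(os.path.join(out, f"_committed_{txn}"))
print(readable_csvs(out))  # marker removed -> []
```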