Not loading csv files with ".c000.csv" in the name

jenshumrich — Thu, 14 Mar 2024 12:42:48 GMT

Yesterday I created a ton of csv files via

joined_df.write.partitionBy("PartitionColumn").mode("overwrite").csv(
    output_path, header=True
)
Today, when working with them, I realized that they were not loaded. Upon investigation, I saw that each PartitionColumn folder contains only a "_started_123" file and a "par-00123-tic-123[.....].c000.csv" file, so there is no "_SUCCESS" file.
After renaming the csv files, they are loaded correctly.
Now my question: What the heck is going on here? Was the writing process broken, and if so, why was this not logged? Why do the files have a ".c000.csv" ending? Why are they not loaded?

Re: Not loading csv files with ".c000.csv" in the name

MichTalebzadeh — Thu, 14 Mar 2024 15:45:08 GMT

You are likely confusing Spark with your file naming notation and partitioning. This error is likely due to an incomplete Spark write operation. Your Spark job, which uses partitioning, created temporary files with a ".c000.csv" extension. The missing "_SUCCESS" file suggests the write operation did not finish successfully. You may have data in Spark temporary files, but it may not have been loaded into the partitions because loading relies on the "_SUCCESS" marker.

IHTH

Re: Not loading csv files with ".c000.csv" in the name

MichTalebzadeh — Thu, 14 Mar 2024 16:16:09 GMT

Let us try to simulate this error

from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Simulate an error during write
try:
    df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
    print("Simulating write error:", e)

# Check for existence of "_SUCCESS" file in local /tmp
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")

and the output

_SUCCESS file missing (indicates incomplete write)
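
A caveat on the script above: the try block only prints something if the write actually raises, so on a healthy cluster no error is simulated, and os.path.exists looks at the local file system of the machine running the script. As a minimal sketch of the marker check on its own, assuming the output directory is reachable as a local path, the following separates marker files from data files. The helper name inspect_output is made up for illustration, and treating every name that starts with "_" or "." as a marker is an assumption.

```python
from pathlib import Path


def inspect_output(output_dir):
    """Split the files under output_dir into marker files and data files.

    Hypothetical helper: names starting with "_" or "." are treated as
    marker/hidden files, everything else as data files.
    """
    root = Path(output_dir)
    markers, data_files = [], []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.name.startswith(("_", ".")):
            markers.append(rel)
        else:
            data_files.append(rel)
    return {
        # Top-level _SUCCESS marker, as checked in the script above
        "has_success": (root / "_SUCCESS").is_file(),
        "markers": markers,
        "data_files": data_files,
    }
```

Calling inspect_output("/tmp/output") after the write would show at a glance whether "_SUCCESS" is present and which "_started_..." style files sit next to the csv files.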

Re: Not loading csv files with ".c000.csv" in the name

jenshumrich — Fri, 15 Mar 2024 08:21:32 GMT

Thanks Mich, you are partially right and it helped a lot!
Using your code, I was able to see that it also wrote files with ".c000.csv" at the end. https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files says these files might be temporary.
The check for whether the file is available must instead use

if len(dbutils.fs.ls(success_file)) > 0:
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")

(os checks the local file system of the master node, no?)
And even though the files end with ".c000.csv" , I was able to read them in:
test_df = spark.read.option("basePath", "/tmp/output").csv("/tmp/output/", header=True)
test_df.show()
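
One caveat on the dbutils.fs.ls check in this post: as far as I know, dbutils.fs.ls raises an exception for a path that does not exist instead of returning an empty list, so the else branch would not be reached for a missing file. A minimal sketch of a safer check follows. path_exists is a hypothetical helper, and the listing function is passed in as an argument so the same code can also run outside Databricks.

```python
def path_exists(ls, path):
    """Return True if ls(path) succeeds, False if it raises.

    `ls` is any listing callable; on Databricks one would pass dbutils.fs.ls
    (assumption: it raises for a missing path instead of returning []).
    """
    try:
        ls(path)
        return True
    except Exception:
        # Broad on purpose: the exact exception type depends on the runtime.
        return False


# On Databricks (not runnable elsewhere):
# if path_exists(dbutils.fs.ls, "/tmp/output/_SUCCESS"):
#     print("_SUCCESS file found")
# else:
#     print("_SUCCESS file missing (indicates incomplete write)")
```

Catching every exception keeps the sketch short, but it also reports a permission problem as "missing", so a real check would probably narrow the except clause.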

Re: Not loading csv files with ".c000.csv" in the name

jenshumrich — Fri, 15 Mar 2024 08:26:17 GMT

Then removing the "_committed_" file stops Spark from reading in the other files.
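
That would fit the file names seen earlier in the thread, where the data files seem to carry an id that pairs them with the marker files. As a diagnostic sketch only, assuming names of the form _started_<id>, _committed_<id> and ...-tid-<id>-....c000.csv (an assumption about the naming, not something confirmed in this thread), the following groups a directory listing by that id so data files without a commit marker stand out. group_by_transaction and uncommitted_files are hypothetical helpers.

```python
import re
from collections import defaultdict

# Assumed naming patterns; adjust them to the names actually seen on disk.
MARKER_RE = re.compile(r"^_(started|committed)_(\d+)$")
DATA_RE = re.compile(r"-tid-(\d+)-")


def group_by_transaction(names):
    """Group marker and data file names by the id embedded in them."""
    groups = defaultdict(lambda: {"started": False, "committed": False, "data": []})
    for name in names:
        marker = MARKER_RE.match(name)
        if marker:
            groups[marker.group(2)][marker.group(1)] = True
            continue
        data = DATA_RE.search(name)
        if data:
            groups[data.group(1)]["data"].append(name)
    return dict(groups)


def uncommitted_files(names):
    """Data files whose id has no _committed_ marker in the listing."""
    return [
        f
        for g in group_by_transaction(names).values()
        if not g["committed"]
        for f in g["data"]
    ]
```

Feeding it the file names of one partition folder would list the csv files that have a "_started_" marker but no "_committed_" marker, which are the ones to look at first when files are silently skipped.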