Not loading csv files with ".c000.csv" in the name
03-14-2024 05:42 AM
Yesterday I created a ton of CSV files via a partitioned Spark write.
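The exact snippet was presumably something along these lines (a hypothetical reconstruction: the DataFrame name, output path, and options are assumptions; the partition column name comes from the folder mentioned below):
# Hypothetical reconstruction of the write (names and path assumed)
df.write.partitionBy("PartitionColumn").mode("overwrite").csv("/some/output/path", header=True)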
Today, when working with them, I realized that they were not loaded. Upon investigation I saw that the PartitionColumn folder contains only a "_started_123" file and a "par-00123-tic-123[.....].c000.csv" file, so no "_SUCCESS".
When I rename the CSV files, they are loaded correctly.
Now my question: what the heck is going on here? Was the writing process broken, and if so, why was this not logged? Why do the files have a ".c000.csv" ending? Why are they not loaded?
03-14-2024 08:45 AM
You are likely being confused by Spark's file naming and partitioning. This looks like an incomplete Spark write operation: your Spark job, using partitioning, created what look like temporary files with the ".c000.csv" extension, and the missing "_SUCCESS" file suggests the write did not finish successfully. You may have data in those files, but it may not be picked up, because readers rely on the "_SUCCESS" marker.
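For context, the "_SUCCESS" file is written by Hadoop's FileOutputCommitter when a job commits successfully; a small sketch of the Hadoop setting that controls it (true by default):
# The _SUCCESS marker is created by Hadoop's FileOutputCommitter on job commit;
# this Hadoop configuration flag (true by default) toggles its creation.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "true"
)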
IHTH
London
United Kingdom
view my LinkedIn profile
https://en.everybodywiki.com/Mich_Talebzadeh
Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun).
03-14-2024 09:16 AM
Let us try to simulate this error:
from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Attempt a partitioned write; any error that interrupts it is caught below
try:
    df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
    print("Simulating write error:", e)

# Check for the existence of the "_SUCCESS" file in local /tmp.
# Note: os.path.exists() only sees the driver's local filesystem; on Databricks
# an unqualified path like /tmp/output may resolve to DBFS for Spark, so this
# check can report the marker as missing even after a successful write.
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")
and the output:
_SUCCESS file missing (indicates incomplete write)
03-15-2024 01:21 AM
Thanks Mich, you are partially right, and it helped a lot!
Using your code I was able to see that it also wrote files ending in ".c000.csv". https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files explains where that suffix comes from: it is part of Spark's normal output file naming, not a temporary-file marker.
The check for whether the file is available must use the filesystem that Spark actually wrote to, not the driver's local filesystem.
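A minimal sketch of such a check, going through Hadoop's FileSystem API so that it resolves the same filesystem Spark wrote to (the output path is an assumption; on Databricks, dbutils.fs.ls would work as well):
# Sketch: check for the _SUCCESS marker via Hadoop's FileSystem API, which
# resolves the same filesystem Spark wrote to (path assumed for illustration)
hadoop_path = spark._jvm.org.apache.hadoop.fs.Path("/tmp/output/_SUCCESS")
fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
print("_SUCCESS present:", fs.exists(hadoop_path))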
And even though the files end with ".c000.csv", I was able to read them in:
test_df = spark.read.option("basePath", "/tmp/output").csv("/tmp/output/", header=True)
03-15-2024 01:26 AM
Then removing the "_committed_" file stops Spark from reading in the other files. So on Databricks it seems to be the "_started_"/"_committed_" files from its transactional commit protocol, rather than "_SUCCESS", that determine which data files are read.
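A minimal sketch of that experiment, assuming the /tmp/output path from above and that dbutils is available (i.e. a Databricks notebook):
# Sketch (paths assumed): remove the _committed_ markers written by Databricks'
# transactional commit protocol, then try reading the directory again.
def rm_committed(path):
    for f in dbutils.fs.ls(path):
        if f.name.startswith("_committed_"):
            dbutils.fs.rm(f.path)
        elif f.isDir():
            rm_committed(f.path)

rm_committed("/tmp/output")
# With the markers gone, Spark on Databricks ignores the data files:
spark.read.csv("/tmp/output", header=True).show()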

