
Not loading csv files with ".c000.csv" in the name

jenshumrich
Contributor

Yesterday I created a ton of CSV files via

joined_df.write.partitionBy("PartitionColumn").mode("overwrite").csv(
    output_path, header=True
)
Today, when working with them, I realized that they were not loaded. Upon investigation I saw that the PartitionColumn folder contains only a "_started_123" file and a "par-00123-tic-123[.....].c000.csv" file, so no "_SUCCESS".
When renaming the CSV files, they load correctly.
Now my question: What the heck is going on here? Was the writing process broken, and if so, why was this not logged? Why do the files have a ".c000.csv" ending? Why are they not loaded?
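For anyone reproducing this, the folder contents can be listed like this (just a sketch, assuming output_path is the same DBFS path passed to the write above):

# Sketch: list what the partitioned write produced. Assumes output_path
# is the DBFS path used in the .csv() call above; directory entries from
# dbutils.fs.ls have a trailing "/" in their name.
for entry in dbutils.fs.ls(output_path):
    print(entry.path)
    if entry.name.endswith("/"):  # partition sub-folder
        for f in dbutils.fs.ls(entry.path):
            print("    ", f.name, f.size)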

4 REPLIES

MichTalebzadeh
Valued Contributor

You are probably being confused by Spark's file naming and partitioning behaviour. This is likely the result of an incomplete Spark write operation: your partitioned Spark job created part files with the ".c000.csv" extension, and the missing "_SUCCESS" file suggests the write did not finish successfully. You may have data in Spark's temporary files, but it may not have been loaded into the partitions because readers rely on the "_SUCCESS" marker.

IHTH

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London, United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but cannot be guaranteed. As with any advice, "one test result is worth one-thousand expert opinions" (Werner von Braun).

Let us try to simulate this error

from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Simulate an error during write
try:
  df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
  print("Simulating write error:", e)

# Check for existence of "_SUCCESS" file in local /tmp
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
  print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
  print("_SUCCESS file missing (indicates incomplete write)")

and the output:

_SUCCESS file missing (indicates incomplete write)

jenshumrich
Contributor

Thanks Mich, you are partially right and it helped a lot!
Using your code, I was able to see that it also wrote files ending in ".c000.csv". https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files says ...
The check whether the file is available must use

if len(dbutils.fs.ls(success_file)) > 0:
  print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
  print("_SUCCESS file missing (indicates incomplete write)")
instead, though (os.path.exists checks the local file system of the master node, no?)
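For completeness, dbutils.fs.ls raises an exception when the path does not exist, so a check that does not blow up on a missing marker could look like this (just a sketch):

# Sketch: dbutils.fs.ls throws for a non-existent path, so wrap it
# in a try/except to get a boolean existence check on DBFS.
def dbfs_exists(path: str) -> bool:
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

if dbfs_exists("/tmp/output/_SUCCESS"):
    print("_SUCCESS file found")
else:
    print("_SUCCESS file missing (indicates incomplete write)")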
And even though the files end with ".c000.csv", I was able to read them in:
test_df = spark.read.option("basePath", "/tmp/output").csv("/tmp/output/", header=True)
test_df.show()

[Screenshot: output of test_df.show()]

jenshumrich
Contributor

Then removing the "_committed_" file stops Spark from reading in the other files.

[Screenshot: read result after removing the _committed_ file]
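To make that easy to check, the marker files next to the partition folders can be listed like this (a sketch, using the same /tmp/output path as in Mich's example):

# Sketch: show which marker files (_started_*, _committed_*, _SUCCESS)
# sit alongside the partition folders in the output directory.
for f in dbutils.fs.ls("/tmp/output"):
    print(f.name)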
