Community Discussions
Not loading csv files with ".c000.csv" in the name

jenshumrich
New Contributor III

Yesterday I created a ton of CSV files via

joined_df.write.partitionBy("PartitionColumn").mode("overwrite").csv(
    output_path, header=True
)

Today, when working with them, I realized that they were not loaded. Upon investigation I saw that the PartitionColumn folder contains only a "_started_123" file and a "par-00123-tic-123[.....].c000.csv" file, but no "_SUCCESS".
After renaming the csv files, they are loaded correctly.
Now my question: What the heck is going on here? Was the writing process broken, and if so, why was this not logged? Why do the files have a ".c000.csv" ending? Why are they not loaded?

4 REPLIES

MichTalebzadeh
Contributor

This looks like an incomplete Spark write operation rather than a problem with your file naming or partitioning. Your partitioned Spark job wrote part files with the ".c000.csv" suffix, and the missing "_SUCCESS" file suggests the write did not finish successfully. The data may well be sitting in those part files, but readers that rely on the "_SUCCESS" marker will treat the directory as incomplete and skip it.
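As an aside, the ".c000" suffix itself is part of Spark's normal output-file naming, not a sign of corruption. As a rough illustration (the exact layout varies by Spark version and output committer, so the pattern and the sample file name below are assumptions, not a specification), a part-file name can be picked apart like this:

```python
import re

# Rough anatomy of a Spark part-file name (illustrative, not a spec):
# part-<task-id>-<uuid>[-c<file-counter>].csv
PART_FILE = re.compile(
    r"part-(?P<task>\d{5})-"       # zero-padded task/partition id
    r"(?P<uuid>[0-9a-fA-F-]+?)"    # unique job/task identifier
    r"(?:[.-]c(?P<counter>\d+))?"  # optional per-task file counter (c000, c001, ...)
    r"\.csv$"
)

def describe_part_file(name: str) -> dict:
    """Split a part-file name into its components."""
    m = PART_FILE.search(name)
    if not m:
        raise ValueError(f"not a recognised part-file name: {name}")
    return {
        "task": int(m.group("task")),
        "uuid": m.group("uuid"),
        "counter": int(m.group("counter") or 0),
    }

# Made-up example name in the typical shape
info = describe_part_file(
    "part-00123-4d2f9b1a-aaaa-bbbb-cccc-123456789abc-c000.csv"
)
print(info)  # task=123, counter=0
```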

IHTH

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one thousand expert opinions" (Wernher von Braun).

Let us try to simulate this error

from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Attempt the partitioned write (wrapped so any failure is surfaced)
try:
    df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
    print("Write failed:", e)

# Check for the "_SUCCESS" marker. Note that os.path only sees the driver's
# local file system; on Databricks the write above goes to DBFS, so this
# check can report the marker as missing even after a successful write.
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")

and the output

_SUCCESS file missing (indicates incomplete write)

jenshumrich
New Contributor III

Thanks Mich, you are partially right and it helped a lot!
Using your code, I was able to see that it also wrote files ending in ".c000.csv". https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files say...
The check whether the file is available must use dbutils instead (os checks the local file system of the driver node, no?). And since dbutils.fs.ls raises an exception when the path does not exist, it needs a try/except:

try:
    dbutils.fs.ls(success_file)
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
except Exception:
    print("_SUCCESS file missing (indicates incomplete write)")
And even though the files end with ".c000.csv" , I was able to read them in:
test_df = spark.read.option("basePath", "/tmp/output").csv("/tmp/output/", header=True)
test_df.show()

[screenshot: jenshumrich_0-1710490864667.png]
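The _SUCCESS-dependent loading pattern discussed above can be simulated with plain Python on the local file system (no Spark needed; the directory layout and file names below are made up for illustration, and real part files carry a UUID in the name):

```python
import os
import tempfile

# Fake Spark output directory with a single part file
out = tempfile.mkdtemp()
with open(os.path.join(out, "part-00000-c000.csv"), "w") as f:
    f.write("col1,col2\nA,1\n")

def load_if_complete(path):
    """Return the csv part files only if the _SUCCESS marker exists."""
    if not os.path.exists(os.path.join(path, "_SUCCESS")):
        return []  # no marker: treat the write as incomplete
    return [name for name in os.listdir(path) if name.endswith(".csv")]

print(load_if_complete(out))  # no marker yet -> []
open(os.path.join(out, "_SUCCESS"), "w").close()
print(load_if_complete(out))  # marker present -> ['part-00000-c000.csv']
```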

 

jenshumrich
New Contributor III

Then removing the "_committed_" file stops Spark from reading in the other files.

[screenshot: jenshumrich_1-1710491115337.png]
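For context: Databricks' transactional writer records "_started_&lt;id&gt;" and "_committed_&lt;id&gt;" markers next to the data files, and its reader skips files whose transaction never committed, which is why deleting the "_committed_" marker hides the part files. That behaviour can be sketched locally (plain Python; the marker names mirror the screenshots, but the filtering logic is a simplified assumption, not the real DBIO implementation):

```python
import glob
import os
import tempfile

# Fake Databricks-style output directory with a made-up transaction id
out = tempfile.mkdtemp()
txn = "1234567890123456789"
for name in (f"_started_{txn}", f"_committed_{txn}", "part-00000-c000.csv"):
    open(os.path.join(out, name), "w").close()

def readable_csvs(path):
    """List csv part files, but only while a commit marker is present."""
    if not glob.glob(os.path.join(path, "_committed_*")):
        return []  # no committed transaction: data stays invisible
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(path, "*.csv")))

print(readable_csvs(out))  # -> ['part-00000-c000.csv']
os.remove(os.path.join(out, f"_committed_{txn}"))
print(readable_csvs(out))  # marker removed -> []
```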