03-14-2024 05:42 AM
Yesterday I created a ton of CSV files via
03-14-2024 08:45 AM
You are likely confusing Spark's file-naming convention and partitioning with your own naming scheme. The error points to an incomplete Spark write operation: your Spark job, writing with partitioning, created temporary files with the ".c000.csv" extension, and the missing "_SUCCESS" file suggests the write operation did not finish successfully. The data may well exist in those Spark temporary files, but it has not been committed into the partitions, because consumers rely on the "_SUCCESS" marker.
IHTH
03-14-2024 09:16 AM
Let us try to simulate this error:
from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Simulate an error during write
try:
    df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
    print("Simulating write error:", e)

# Check for existence of "_SUCCESS" file in local /tmp
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")
and the output:
_SUCCESS file missing (indicates incomplete write)
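A quick way to see whether the write actually produced data despite the missing marker is to walk the output directory and separate part files from underscore-prefixed marker files. The sketch below fakes Spark's partitioned layout with plain Python so it runs without a cluster; the directory and file names are illustrative assumptions, not Spark output:

```python
import os
import tempfile

# Recreate a miniature stand-in for Spark's partitioned CSV layout
# (paths and file names here are illustrative, not produced by Spark).
out = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(os.path.join(out, "col1=A"))
with open(os.path.join(out, "col1=A", "part-00000-xyz.c000.csv"), "w") as f:
    f.write("col2\n1\n")

# Walk the output directory, separating data files from marker files:
# part files carry the data even when the _SUCCESS marker is absent.
data_files, markers = [], []
for _root, _dirs, files in os.walk(out):
    for name in files:
        (markers if name.startswith("_") else data_files).append(name)

print("data files:", data_files)
print("markers:", markers)
```

Running the same walk over the real /tmp/output would show whether the partition directories contain part files even though "_SUCCESS" is missing.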
03-15-2024 01:21 AM
Thanks Mich, you are partially right and it helped a lot!
Using your code, I was able to see that it also wrote files ending in ".c000.csv". https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files says...
The check for whether the file is available must use
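Worth noting: ".c000.csv" is part of Spark's normal part-file naming (roughly part-&lt;task number&gt;-&lt;uuid&gt;.c000.csv), not a sign of corruption, so any availability check has to match that naming rather than a plain "*.csv" pattern of your own. A minimal sketch; the exact regex is an approximation of the convention, not taken from Spark's source:

```python
import re

# A Spark CSV part file is typically named
#   part-<task number>-<uuid>.c000.csv
# where "c000" is a per-task output-file counter, not an error marker.
# The pattern below is an illustrative approximation of that convention.
PART_FILE = re.compile(r"^part-\d{5}-[0-9a-f-]+\.c\d{3}\.csv$")

def is_part_file(name: str) -> bool:
    """True if a file name looks like a Spark CSV part file."""
    return bool(PART_FILE.match(name))

print(is_part_file("part-00000-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d.c000.csv"))
print(is_part_file("_SUCCESS"))
```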
03-15-2024 01:26 AM
Then removing the "_commited_" file stops Spark from reading in the other files.
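Instead of deleting the commit-marker file, a reader can simply skip anything whose name starts with an underscore, which is the convention the markers mentioned in this thread follow. A minimal sketch; the example marker names are taken from this thread, and the exact set on your platform is an assumption:

```python
# Skip bookkeeping files (underscore-prefixed markers) when collecting
# the data files of a Spark output directory. The example marker names
# come from this thread; the exact set on your platform may differ.
MARKER_PREFIX = "_"

def data_files_only(names):
    """Filter out marker files such as _SUCCESS or _commited_* entries."""
    return [n for n in names if not n.startswith(MARKER_PREFIX)]

listing = ["part-00000-abc.c000.csv", "_SUCCESS", "_commited_123"]
print(data_files_only(listing))
```

In Spark 3.0+ the same effect is available at read time via the pathGlobFilter option, e.g. spark.read.option("pathGlobFilter", "*.csv").csv(path), which matches only file names ending in .csv and therefore skips the marker files without touching them on disk.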