<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Not loading csv files with &quot;.c000.csv&quot; in the name in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63713#M6805</link>
    <description>&lt;P&gt;Your file naming and partitioning are likely confusing Spark. This error is probably due to an incomplete Spark write operation: your partitioned Spark job created temporary files with the ".c000.csv" extension. The missing "_SUCCESS" file suggests that the write did not finish successfully. The data may be sitting in Spark's temporary files without having been loaded into the partitions, because loading relies on the "_SUCCESS" marker.&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
    <pubDate>Thu, 14 Mar 2024 15:45:08 GMT</pubDate>
    <dc:creator>MichTalebzadeh</dc:creator>
    <dc:date>2024-03-14T15:45:08Z</dc:date>
    <item>
      <title>Not loading csv files with ".c000.csv" in the name</title>
      <link>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63696#M6804</link>
      <description>&lt;P&gt;Yesterday I created a ton of CSV files via&lt;/P&gt;&lt;LI-CODE lang="python"&gt;joined_df.write.partitionBy("PartitionColumn").mode("overwrite").csv(
    output_path, header=True
)&lt;/LI-CODE&gt;&lt;P&gt;Today, when working with them, I realized that they had not been loaded. Upon investigation I saw that the PartitionColumn folder contains only a "_started_123" file and a "par-00123-tic-123[.....].c000.csv" file, so there is no "_SUCCESS" file.&lt;BR /&gt;After I rename the CSV files, they load correctly.&lt;BR /&gt;Now my question: what the heck is going on here? Was the write process broken, and if so, why was this not logged? Why do the files have a ".c000.csv" ending? Why are they not loaded?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Mar 2024 12:42:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63696#M6804</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-03-14T12:42:48Z</dc:date>
    </item>
    <item>
      <title>Re: Not loading csv files with ".c000.csv" in the name</title>
      <link>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63713#M6805</link>
      <description>&lt;P&gt;Your file naming and partitioning are likely confusing Spark. This error is probably due to an incomplete Spark write operation: your partitioned Spark job created temporary files with the ".c000.csv" extension. The missing "_SUCCESS" file suggests that the write did not finish successfully. The data may be sitting in Spark's temporary files without having been loaded into the partitions, because loading relies on the "_SUCCESS" marker.&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
      <pubDate>Thu, 14 Mar 2024 15:45:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63713#M6805</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-14T15:45:08Z</dc:date>
    </item>
    <item>
      <title>Re: Not loading csv files with ".c000.csv" in the name</title>
      <link>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63714#M6806</link>
      <description>&lt;P&gt;Let us try to simulate this error&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession
import os

# Create a SparkSession
spark = SparkSession.builder.appName("SomeTestsForIncompleteWriteSimulation").getOrCreate()

# Sample DataFrame
data = [("A", 1), ("B", 2), ("A", 3), ("C", 5)]
df = spark.createDataFrame(data, ["col1", "col2"])

# Attempt a partitioned write and report any exception it raises
try:
    df.write.partitionBy("col1").mode("overwrite").csv("/tmp/output", header=True)
except Exception as e:
    print("Write failed:", e)

# Check for the "_SUCCESS" marker. Note: os.path.exists inspects the driver's
# local file system, which may differ from where Spark wrote "/tmp/output"
success_file = "/tmp/output/_SUCCESS"
if os.path.exists(success_file):
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")
&lt;/LI-CODE&gt;&lt;P&gt;and the output&lt;/P&gt;&lt;LI-CODE lang="python"&gt;_SUCCESS file missing (indicates incomplete write)&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 14 Mar 2024 16:16:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63714#M6806</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-14T16:16:09Z</dc:date>
    </item>
    <item>
      <title>Re: Not loading csv files with ".c000.csv" in the name</title>
      <link>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63775#M6807</link>
      <description>&lt;P&gt;Thanks Mich, you are partially right and it helped a lot!&lt;BR /&gt;Using your code, I was able to see that it also wrote files with ".c000.csv" at the end. &lt;A href="https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files" target="_blank"&gt;https://stackoverflow.com/questions/54190082/spark-structured-streaming-producing-c000-csv-files&lt;/A&gt; says these files might be temporary.&lt;BR /&gt;The check for whether the file exists must use&lt;/P&gt;&lt;LI-CODE lang="python"&gt;if len(dbutils.fs.ls(success_file)) &amp;gt; 0:
    print("_SUCCESS file found (might not reflect reality if error occurred earlier)")
else:
    print("_SUCCESS file missing (indicates incomplete write)")&lt;/LI-CODE&gt;&lt;P&gt;though (os checks the local file system of the master node, no?)&lt;BR /&gt;And even though the files end with ".c000.csv", I was able to read them in:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;test_df = spark.read.option("basePath", "/tmp/output").csv("/tmp/output/", header=True)
test_df.show()&lt;/LI-CODE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="jenshumrich_0-1710490864667.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6671i674ACFEBF7CD0572/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="jenshumrich_0-1710490864667.png" alt="jenshumrich_0-1710490864667.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 15 Mar 2024 08:21:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63775#M6807</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-03-15T08:21:32Z</dc:date>
    </item>
    <item>
      <title>Re: Not loading csv files with ".c000.csv" in the name</title>
      <link>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63780#M6808</link>
      <description>&lt;P&gt;Removing the "_committed_" file then stops Spark from reading the other files.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="jenshumrich_1-1710491115337.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6674i02490D02AF99D107/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="jenshumrich_1-1710491115337.png" alt="jenshumrich_1-1710491115337.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 15 Mar 2024 08:26:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/not-loading-csv-files-with-quot-c000-csv-quot-in-the-name/m-p/63780#M6808</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-03-15T08:26:17Z</dc:date>
    </item>
  </channel>
</rss>

