<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Error writing parquet to specific container in Azure Data Lake in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21193#M14410</link>
    <description>&lt;P&gt;I'm retrieving two files from container1, transforming and merging them, and then writing the result to container2 within the same Storage Account in Azure. I mount container1, then unmount it and mount container2 before writing.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My code for writing the parquet:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df_spark.coalesce(1).write.option("header",True) \
        .partitionBy('ZMTART') \
        .mode("overwrite") \
        .parquet('/mnt/temp/')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I'm getting the following error when writing to container2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
&amp;lt;command-3769031361803403&amp;gt; in &amp;lt;cell line: 2&amp;gt;()
      1 spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
----&amp;gt; 2 df_spark.coalesce(1).write.option("header",True) \
      3         .partitionBy('ZMTART') \
      4         .mode("overwrite") \
      5         .parquet('/mnt/temp/')
&amp;nbsp;
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter()
     47             try:
---&amp;gt; 48                 res = func(*args, **kwargs)
     49                 logger.log_success(
     50                     module_name, class_name, function_name, time.perf_counter() - start, signature
&amp;nbsp;
/databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode, partitionBy, compression)
   1138             self.partitionBy(partitionBy)
   1139         self._set_opts(compression=compression)
-&amp;gt; 1140         self._jwrite.parquet(path)
   1141 &lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The odd thing is that writing the exact same dataframe to container1 is no problem, even with the same write code and only a different mount. Generating random data in the script and writing that to container2 is also no problem. So the problem seems to be specific to that dataframe in combination with that container.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm fairly new to Databricks, so please let me know if any additional information is needed.&lt;/P&gt;</description>
    <pubDate>Tue, 22 Nov 2022 15:50:42 GMT</pubDate>
    <dc:creator>magnus778</dc:creator>
    <dc:date>2022-11-22T15:50:42Z</dc:date>
    <item>
      <title>Error writing parquet to specific container in Azure Data Lake</title>
      <link>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21193#M14410</link>
      <description>&lt;P&gt;I'm retrieving two files from container1, transforming and merging them, and then writing the result to container2 within the same Storage Account in Azure. I mount container1, then unmount it and mount container2 before writing.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My code for writing the parquet:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df_spark.coalesce(1).write.option("header",True) \
        .partitionBy('ZMTART') \
        .mode("overwrite") \
        .parquet('/mnt/temp/')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I'm getting the following error when writing to container2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
&amp;lt;command-3769031361803403&amp;gt; in &amp;lt;cell line: 2&amp;gt;()
      1 spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
----&amp;gt; 2 df_spark.coalesce(1).write.option("header",True) \
      3         .partitionBy('ZMTART') \
      4         .mode("overwrite") \
      5         .parquet('/mnt/temp/')
&amp;nbsp;
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter()
     47             try:
---&amp;gt; 48                 res = func(*args, **kwargs)
     49                 logger.log_success(
     50                     module_name, class_name, function_name, time.perf_counter() - start, signature
&amp;nbsp;
/databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode, partitionBy, compression)
   1138             self.partitionBy(partitionBy)
   1139         self._set_opts(compression=compression)
-&amp;gt; 1140         self._jwrite.parquet(path)
   1141 &lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The odd thing is that writing the exact same dataframe to container1 is no problem, even with the same write code and only a different mount. Generating random data in the script and writing that to container2 is also no problem. So the problem seems to be specific to that dataframe in combination with that container.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm fairly new to Databricks, so please let me know if any additional information is needed.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Nov 2022 15:50:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21193#M14410</guid>
      <dc:creator>magnus778</dc:creator>
      <dc:date>2022-11-22T15:50:42Z</dc:date>
    </item>
    <item>
      <title>Re: Error writing parquet to specific container in Azure Data Lake</title>
      <link>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21194#M14411</link>
      <description>&lt;P&gt;Hi @Magnus Asperud,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;1. Mount container1.&lt;/P&gt;&lt;P&gt;2. Persist the data somewhere. Spark evaluates DataFrames lazily, so creating the df doesn't mean the data has actually been read from the container and stays accessible after unmounting. Make sure to store the merged data somewhere first.&lt;/P&gt;&lt;P&gt;Not sure if this will work:&lt;/P&gt;&lt;P&gt;df_spark.cache()&lt;/P&gt;&lt;P&gt;df_spark.count()&lt;/P&gt;&lt;P&gt;3. Unmount container1.&lt;/P&gt;&lt;P&gt;4. Mount container2.&lt;/P&gt;</description>
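Editor's sketch: the reply above hinges on lazy evaluation. The following is a plain-Python analogy, not Spark code and not the poster's actual pipeline. The dictionary `mounts` and the function `lazy_read` are hypothetical stand-ins for a mounted container and an unevaluated DataFrame. A generator, like a DataFrame plan, reads nothing until it is consumed, so removing the source before the first evaluation makes the later read fail, while materializing first (the role that cache() followed by count() plays in the reply) keeps the rows available.

```python
# Plain-Python analogy of lazy evaluation; no Spark required.
# `mounts` is a hypothetical stand-in for a mounted container.
mounts = {"/mnt/temp": ["row1", "row2"]}


def lazy_read(path):
    """Generator: the source is only touched when iteration starts,
    similar to a DataFrame that has been defined but not evaluated."""
    for row in mounts[path]:
        yield row.upper()  # stand-in for a transformation


# Defining the pipeline reads nothing yet.
lazy_df = lazy_read("/mnt/temp")

# Forcing evaluation while the source is still available,
# analogous to cache() followed by an action such as count().
materialized = list(lazy_read("/mnt/temp"))

# "Unmount" container1.
del mounts["/mnt/temp"]

# The lazy pipeline now fails because its source is gone.
try:
    lazy_rows = list(lazy_df)
except KeyError:
    lazy_rows = None

print("lazy result:", lazy_rows)
print("materialized result:", materialized)
```

One caveat on the analogy: in Spark, cached partitions can be evicted and recomputed from the original source, so caching is not a guaranteed substitute for writing the merged data to a location that stays reachable.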
      <pubDate>Tue, 22 Nov 2022 18:06:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21194#M14411</guid>
      <dc:creator>Pat</dc:creator>
      <dc:date>2022-11-22T18:06:35Z</dc:date>
    </item>
    <item>
      <title>Re: Error writing parquet to specific container in Azure Data Lake</title>
      <link>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21195#M14412</link>
      <description>&lt;P&gt;.cache() seems to work perfectly, thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 22 Nov 2022 21:31:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/error-writing-parquet-to-specific-container-in-azure-data-lake/m-p/21195#M14412</guid>
      <dc:creator>magnus778</dc:creator>
      <dc:date>2022-11-22T21:31:01Z</dc:date>
    </item>
  </channel>
</rss>

