<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I need to edit my parquet files, and change field name, replacing space by underscore in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27268#M19145</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I am running into the problem described in the following Stack Overflow questions:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv" target="test_blank"&gt;https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala" target="test_blank"&gt;https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;I have tried all the solutions mentioned there, but I get the same error every time. It seems that Spark cannot read fields with spaces in their names.&lt;/P&gt;
&lt;P&gt;So I am looking for another way to rename my fields and save the Parquet files back. After that I will continue my transformations with Spark.&lt;/P&gt;
&lt;P&gt;Can anyone help me out? Loads of love and thanks &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 02 Mar 2020 18:34:16 GMT</pubDate>
    <dc:creator>prakharjain</dc:creator>
    <dc:date>2020-03-02T18:34:16Z</dc:date>
    <item>
      <title>I need to edit my parquet files, and change field name, replacing space by underscore</title>
      <link>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27268#M19145</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I am running into the problem described in the following Stack Overflow questions:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv" target="test_blank"&gt;https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala" target="test_blank"&gt;https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;I have tried all the solutions mentioned there, but I get the same error every time. It seems that Spark cannot read fields with spaces in their names.&lt;/P&gt;
&lt;P&gt;So I am looking for another way to rename my fields and save the Parquet files back. After that I will continue my transformations with Spark.&lt;/P&gt;
&lt;P&gt;Can anyone help me out? Loads of love and thanks &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 Mar 2020 18:34:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27268#M19145</guid>
      <dc:creator>prakharjain</dc:creator>
      <dc:date>2020-03-02T18:34:16Z</dc:date>
    </item>
    <item>
      <title>Re: I need to edit my parquet files, and change field name, replacing space by underscore</title>
      <link>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27269#M19146</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;This looks like a known limitation caused by Parquet internals, and it will not be fixed. Apparently there is no workaround within Spark.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-27442" target="test_blank"&gt;https://issues.apache.org/jira/browse/SPARK-27442&lt;/A&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 May 2020 11:38:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27269#M19146</guid>
      <dc:creator>DimitriBlyumin</dc:creator>
      <dc:date>2020-05-13T11:38:34Z</dc:date>
    </item>
    <item>
      <title>Re: I need to edit my parquet files, and change field name, replacing space by underscore</title>
      <link>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27270#M19147</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;One option is to read the problematic file with something other than Spark, e.g. pandas, provided the file is small enough to fit in memory on the driver node (pandas runs only on the driver). If you have multiple files, you can loop through them and fix them one by one.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import pandas as pd

# Read with pandas through the local /dbfs mount (runs on the driver only)
df = pd.read_parquet('/dbfs/path/to/your/file.parquet')
# Rename the columns whose names contain spaces
df = df.rename(columns={
    "Column One": "col_one",
    "Column Two": "col_two",
})
dfSpark = spark.createDataFrame(df)  # convert to a Spark DataFrame
df.to_parquet('/dbfs/path/to/your/fixed/file.parquet')  # and/or save the fixed Parquet file&lt;/CODE&gt;&lt;/PRE&gt; 
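&lt;P&gt;For the multiple-files case, the loop could look roughly like the sketch below. It is only an illustration, not a tested recipe: the helper names sanitize and fix_parquet_files and the directory layout are hypothetical, and it assumes pandas with a Parquet engine such as pyarrow is installed and that each file fits in driver memory. Instead of an explicit mapping, it replaces every space in every column name with an underscore.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import glob
import os


def sanitize(name):
    """Replace each space in a column name with an underscore."""
    return name.replace(" ", "_")


def fix_parquet_files(src_dir, dst_dir):
    """Rewrite every *.parquet file in src_dir into dst_dir with sanitized column names."""
    # Imported here so that sanitize() can be used without pandas installed
    import pandas as pd

    os.makedirs(dst_dir, exist_ok=True)
    for src in sorted(glob.glob(os.path.join(src_dir, "*.parquet"))):
        df = pd.read_parquet(src)
        # rename() accepts a function and applies it to each column label
        df = df.rename(columns=sanitize)
        df.to_parquet(os.path.join(dst_dir, os.path.basename(src)))


# Hypothetical usage, with placeholder directories:
# fix_parquet_files("/dbfs/path/to/your", "/dbfs/path/to/your/fixed")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Writing the fixed files to a separate directory keeps the originals untouched, so you can compare or retry if something goes wrong.&lt;/P&gt;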
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 May 2020 11:48:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-need-to-edit-my-parquet-files-and-change-field-name-replacing/m-p/27270#M19147</guid>
      <dc:creator>DimitriBlyumin</dc:creator>
      <dc:date>2020-05-21T11:48:22Z</dc:date>
    </item>
  </channel>
</rss>

