Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I need to edit my parquet files and change field names, replacing spaces with underscores

prakharjain
New Contributor

Hello,

I am facing the trouble described in the following Stack Overflow topics:

https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribut...

https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-wri...

I have tried all the solutions mentioned there, but I get the same error every time. It's as if Spark cannot read fields with spaces in them.

So I am trying to find some other way just to rename my fields and save the Parquet files back. After that I will continue my transformations with Spark.

Can anyone help me out? Loads of love and thanks 🙂

1 ACCEPTED SOLUTION

DimitriBlyumin
New Contributor III

One option is to use something other than Spark to read the problematic file, e.g. Pandas, if your file is small enough to fit on the driver node (Pandas only runs on the driver). If you have multiple files, you can loop through them and fix them one by one; a sketch of such a loop follows the snippet below.

import pandas as pd

# read the problematic file with Pandas via the /dbfs FUSE mount
df = pd.read_parquet('/dbfs/path/to/your/file.parquet')

# replace the offending column names
df = df.rename(columns={
    "Column One": "col_one",
    "Column Two": "col_two"
})

dfSpark = spark.createDataFrame(df)  # convert to a Spark DataFrame and continue there
df.to_parquet('/dbfs/path/to/your/fixed/file.parquet')  # and/or save the fixed Parquet file
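
For the multiple-file case, here is a minimal sketch of that loop, assuming the files all sit in a single DBFS directory (both directory paths here are hypothetical):

import os
import pandas as pd

src_dir = '/dbfs/path/to/your/files'        # hypothetical input directory
dst_dir = '/dbfs/path/to/your/fixed_files'  # hypothetical output directory
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if not name.endswith('.parquet'):
        continue
    df = pd.read_parquet(os.path.join(src_dir, name))
    # replace spaces with underscores in every column name
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.to_parquet(os.path.join(dst_dir, name))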


2 REPLIES

DimitriBlyumin
New Contributor III

It looks like this is a known issue/limitation due to Parquet internals, and it will not be fixed. Apparently there is no workaround in Spark itself:

https://issues.apache.org/jira/browse/SPARK-27442
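
For reference, a minimal sketch that reproduces the limitation on the Spark versions discussed in this thread (the column names and path are illustrative):

# column names with spaces are accepted in a DataFrame...
df = spark.createDataFrame([(1, 2)], ["Column One", "Column Two"])

# ...but writing to Parquet fails, because Parquet field names may not
# contain spaces or any of the characters ,;{}()\n\t=
df.write.parquet('/dbfs/tmp/spaces_demo.parquet')
# -> AnalysisException: Attribute name "Column One" contains invalid character(s)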

