
How to convert the first row into column names for an existing dataframe

DineshKumar
New Contributor III

I have a dataframe like the one below. I want to convert the first row into column names for this dataframe. How can I do this? Is there any way to convert it directly (without using df.first)?

usdata.show()
+----------+---------+--------------------+--------------------+-------------+
|        _1|       _2|                  _3|                  _4|           _5|
+----------+---------+--------------------+--------------------+-------------+
|first_name|last_name|        company_name|             address|         city|
|     James|     Butt| "Benton, John B Jr"|  6649 N Blue Gum St|  New Orleans|
| Josephine|  Darakjy|"Chanay, Jeffrey ...| 4 B Blue Ridge Blvd|     Brighton|
|       Art|   Venere|"Chemel, James L ...|8 W Cerritos Ave #54|   Bridgeport|
|     Lenna| Paprocki|Feltz Printing Se...|         639 Main St|    Anchorage|
+----------+---------+--------------------+--------------------+-------------+

3 REPLIES

User16857282152
Contributor

Unless the dataframe is sorted, "first row" is not guaranteed to be consistent.

I can see about working up some code to do this; it is probably fairly straightforward.

However, if you can specify header = True when reading the data, then this problem is solved for you.
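
For example, a minimal sketch of that approach (the file path and options here are assumptions, not taken from the original post):

# A hedged sketch: read the CSV with its first line as the header.
# The path is hypothetical -- substitute your actual source.
usdata = spark.read.csv(
    "/path/to/usdata.csv",  # hypothetical path
    header=True,            # use the file's first line as column names
    inferSchema=True        # optional: let Spark guess column types
)
usdata.show()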

DineshKumar
New Contributor III

This is not reading the data from a file. I have a dataframe like the one above. I just have to convert the first row into column names for the dataframe. That's it.

User16857282152
Contributor

My point was that you are asking for column names from what you consider to be the "first row". At scale, or if the data volume grows, what you consider to be the "first row" may no longer actually be the first row. Unless the data is sorted, "first row" is not a meaningful term.

My point is, Spark is (or can be) a distributed process, and if this data is partitioned into two slices, what at that point is the meaning of "first row"?

That said, for small data from a single source, it often is consistent.

So here is some code that selects a single row; we can call it the "first row" if you want. It takes that single row and builds a list of column names.

Then it takes the schema (column names) from the original dataframe and rewrites it to use the values from the "first row".

Then it creates a new dataframe from the old one by dropping down to the underlying RDD.

This code works, but you have been warned: at some point this data was read from a file, and specifying the proper settings when reading that file is the safer approach.

If you cannot get to that, then perhaps add a filter before the limit to make sure you get the row you expect (see the note in the code below).

I put all the code into a single Python cell because this application makes a mess of multiple code blocks.


####
# IGNORE ALL OF THIS, till you see #PAY ATTENTION
# This is just internal stuff to show you a working example
# I build a dataframe here; you ALREADY have a dataframe
#####
# Set a SCHEMA for the test case
from pyspark.sql.types import StructField, StructType, StringType
schema = StructType([
    StructField('c1', StringType(), True),
    StructField('c2', StringType(), True)
])
# Create some data for the test case
data = [("ID", "Name"),
        ("5", "Joe"),
        ("6", "Mary")]
rdd = spark.sparkContext.parallelize(data,1)
df = spark.createDataFrame(rdd,schema)
# You now have a dataframe that is similar to your issue
# +---+----+
# | c1|  c2|
# +---+----+
# | ID|Name|
# |  5| Joe|
# |  6|Mary|
# +---+----+
#####
# #PAY ATTENTION from here to the end
# You could just grab this code BELOW and rename your dataframe to df with
# df = <YOUR DF>
# And run the code below
######
# Grab the Schema, or column names
DFSchema = df.schema
# build a list of column names from the "first" row
list_of_new_column_names = []
for i in df.limit(1).collect()[0]:
  list_of_new_column_names.append(i)
# You now have a list of column names that was
# generated by iterating over the values of the "first" row
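# NOTE: limit(1) does not guarantee which row you get on a partitioned
# dataframe. If you know a value from the header row, a filter before the
# limit pins down the intended row. A sketch, where the column 'c1' and the
# literal "ID" come from this example and are assumptions:
#   header_row = df.filter(df.c1 == "ID").limit(1).collect()[0]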
# Change the Schema to the new values
for i,k in enumerate(DFSchema.fields):
  k.name = list_of_new_column_names[i]
# DFSchema now looks like this:
# StructType(List(StructField(ID,StringType,true),StructField(Name,StringType,true)))
# Apply the schema to the original dataframe
new_df = spark.createDataFrame(df.rdd, DFSchema)
new_df.show()
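
Note that the rename does not remove the original header row from the data; new_df above still contains ("ID", "Name") as a data row. A minimal sketch of dropping it, assuming the header value in the first column ("ID" in this example) does not also occur as real data:

# Drop the leftover header row; the column name and the literal "ID"
# are taken from this example and are assumptions.
new_df = new_df.filter(new_df["ID"] != "ID")
new_df.show()

As a side note, once you have list_of_new_column_names, the rename itself can also be done with df.toDF(*list_of_new_column_names), which avoids rewriting the schema by hand.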
