
How to convert the first row into column names for an existing dataframe

DineshKumar
New Contributor III

I have a dataframe like the one below. I want to convert the first row into column names for this dataframe. How can I do this? Is there any way to convert it directly (without using df.first)?

usdata.show()
+----------+---------+--------------------+--------------------+-------------+
|        _1|       _2|                  _3|                  _4|           _5|
+----------+---------+--------------------+--------------------+-------------+
|first_name|last_name|        company_name|             address|         city|
|     James|     Butt| "Benton, John B Jr"|  6649 N Blue Gum St|  New Orleans|
| Josephine|  Darakjy|"Chanay, Jeffrey ...| 4 B Blue Ridge Blvd|     Brighton|
|       Art|   Venere|"Chemel, James L ...|8 W Cerritos Ave #54|   Bridgeport|
|     Lenna| Paprocki|Feltz Printing Se...|         639 Main St|    Anchorage|
+----------+---------+--------------------+--------------------+-------------+

3 REPLIES

User16857282152
Contributor

Unless the dataframe is sorted, "first row" is not guaranteed to be consistent.

I can see about working up some code to do this; it is probably fairly straightforward.

However, if you can specify header = True when reading the data, then this problem is solved for you.
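
For example, a minimal sketch of that approach (the file path and options here are assumptions, not taken from the original post):

# A hedged sketch: read the CSV with its first line as the header.
# The path is hypothetical -- substitute your actual source.
usdata = spark.read.csv(
    "/path/to/usdata.csv",  # hypothetical path
    header=True,            # use the file's first line as column names
    inferSchema=True        # optional: let Spark guess column types
)
usdata.show()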

DineshKumar
New Contributor III

This is not reading the data from a file. I have a dataframe like the one above. I just have to convert the first row into column names for the dataframe. That's it.

User16857282152
Contributor

My point was that you are asking for column names from what you consider to be the "first row". At scale, or if the data volume grows, what you consider to be the "first row" may no longer actually be the first row. Unless the data is sorted, "first row" is not a meaningful term.

My point is, Spark is (or can be) a distributed process, and if this data is partitioned into two slices, what at that point is the meaning of "first row"?

That said, for small data from a single source, it often is consistent.

So here is some code that selects a single row; we can call it the "first row" if you want. It takes that single row and builds a list of column names.

Then it takes the schema (column names) from the original dataframe and rewrites it to use the values from the "first row".

Then it creates a new dataframe from the old one by dropping down to the underlying RDD.

This code works, but you have been warned: at some point this data was read from a file, and specifying the proper settings when reading that file is the safer approach.

If you cannot get to that, then perhaps add a filter before the limit to make sure you get the row you expect (see the note in the code below).

I put all the code into a single Python cell because this application makes a mess of multiple code blocks.


####
# IGNORE ALL OF THIS, till you see #PAY ATTENTION
# This is just internal stuff to show you a working example
# I build a dataframe here; you ALREADY have a dataframe
#####
# Set a SCHEMA for the test case
from pyspark.sql.types import StructField, StructType, StringType
schema = StructType([
    StructField('c1', StringType(), True),
    StructField('c2', StringType(), True)
])
# Create some data for the test case
data = [("ID", "Name"),
        ("5", "Joe"),
        ("6", "Mary")]
rdd = spark.sparkContext.parallelize(data,1)
df = spark.createDataFrame(rdd,schema)
# You now have a dataframe that is similar to your issue
# +---+----+
# | c1|  c2|
# +---+----+
# | ID|Name|
# |  5| Joe|
# |  6|Mary|
# +---+----+
#####
# #PAY ATTENTION from here to the end
# You could just grab this code BELOW and rename your dataframe to df with
# df = <YOUR DF>
# And run the code below
######
# Grab the Schema, or column names
DFSchema = df.schema
# build a list of column names from the "first" row
list_of_new_column_names = []
for i in df.limit(1).collect()[0]:
  list_of_new_column_names.append(i)
# You now have a list of column names that was
# generated by iterating over the values of the "first" row
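# NOTE: limit(1) does not guarantee which row you get on a partitioned
# dataframe. If you know a value from the header row, a filter before the
# limit pins down the intended row. A sketch, where the column 'c1' and the
# literal "ID" come from this example and are assumptions:
#   header_row = df.filter(df.c1 == "ID").limit(1).collect()[0]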
# Change the Schema to the new values
for i,k in enumerate(DFSchema.fields):
  k.name = list_of_new_column_names[i]
# DFSchema now looks like this:
# StructType(List(StructField(ID,StringType,true),StructField(Name,StringType,true)))
# Apply the schema to the original dataframe
new_df = spark.createDataFrame(df.rdd, DFSchema)
new_df.show()
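
Note that the rename does not remove the original header row from the data; new_df above still contains ("ID", "Name") as a data row. A minimal sketch of dropping it, assuming the header value in the first column ("ID" in this example) does not also occur as real data:

# Drop the leftover header row; the column name and the literal "ID"
# are taken from this example and are assumptions.
new_df = new_df.filter(new_df["ID"] != "ID")
new_df.show()

As a side note, once you have list_of_new_column_names, the rename itself can also be done with df.toDF(*list_of_new_column_names), which avoids rewriting the schema by hand.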
