My point was that you are asking for column names from what you consider to be the "first row", and I am telling you that at scale, or as the data volume grows, what you consider to be the "first row" may no longer actually be the first row. Unless the data is sorted, "first row" is not a meaningful term.
My point is that Spark is, or can be, a distributed process, and if this data is partitioned into two slices, what is the meaning of "first row" at that point?
That said, for small data from a single source, it often is meaningful.
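If the order really matters, the only reliable way to make "first row" well defined is an explicit sort. A minimal sketch, assuming your data has some column you can order by (the name "some_ordering_column" is made up here):

# Sort first, so "first row" actually means something
first_row = df.orderBy("some_ordering_column").limit(1).collect()[0]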
So here is some code that selects a single row; we can call it the "first row" if you want. It takes that single row and builds a list of column names.
Then it takes the schema (column names) from the original dataframe and rewrites it to use the values from that "first row".
Then it creates a new dataframe from the old one by dropping down to the underlying RDD.
This code works, but you have been warned: at some point this data was read from a file, and specifying the proper settings when reading that file is a safer approach.
If you cannot do that, then consider a filter before the limit to make sure you get the row you expect. Both ideas are sketched just below.
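A minimal sketch of both; the file path, the CSV format, and the column/value names ("c1", "ID") are assumptions borrowed from the example further down, not from your actual setup:

# Safer: tell the reader the file has a header line, so Spark assigns
# the column names for you and there is no "first row" to fish out later
df = spark.read.csv("your_file.csv", header=True)
# Or, if you must start from the dataframe you already have, filter for the
# row you expect before the limit, instead of trusting row order
header_row = df.filter(df.c1 == "ID").limit(1).collect()[0]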
I put all of the main code into a single Python cell because this application makes a mess of multiple code blocks.
####
# IGNORE ALL OF THIS, until you see #PAY ATTENTION
# This is just internal stuff to show you a working example
# I build a dataframe here, you ALREADY have a dataframe
#####
# Set a SCHEMA for the test case
from pyspark.sql.types import StructField, StructType, StringType
schema = StructType([
    StructField('c1', StringType(), True),
    StructField('c2', StringType(), True)
])
# Create some data for the test case
data = [("ID", "Name"),
("5", "Joe"),
("6", "Mary")]
# One partition, so the rows stay in the order listed above
rdd = spark.sparkContext.parallelize(data, 1)
df = spark.createDataFrame(rdd, schema)
# You now have a dataframe that is similar to your issue
# +---+----+
# | c1| c2|
# +---+----+
# | ID|Name|
# | 5| Joe|
# | 6|Mary|
# +---+----+
#####
# #PAY ATTENTION from here to the end
# You could just grab this code BELOW and rename your dataframe to df with
# df = <YOUR DF>
# And run the code below
######
# Grab the Schema, or column names
DFSchema = df.schema
# build a list of column names from the "first" row
list_of_new_column_names = []
for i in df.limit(1).collect()[0]:
    list_of_new_column_names.append(i)
# You now have a list of column names that was
# generated by iterating over the values of the "first" row
# Change the Schema to the new values
for i, k in enumerate(DFSchema.fields):
    k.name = list_of_new_column_names[i]
# DFSchema now looks like this
#StructType(List(StructField(ID,StringType,true),StructField(Name,StringType,true)))
# Apply the schema to the original dataframe
new_df = spark.createDataFrame(df.rdd, DFSchema)
new_df.show()
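One thing to note: new_df still contains the old header row ("ID", "Name") as a data row, because the code above only rewrites the schema. If you want it gone, a filter on a value that only appears in that row does the job; a sketch against this example's data (it assumes "ID" appears nowhere else in that column):

# Drop the leftover header row
new_df = new_df.filter(new_df.ID != "ID")
new_df.show()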