Below is a solution that can be used.
Assume the file contains the following data:
EMP ID    First Name               Last Name
1         Chris                    M
2         John                     R
3         Amit                     C
The fixed-width layout of each line is:
EMP ID: starts at position 1, length 10 characters
First Name: starts at position 11, length 25 characters
Last Name: starts at position 36, length 25 characters
DOB: starts at position 61, length 30 characters
Each (start, length) pair maps directly to Spark's Column.substr(startPos, length), which uses 1-based positions.
Step 1:
Load the file into a dataframe with the options below. This creates a dataframe with a single string column holding each raw line:
df_schema = "single_col string"
df = (spark.read
      .option("header", False)
      .schema(df_schema)
      .csv(file_path)
      )
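To sanity-check the load, display a few rows (a quick check; note that with header set to False, a header line present in the file will show up as an ordinary data row):
df.show(5, truncate=False)    # each row holds one raw fixed-width line in single_col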
Step 2:
Now the single column has to be split into three columns. There are two ways to do that:
Option 1: Create the columns one by one manually
df = (df.withColumn("EMP_ID", df.single_col.substr(1, 10))
        .withColumn("First Name", df.single_col.substr(11, 25))
        .withColumn("Last Name", df.single_col.substr(36, 25))
      )
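The same pattern extends to the DOB field, if the file carries it (a small sketch using the DOB start position and length given in the layout above):
df = df.withColumn("DOB", df.single_col.substr(61, 30))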
Option 2: Create the columns programmatically
Define the column name, start position, and length details:
col_details = [('EMP_ID', (1, 10)), ('First Name', (11, 25)), ('Last Name', (36, 25))]
Then loop over the details to build each column:
for col_name, (start, length) in col_details:
    df = df.withColumn(col_name, df.single_col.substr(start, length))
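Putting both steps together, here is a minimal self-contained sketch that also includes the DOB field and trims the space padding that fixed-width fields usually carry (file_path and the trim step are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# (name, (start position, length)) pairs taken from the fixed-width layout
col_details = [('EMP_ID', (1, 10)),
               ('First Name', (11, 25)),
               ('Last Name', (36, 25)),
               ('DOB', (61, 30))]

# read each raw line into a single string column
df = (spark.read
      .option("header", False)
      .schema("single_col string")
      .csv(file_path))            # file_path is assumed to be defined

# slice out each fixed-width field and strip its space padding
for col_name, (start, length) in col_details:
    df = df.withColumn(col_name, F.trim(df.single_col.substr(start, length)))

df = df.drop("single_col")
df.show(truncate=False)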