Nagendra
Databricks Partner

Find the below solution which can be used.

Let us consider this is the data in the file.

EMP ID   First Name              Last Name               

        1Chris                   M                                             

        2John                    R                                            

        3Amit                    C                                            

EMP ID: starts at 1 to 10 characters

First Name:  starts at 11 to 25 characters

Last name: starts at 36 to 25 characters

DOB: starts at 61 to 30 characters.

Step 1:

Load the file into a dataframe with the below options, this will create a data frame with only one column and all rows:

DFSchema = "single_col string"

Df = (spark.read

                            .option("header",False)

                           .schema(DFSchema)

                           .csv(file_path)

                 )

Step 2:

Now the single column has to be broken into 3 columns. Follow the steps below

Option 1: Create column by column manually

Df = (Df.withColumn(“EMP_ID”, Df.single_col.substr(1,10))

             .withColumn(“First Name”, Df.single_col.substr (11,25))

.withColumn(“Last Name”, Df.single_col.substr(36,25))

)

Option 2: Create columns pro-grammatically

Define the column length details:

Col_Details = [(‘EMP_ID’,(1,10),(‘ First Name’,(11,25)),(‘Last Name’,(36,25))]

Using loop separate the column programmatically:

for col_info in a:

 df = df.withColumn(col_info[0],DataFrameReader.value.substr(col_info[1][0],col_info[1][1]))