Re: How to read a fixed length file in Spark using...

Nagendra · ‎10-29-2021

The solution below can be used.

Suppose the file contains the following data (each field is padded with spaces to its fixed width; the padding is not shown here).

EMP ID First Name Last Name

1Chris M

2John R

3Amit C

EMP ID: starts at position 1, 10 characters long

First Name: starts at position 11, 25 characters long

Last Name: starts at position 36, 25 characters long

DOB: starts at position 61, 30 characters long
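To make the layout concrete, here is a minimal plain-Python sketch (no Spark needed) that slices one padded line using the same 1-based start position and length that substr takes in Step 2. The LAYOUT list, the parse_line helper, and the sample line are illustrative assumptions, not part of the original file.

```python
# Field layout as (name, 1-based start position, length), taken from the list above.
LAYOUT = [
    ("EMP_ID", 1, 10),
    ("First Name", 11, 25),
    ("Last Name", 36, 25),
]

def parse_line(line, layout=LAYOUT):
    """Slice a fixed-width line into a dict of trimmed field values."""
    record = {}
    for name, start, length in layout:
        # Convert the 1-based start into a 0-based Python slice.
        record[name] = line[start - 1 : start - 1 + length].strip()
    return record

# Hypothetical sample line: each field is padded with spaces to its width.
sample = "1".ljust(10) + "Chris".ljust(25) + "M".ljust(25)
print(parse_line(sample))
# {'EMP_ID': '1', 'First Name': 'Chris', 'Last Name': 'M'}
```

Note that the values come back trimmed here; in Spark the substrings keep their padding unless you trim them as well.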

Step 1:

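Load the file into a DataFrame with the options below. This creates a DataFrame with a single string column in which each row holds one full line of the file (this assumes the lines contain no commas, since the CSV reader would otherwise split on them):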

DFSchema = "single_col string"

Df = (spark.read
      .option("header", False)
      .schema(DFSchema)
      .csv(file_path)
)

Step 2:

Now the single column has to be split into 3 columns. Use either of the options below.

Option 1: Create the columns one by one manually

Df = (Df.withColumn("EMP_ID", Df.single_col.substr(1, 10))
      .withColumn("First Name", Df.single_col.substr(11, 25))
      .withColumn("Last Name", Df.single_col.substr(36, 25))
)

Option 2: Create the columns programmatically

Define the column length details:

Col_Details = [("EMP_ID", (1, 10)), ("First Name", (11, 25)), ("Last Name", (36, 25))]

Use a loop to split out the columns programmatically:

for col_info in Col_Details:
    # col_info[0] is the column name, col_info[1] is (start position, length)
    Df = Df.withColumn(col_info[0], Df.single_col.substr(col_info[1][0], col_info[1][1]))
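If only the field widths are known, the start positions in Col_Details do not have to be typed by hand. Below is a minimal sketch, assuming the fields are contiguous with no gaps between them; widths and build_col_details are hypothetical names introduced here for illustration.

```python
from itertools import accumulate

# Field names and widths, in file order (assumes no gaps between fields).
widths = [("EMP_ID", 10), ("First Name", 25), ("Last Name", 25)]

def build_col_details(widths):
    """Return [(name, (start, length)), ...] with 1-based start positions."""
    lengths = [length for _, length in widths]
    # Running totals of the preceding widths give each field's 0-based offset.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return [(name, (offset + 1, length))
            for (name, length), offset in zip(widths, offsets)]

Col_Details = build_col_details(widths)
print(Col_Details)
# [('EMP_ID', (1, 10)), ('First Name', (11, 25)), ('Last Name', (36, 25))]
```

The resulting list has the same shape as the hand-written one, so it can be passed to the loop above unchanged.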