
When reading a CSV file with spark.read, the data is not loading into the appropriate columns while passing the schema

Anonymous
Not applicable

I am trying to read a CSV file from a storage location using the spark.read function, explicitly passing the schema. However, the data is not loading into the proper columns of the DataFrame. Here are the code details:

from pyspark.sql.types import StructType, StructField, StringType, DateType

# Define the schema
schema = StructType([
    StructField('TRANSACTION', StringType(), True),
    StructField('FROM', StringType(), True),
    StructField('TO', StringType(), True),
    StructField('DA_RATE', DateType(), True),
    StructField('CURNCY_F', StringType(), True),
    StructField('CURNCY_T', StringType(), True)
])

# Read the CSV file with the specified schema
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .schema(schema) \
    .load("abfss://xyz@abc.dfs.core.windows.net/my/2024-04-10/abc_2*.csv")

**Data in the CSV file**

DA_RATE|CURNCY_F|CURNCY_T
2024-02-26|AAA|MMM
2024-02-26|AAA|NNN
2024-02-26|BBB|YYY
2024-02-26|CCC|KKK
2024-02-27|DDD|SSS


**Output I am getting**

TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T
2024-02-26 AAA MMM null null null
2024-02-26 AAA NNN null null null
2024-02-26 BBB YYY null null null
2024-02-26 CCC KKK null null null

**Output I am expecting**

TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T
null null null 2024-02-26 AAA MMM
null null null 2024-02-26 AAA NNN
null null null 2024-02-26 BBB YYY

 

2 REPLIES

ThomazRossito
Contributor

Hi,

I noticed that the schema you are creating has more columns than your CSV file; I understand the final result needs to include all 6 columns.

I would use withColumn for the 3 columns that do not exist in the file.

Below is an example:

 

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .load("pathFile")
    # Add the 3 columns missing from the file as real NULLs (lit(None)), not the string 'null'
    .withColumn("TRANSACTION", lit(None).cast(StringType()))
    .withColumn("FROM", lit(None).cast(StringType()))
    .withColumn("TO", lit(None).cast(StringType()))
)
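One caveat with this approach: since no schema is passed to the reader, every column is loaded as a string, so DA_RATE would still need a cast afterwards. A minimal sketch of that follow-up cast (the target type is an assumption taken from the schema in the question):

from pyspark.sql.functions import col
from pyspark.sql.types import DateType

# All columns arrive as strings when no schema is supplied and inferSchema is off;
# cast DA_RATE to DateType to match the intended schema (assumed from the question)
df = df.withColumn("DA_RATE", col("DA_RATE").cast(DateType()))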
 
Hope this helps!
Regards,
Thomaz Antonio Rossito Neto
Master Data Specialist - Data Architect | Data Engineer @ CI&T

sai_sathya
New Contributor III

Hi, I would suggest the approach from Thomaz Rossito,

but you could also try swapping the struct field order like the following; Spark maps a user-supplied schema to CSV columns by position, not by header name, so listing the 3 file columns first makes them line up correctly:

schema = StructType([
    StructField('DA_RATE', DateType(), True),
    StructField('CURNCY_F', StringType(), True),
    StructField('CURNCY_T', StringType(), True),
    StructField('TRANSACTION', StringType(), True),
    StructField('FROM', StringType(), True),
    StructField('TO', StringType(), True)
])

If you want the selection order as defined in your original schema, you can adjust the DataFrame afterwards by using .select() with your preferred column order, which gives you a new DataFrame, as sketched below.
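A minimal sketch of that .select() reordering, assuming df was read with the swapped schema above:

# Reorder columns back to the original 6-column layout; only the column order
# changes, the row values stay where they are
df_ordered = df.select("TRANSACTION", "FROM", "TO", "DA_RATE", "CURNCY_F", "CURNCY_T")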
