Data Engineering
Date field getting changed when reading from excel file to dataframe

sreedata
New Contributor III

The date field changes when I read data from a source .xls file into a dataframe. In the source Excel file all columns are strings, but I am not sure why the date column alone behaves differently.

In the source file, the date is 1/24/2022.

In the dataframe, it is 1/24/22.

Code used:

import glob

import pandas as pd

# PathSource is the source directory, defined elsewhere in the notebook
filenames = glob.glob(PathSource + "/*.xls")

dfs = []

for filename in filenames:
  # Read Sheet1 of each workbook into a pandas DataFrame
  xl_file = pd.ExcelFile(filename)
  dfs.append(xl_file.parse('Sheet1'))

# Combine the sheets from all files into a single DataFrame
df = pd.concat(dfs, ignore_index=True)

display(df)

Thanks in advance for any help or guidance.

1 ACCEPTED SOLUTION

merca
Valued Contributor II

@srikanth nair, have you checked the output in pandas? You could pass parse_dates=False so that pandas ignores dates. By default, pandas uses dateutil.parser.parser to parse them.
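If the column still comes back as a datetime value, another option is to format it explicitly so the four-digit year is kept. The snippet below is only a minimal sketch using the standard library; the helper name format_like_source is made up for illustration, and it assumes the cell has already been read as a datetime. In pandas, a similar result should be possible with Series.dt.strftime on the date column.

```python
from datetime import datetime


def format_like_source(value: datetime) -> str:
    """Render a datetime as M/D/YYYY, e.g. 1/24/2022 (hypothetical helper)."""
    # Build the string by hand: strftime's %m and %d would zero-pad
    # the month and day, which the source file does not do.
    return f"{value.month}/{value.day}/{value.year}"


# Assumed value: what a date cell might look like after being read.
parsed = datetime(2022, 1, 24)
print(format_like_source(parsed))  # prints 1/24/2022
```

Formatting from the datetime value avoids depending on whatever short display format the reader or notebook picks.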

5 REPLIES

Kaniz
Community Manager

Hi @srikanth nair, just a friendly follow-up. Do you still need help, or did @Merca Ovnerud's response help you find the solution? Please let us know.

Anonymous
Not applicable

Hi @sreedata (Customer), just a friendly follow-up. Do you still need help, or did @merca (Customer)'s response help you find the solution? Please let us know.

sreedata
New Contributor III

Working fine now, thanks.

Pradeep_Namani
New Contributor III

Hi Team, @Merca Ovnerud

I am facing the same issue. Below is the code snippet I am using:

df=spark.read.format("com.crealytics.spark.excel").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")

I have a couple of date columns. They all show in dd/mm/yy format, but they need to come through in dd/mm/yyyy format.

Source file has: 26-03-1950

Dataframe has: 26-03-50

I have used parse_dates=False, but it is not working. Can anyone help with this?
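One thing to note: parse_dates is an argument of the pandas Excel reader, so it most likely has no effect on the Spark reader used above. Also, once the year has been shortened to two digits, the century has to be guessed when converting back. The sketch below uses only the Python standard library; the helper name expand_two_digit_year and the latest_year cutoff are assumptions made for illustration, not part of any library. Python's %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999, so 50 would become 2050 unless it is corrected.

```python
from datetime import datetime


def expand_two_digit_year(text: str, latest_year: int) -> str:
    """Turn 'dd-mm-yy' into 'dd-mm-yyyy' (hypothetical helper).

    latest_year is an assumed cutoff: any year that parses as later
    than this is moved back one century.
    """
    parsed = datetime.strptime(text, "%d-%m-%y")
    year = parsed.year
    if year > latest_year:
        # %y placed the year in the 2000s; assume the 1900s instead.
        year -= 100
    return f"{parsed.day:02d}-{parsed.month:02d}-{year:04d}"


print(expand_two_digit_year("26-03-50", latest_year=2022))  # prints 26-03-1950
```

This only works when a sensible cutoff exists for the data (for example, birth dates cannot be in the future). Reading the original four-digit value from the source is safer than rebuilding it afterwards.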
