03-31-2022 06:47 AM
The date field is getting changed while reading data from the source .xls file into the dataframe. In the source Excel file all columns are strings, but I am not sure why the date column alone behaves differently.
In the source file the date is 1/24/2022.
In the dataframe it is 1/24/22.
Code used:
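from pyspark.sql.functions import *
import pyspark.sql.functions as sf
import pyspark.sql.types
import pandas as pd
import os
import glob

# PathSource is the source folder, defined earlier in the notebook
filenames = glob.glob(PathSource + "/*.xls")
dfs = []
for filename in filenames:
    # Read 'Sheet1' from each workbook
    xl_file = pd.ExcelFile(filename)
    dfs.append(xl_file.parse('Sheet1'))
# Combine all files into a single dataframe
df = pd.concat(dfs, ignore_index=True)
display(df)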
Thanks in advance for any help or guidance.
04-02-2022 10:31 AM
@srikanth nair, have you checked the output in pandas? You could also pass parse_dates=False to ignore dates. Pandas uses dateutil.parser.parser by default.
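To make the "check the output in pandas" step concrete, here is a minimal sketch. It uses a small hand-built DataFrame in place of the Excel file, and the column name `date` and the sample values are assumptions for illustration only. It shows how to see whether the column arrived as text or as a datetime, and how to render a four-digit year if the value arrived with a two-digit one.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from the .xls file
df = pd.DataFrame({"date": ["1/24/22", "12/5/21"]})

# 'object' here means the column holds plain strings, not datetimes
print(df.dtypes)

# Parse the two-digit-year strings, then format with a four-digit year
parsed = pd.to_datetime(df["date"], format="%m/%d/%y")
df["date_full"] = parsed.dt.strftime("%m/%d/%Y")
print(df)
```

Note that `%m` and `%d` produce zero-padded values, so 1/24/22 becomes 01/24/2022 rather than 1/24/2022.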
04-26-2022 09:18 AM
Hi @srikanth nair, just a friendly follow-up. Do you still need help, or did @Merca Ovnerud's response help you find the solution? Please let us know.
05-18-2022 11:16 PM
05-19-2022 06:45 AM
Working fine now, thanks.
11-17-2022 06:56 AM
Hi Team, @Merca Ovnerud
I am facing the same issue. Below is the code snippet I am using:
df=spark.read.format("com.crealytics.spark.excel").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")
I have a couple of date columns. All of them show in dd/mm/yy format, but they should come through as dd/mm/yyyy.
Source file has: 26-03-1950
Dataframe has: 26-03-50
I have used parse_dates=False but it is not working. Can anyone help with this?
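One thing to watch for with a value like 26-03-50: once only two year digits are left, the century has to be guessed. Python's `%y` maps 00-68 to 2000-2068, so 50 parses as 2050, not 1950. The following is only a sketch of one workaround in plain Python. The function name `expand_two_digit_year` and the `latest_year` cutoff are hypothetical, and it assumes no date in the data is later than that cutoff year.

```python
from datetime import datetime

def expand_two_digit_year(text, latest_year=2022, fmt="%d-%m-%y"):
    """Return a dd-mm-yyyy string for a dd-mm-yy input.

    latest_year is an assumed cutoff: any parsed year after it is
    moved back one century.
    """
    parsed = datetime.strptime(text, fmt)
    if parsed.year > latest_year:
        # %y read the value as 20xx; shift it to 19xx
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.strftime("%d-%m-%Y")

print(expand_two_digit_year("26-03-50"))  # 26-03-1950
print(expand_two_digit_year("24-01-22"))  # 24-01-2022
```

This only repairs strings after the fact. If the source cells still hold the full date, reading them without the two-digit display format is the safer route.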