03-31-2022 06:47 AM
The date field changes when I read data from the source .xls file into the dataframe. In the source Excel file all columns are strings, but I am not sure why the date column alone behaves differently.
In the source file the date is 1/24/2022.
In the dataframe it is 1/24/22.
Code used:
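from pyspark.sql.functions import *
import pyspark.sql.functions as sf
import pyspark.sql.types
import pandas as pd
import os
import glob

# PathSource is the folder that holds the source .xls files (defined elsewhere)
filenames = glob.glob(PathSource + "/*.xls")
dfs = []
for filename in filenames:
    # Parse Sheet1 of each workbook and collect the result
    xl_file = pd.ExcelFile(filename)
    dfs.append(xl_file.parse('Sheet1'))
# Combine all sheets into a single dataframe
df = pd.concat(dfs, ignore_index=True)
display(df)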
Thanks in Advance for any help or guidance.
04-02-2022 10:31 AM
@srikanth nair, have you checked the output in pandas? You could also try passing parse_dates=False to skip date parsing. Pandas uses dateutil.parser.parser by default.
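To make the "check the output in pandas" step concrete, here is a minimal sketch. It assumes the column came back from the Excel reader as a datetime; the in-memory dataframe and the column name order_date are hypothetical stand-ins for the parsed sheet, not taken from the original file.

```python
import pandas as pd

# Hypothetical stand-in for the result of xl_file.parse('Sheet1');
# "order_date" is an assumed column name used only for illustration.
df = pd.DataFrame({"order_date": pd.to_datetime(["2022-01-24", "2022-02-03"])})

# Step 1: inspect what pandas inferred for each column.
print(df.dtypes)

# Step 2: if the column is a datetime, render it as text with an explicit
# four-digit year so the display format no longer decides how it looks.
if pd.api.types.is_datetime64_any_dtype(df["order_date"]):
    df["order_date"] = df["order_date"].dt.strftime("%m/%d/%Y")

print(df["order_date"].tolist())  # ['01/24/2022', '02/03/2022']
```

Note that strftime zero-pads the month and day, so the result is 01/24/2022 rather than 1/24/2022. If df.dtypes shows the column as object instead, the values are already strings and the two-digit year was introduced before pandas saw them.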
04-26-2022 09:18 AM
Hi @srikanth nair, just a friendly follow-up. Do you still need help, or did @Merca Ovnerud's response help you find the solution? Please let us know.
05-18-2022 11:16 PM
05-19-2022 06:45 AM
Working fine now, thanks.
11-17-2022 06:56 AM
Hi Team, @Merca Ovnerud
I am also facing the same issue. Below is the code snippet I am using:
df=spark.read.format("com.crealytics.spark.excel").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")
I have a couple of date columns. All of them show up in dd/mm/yy format, but they should come through as dd/mm/yyyy.
Source file has: 26-03-1950
Dataframe has: 26-03-50
I have used parse_dates=False but it is not working. Can anyone help with this?
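As a side note on why the missing century matters for a value like this one, here is a small sketch using only the Python standard library. It says nothing about what the Excel reader does internally; it only shows what happens when a two-digit year is parsed back into a date later.

```python
from datetime import datetime

# Two-digit year: Python's %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999,
# so "50" is read as 2050, not 1950.
two_digit = datetime.strptime("26-03-50", "%d-%m-%y")

# Four-digit year: the century is explicit, so nothing has to be guessed.
four_digit = datetime.strptime("26-03-1950", "%d-%m-%Y")

print(two_digit.date())                 # 2050-03-26
print(four_digit.date())                # 1950-03-26
print(four_digit.strftime("%d-%m-%Y"))  # 26-03-1950
```

So if the dataframe value 26-03-50 is converted to a date downstream, it can land a full century away from the source value, which is why keeping the four-digit year at read time matters.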