Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Date field getting changed when reading from excel file to dataframe

sreedata
New Contributor III

The date field changes when I read data from a source .xls file into a dataframe. In the source Excel file all columns are strings, and I am not sure why the date column alone behaves differently.

In the source file the date is 1/24/2022.

In the dataframe it is 1/24/22.

Code used:

import glob

import pandas as pd

# PathSource is the source folder, defined elsewhere in the notebook
filenames = glob.glob(PathSource + "/*.xls")

dfs = []

# Read Sheet1 from every workbook found in the folder
for filename in filenames:
  xl_file = pd.ExcelFile(filename)
  dfs.append(xl_file.parse('Sheet1'))

# Combine the per-file frames into a single pandas DataFrame
df = pd.concat(dfs, ignore_index=True)

display(df)

Thanks in Advance for any help or guidance.

1 ACCEPTED SOLUTION

Accepted Solutions

merca
Valued Contributor II

@srikanth nair, have you checked the output in pandas itself? If the dates are already being converted there, try passing parse_dates=False so that pandas does not parse them. By default, pandas uses dateutil.parser.parser to parse dates.
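A minimal sketch of the check suggested above, assuming pandas is installed together with an Excel engine for the real file. `read_sheet` passes `parse_dates=False` as proposed in the reply; note that `False` is already the documented default of `pd.read_excel`, and cells that Excel stores as dates may still arrive as datetime values. In that case one option (a swapped-in technique, not the one from the reply) is to render the column back to text with an explicit four-digit-year format. The helper names, the column name `order_date`, and the sample frame are hypothetical and only stand in for the real data.

```python
import pandas as pd


def read_sheet(path: str, sheet: str = "Sheet1") -> pd.DataFrame:
    """Read one sheet with parse_dates=False, as suggested in the reply."""
    return pd.read_excel(path, sheet_name=sheet, parse_dates=False)


def dates_to_text(df: pd.DataFrame, fmt: str = "%m/%d/%Y") -> pd.DataFrame:
    """Return a copy in which every datetime column is rendered as text."""
    out = df.copy()
    for col in out.select_dtypes(include=["datetime"]).columns:
        # %Y always writes the full four-digit year; %m and %d are zero-padded
        out[col] = out[col].dt.strftime(fmt)
    return out


# Hypothetical stand-in for a frame returned by read_sheet(...)
sample = pd.DataFrame({"order_date": pd.to_datetime(["2022-01-24"]), "qty": [3]})
print(sample.dtypes)  # shows whether the date column came in as datetime or as text
result = dates_to_text(sample)
print(result)
```

Checking `dtypes` first shows whether the short year is a display matter (the column is a datetime) or whether the value really is the string 1/24/22.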

4 REPLIES 4

Anonymous
Not applicable

Hi @sreedata (Customer), just a friendly follow-up. Do you still need help, or did @merca (Customer)'s response help you find the solution? Please let us know.

sreedata
New Contributor III

Working fine now, thanks.

Pradeep_Namani
New Contributor III

Hi Team, @Merca Ovnerud

I am facing the same issue. Below is the code snippet I am using:

df=spark.read.format("com.crealytics.spark.excel").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")

I have a couple of date columns, and all of them show up in dd/mm/yy format, but they should come through as dd/mm/yyyy.

Source file has: 26-03-1950

Dataframe has: 26-03-50

I tried parse_dates=False but it is not working. Can anyone help with this?
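One likely reason parse_dates=False has no effect here is that it is an argument of the pandas Excel readers, not an option of the Spark reader used in the snippet above. If the two-digit years have already landed in the dataframe as text, a possible workaround is to expand them with an explicit century rule. The sketch below is plain Python and rests on assumptions: the function name and the `latest_year` cutoff are hypothetical, and the cutoff has to be chosen to fit the data. The rule matters because Python's `%y` maps 00-68 to 2000-2068, so 26-03-50 would otherwise be read as 2050 rather than 1950.

```python
from datetime import datetime


def expand_two_digit_year(text: str, fmt: str = "%d-%m-%y",
                          latest_year: int = 2022) -> str:
    """Rewrite a dd-mm-yy string as dd-mm-yyyy.

    latest_year is an assumed cutoff: any year that parses to something
    later is taken to belong to the previous century.
    """
    parsed = datetime.strptime(text, fmt).date()
    if parsed.year > latest_year:
        # e.g. 2050 -> 1950 when no date after latest_year is expected
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.strftime("%d-%m-%Y")


print(expand_two_digit_year("26-03-50"))
print(expand_two_digit_year("24-01-22"))
```

A two-digit year is ambiguous by nature, so this only recovers the original value when the cutoff matches the data. Reading the full year at the source remains the safer fix.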
