Date field changes when reading from an Excel file into a DataFrame in PySpark
11-23-2022 10:40 PM
The date field changes when I read data from the source Excel file into a DataFrame. In the source Excel file all columns are strings, but I am not sure why the date column alone behaves differently.
In the source file the date is 1/24/1947.
In the PySpark DataFrame it is 1/24/47.
Code used:
df=spark.read.format("com.crealytics.spark.excel").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")
If I use option("inferSchema","true") the data comes through correctly, but I don't want to use inferSchema. Can anyone suggest a solution?
Thanks in advance
Labels: Date, Date Field
11-23-2022 10:52 PM
Hi @Pradeep Namani,
Could you please try running the code below? I hope it will work without inferSchema.
df=spark.read.format("csv").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")
11-23-2022 11:12 PM
Thank you @Yogita Chavan for replying, but when I read the file as CSV, all the data shows up in a different format. I am attaching a screenshot.
11-23-2022 11:13 PM
You can also refer to the article below:
https://mayur-saparia7.medium.com/reading-excel-file-in-pyspark-databricks-notebook-c75a63181548
11-24-2022 03:08 AM
I tried the option given in the URL above, but it did not help. I am still facing the same issue.
11-24-2022 02:37 AM
How about using inferSchema a single time to create a correct DataFrame, then creating a schema from that DataFrame's schema?
Something like this, for example:
import json
from pyspark.sql.types import StructType
# Save the schema of the original DataFrame (read once with inferSchema) as JSON:
schema_json = df.schema.json()
# Restore the schema from JSON:
new_schema = StructType.fromJson(json.loads(schema_json))