11-23-2022 10:40 PM
The date field changes when I read data from the source Excel file into a DataFrame. In the source file all columns are strings, but I am not sure why the date column alone behaves differently.
In the source file the date is 1/24/1947.
In the PySpark DataFrame it is 1/24/47.
Code used:
df=spark.read.format("com.crealytics.spark.excel").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")
If I use option("inferSchema","true") the data comes through properly, but I don't want to use inferSchema. Can anyone suggest a solution?
Thanks in advance
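To make the symptom concrete: once the year has been shortened to two digits, the century is ambiguous, and Python's own `%y` parsing would read 47 as 2047. The sketch below is one possible workaround in plain Python, not something taken from the Excel reader. The helper name `expand_two_digit_year` is hypothetical, and it assumes every date in the column lies in the past relative to a reference day.

```python
from datetime import date, datetime


def expand_two_digit_year(text, today):
    """Turn an M/D/YY string into M/D/YYYY.

    Assumption: the real date is never later than `today`, so a parsed
    date that lands in the future is moved back by one century.
    """
    parsed = datetime.strptime(text, "%m/%d/%y").date()
    if parsed > today:
        parsed = parsed.replace(year=parsed.year - 100)
    return f"{parsed.month}/{parsed.day}/{parsed.year}"


# Reference day chosen to match the date of this thread.
print(expand_two_digit_year("1/24/47", date(2022, 11, 23)))
```

This only repairs the string after the fact. If the column can hold future dates, the "not later than today" assumption no longer holds and the century cannot be recovered this way.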
11-23-2022 10:52 PM
Hi @Pradeep Namani,
Could you please try running the code below? I hope it will work without inferSchema.
df=spark.read.format("csv").option("header","true").load("/mnt/dataplatform/Tenant_PK/Results.xlsx")
11-23-2022 11:12 PM
11-23-2022 11:13 PM
You can also refer to the article below:
https://mayur-saparia7.medium.com/reading-excel-file-in-pyspark-databricks-notebook-c75a63181548
11-24-2022 03:08 AM
11-24-2022 02:37 AM
How about using inferSchema a single time to create a correct DataFrame, then creating a schema from that DataFrame's schema?
Something like this, for example:
from pyspark.sql.types import StructType
# Save schema from the original DataFrame into json:
schema_json = df.schema.json()
# Restore schema from json:
import json
new_schema = StructType.fromJson(json.loads(schema_json))