Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I preserve the data types of a Delta table while writing to Azure Blob Storage?

Tiwarisk
New Contributor II
I am writing a file using the code below, but the data types of the columns get changed when I read it back.
 df.write.format("com.crealytics.spark.excel").option("header", "true").mode("overwrite").save(path)
 
Because of this I have to manually change the types every time, as I can't change the data types of my target. Is there any way I can write my file with the same data types?
 
 
[DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'quantity_volume' and 'quantity_volume' SQLSTATE: 22005

File <command-788927452842747>, line 6
      4 df.printSchema()
      5 df.count()
----> 6 df.write.mode("overwrite").saveAsTable("cc.pg_vendor_product_id_hierarchy")

File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
    226 converted = convert_exception(e.java_exception)
    227 if not isinstance(converted, UnknownException):
    228     # Hide where the exception came from that shows a non-Pythonic
    229     # JVM exception message.
--> 230     raise converted from None
    231 else:
    232     raise
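
The error above typically means the incoming DataFrame's column type no longer matches the type already stored in the target Delta table. A minimal sketch of one workaround, assuming the table cc.pg_vendor_product_id_hierarchy already exists and quantity_volume is the only mismatched column, is to cast that column to the target's type before overwriting:

from pyspark.sql import functions as F

# Look up the type the existing Delta table expects for the mismatched column.
target_type = spark.table("cc.pg_vendor_product_id_hierarchy").schema["quantity_volume"].dataType

# Cast the incoming column to that type so Delta doesn't have to merge two incompatible fields.
df_fixed = df.withColumn("quantity_volume", F.col("quantity_volume").cast(target_type))

df_fixed.write.mode("overwrite").saveAsTable("cc.pg_vendor_product_id_hierarchy")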
 
4 REPLIES

ashraf1395
New Contributor III

Hi there @Tiwarisk, if this is the main issue:


@Tiwarisk wrote:
I am writing a file using the code below, but the data types of the columns get changed when I read it back.
 

You can explicitly specify your table schema like this:

 

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", IntegerType(), True),
    StructField("column3", DoubleType(), True)
])

 

Then you can read the Excel file like this:

 

# Read the Excel file with the specified schema instead of inferring it
df = (
    spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .schema(schema)  # specify the schema here
    .load(path)
)

 

After this, the write shouldn't cause trouble. When writing data to an Excel file using the `com.crealytics.spark.excel` format, you might encounter issues where the data types of the columns are altered. This happens because the Excel format doesn't natively support all Spark data types, and the conversion might not be perfect.
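
To keep that conversion predictable, one option is to cast the columns Excel can't represent exactly before writing. A minimal sketch, assuming purely for illustration that quantity_volume is a decimal and that there is a timestamp column called load_ts (neither assumption comes from this thread):

from pyspark.sql import functions as F

# Excel cells only really hold text, numbers and dates, so high-precision or
# exotic Spark types tend to come back changed. Casting them up front keeps
# the round trip through the file predictable.
df_for_excel = (
    df
    .withColumn("quantity_volume", F.col("quantity_volume").cast("double"))  # e.g. decimal -> double
    .withColumn("load_ts", F.col("load_ts").cast("string"))                  # e.g. timestamp -> string
)

df_for_excel.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .mode("overwrite") \
    .save(path)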


@Tiwarisk wrote:
 df.write.format("com.crealytics.spark.excel").option("header", "true").mode("overwrite").save(path)
 

 

jacovangelder
Contributor III

I checked the library you are using to write to Excel and it seems there is a new version available that has improved data type handling.

https://github.com/crealytics/spark-excel

 

  • Spark-Excel V2 with data source API V2.0+, which supports loading from multiple files, corrupted record handling and some improvement on handling data types. See below for further details

To use the V2 implementation, just change your format from .format("com.crealytics.spark.excel") to .format("excel").

Check the GitHub README for details. If your DataFrame has the same data types as the Excel table, I'm hoping this gives you more luck.
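
A minimal sketch of what that change looks like, reusing the path and the explicit schema from the earlier reply (both are assumptions carried over from above, not new details of the V2 API):

# Same read and write as before; only the format string changes to the V2 data source.
df = (
    spark.read
    .format("excel")
    .option("header", "true")
    .schema(schema)
    .load(path)
)

df.write.format("excel") \
    .option("header", "true") \
    .mode("overwrite") \
    .save(path)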

 

UmaMahesh1
Honored Contributor III

Do you need to write the data again in Excel format? Do you need it in that format? If yes, while reading the Excel file back, are you inferring the schema of the file?

Tiwarisk
New Contributor II

Yes, inferSchema is true.
