Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Precision Variance Observed in FLOAT to DOUBLE Data Migration to Delta Tables

bunny_9090
New Contributor

Hi Team,

We would like to bring to your attention a precision-related variance observed during data migration from our legacy platform into Databricks Delta tables.

In the legacy system, several numeric columns are defined using the FLOAT data type. During ingestion into the data lake, these values are written to Parquet format and interpreted as DOUBLE precision. The same DOUBLE representation is then used when loading the data into Delta tables.

While there is no data loss at the row level, we are noticing very small differences in decimal precision after the conversion from FLOAT (legacy) to DOUBLE (Databricks). These differences are typically at the far decimal places but become noticeable during aggregations such as SUM, where the final result differs slightly from the totals calculated in the legacy platform.

Our understanding is that this behaviour is related to IEEE 754 floating-point representation differences and how intermediate rounding is handled across systems during format conversion and aggregation. Since FLOAT and DOUBLE are both approximate numeric types, slight binary representation changes during the migration path (Legacy FLOAT → Parquet DOUBLE → Delta DOUBLE) appear to be introducing this minor variance.

We would appreciate your guidance on the following:

• Recommended best practices to minimise precision drift when ingesting FLOAT data into Delta
• Whether explicit casting to DECIMAL during ingestion would be advisable for such columns
• Any Databricks-specific configuration or optimisation that can help maintain consistent aggregation results

Please note that record counts and row-level values remain aligned, and this issue only affects aggregated totals at a very small precision level.

Looking forward to your inputs.

Regards,
Divyansh Chouhan

1 REPLY

SteveOstrowski
Databricks Employee

Hi @bunny_9090,

Your analysis of the root cause is spot on. Let me expand on what is happening and walk through the recommended approaches to address it.


WHY THIS HAPPENS -- IEEE 754 FLOATING-POINT REPRESENTATION

Both FLOAT and DOUBLE are IEEE 754 binary floating-point types. They store numbers in base-2 (binary), not base-10 (decimal). Many base-10 values that look "clean" to us -- like 0.1, 0.3, or 12.345 -- cannot be represented exactly in binary. They become infinitely repeating binary fractions, similar to how 1/3 = 0.3333... repeats forever in decimal.

FLOAT uses 32 bits (approximately 7 decimal digits of precision).
DOUBLE uses 64 bits (approximately 15-16 decimal digits of precision).

When your original FLOAT value stores 12.345, it actually stores the closest 32-bit binary approximation, which might be something like 12.345000267028809. At FLOAT precision (7 digits), this displays as 12.345 because the display rounds to the available precision. But when you widen to DOUBLE (16 digits), the extra bits now reveal that underlying binary approximation. The number did not change -- you are simply seeing more digits of what was always an approximation.
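This widening effect can be reproduced outside of Spark in a few lines of plain Python (a sketch using the standard struct module; the 32-bit pack/unpack round-trip stands in for a FLOAT column value being read back as a DOUBLE):

```python
import struct

# Emulate a 32-bit FLOAT column value: pack 12.345 into 4 bytes
# (IEEE 754 single precision), then unpack it. Python floats are
# 64-bit doubles, so the unpacked value is exactly the FLOAT bit
# pattern widened to DOUBLE -- the same promotion Spark performs.
as_float32 = struct.unpack('f', struct.pack('f', 12.345))[0]

print(repr(as_float32))      # 12.345000267028809
print(as_float32 == 12.345)  # False: the FLOAT approximation is not
                             # the same number as the DOUBLE literal
```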

As the Databricks documentation states for both FLOAT and DOUBLE: "FLOAT/DOUBLE is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact."

When Spark widens a FLOAT to a DOUBLE (whether through explicit CAST, Parquet schema evolution, or Delta type widening), it performs a direct binary promotion of the IEEE 754 representation. It does NOT re-interpret the decimal string "12.345" -- it promotes the raw 32-bit binary pattern to a 64-bit binary pattern. The additional mantissa bits are filled with zeros, but the existing imprecision from the 32-bit representation is now visible at 64-bit display precision.

This is why you see results like:
FLOAT value: 12.345
DOUBLE value: 12.345000267028809

And why SUM aggregations show small differences -- those tiny per-row artifacts accumulate across many rows.
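The accumulation effect is easy to see in plain Python as well. This sketch compares summing 100,000 copies of 0.1 as native doubles against summing the same values after each has been squeezed through a 32-bit FLOAT, mimicking the per-row Legacy FLOAT -> DOUBLE path:

```python
import struct

def to_float32(x):
    # Round a Python double to the nearest 32-bit FLOAT, returned as a
    # double -- mimicking Legacy FLOAT -> Parquet/Delta DOUBLE per row.
    return struct.unpack('f', struct.pack('f', x))[0]

rows = [0.1] * 100_000  # one hundred thousand rows that "look like" 0.1

sum_as_double   = sum(rows)                         # 64-bit values throughout
sum_via_float32 = sum(to_float32(v) for v in rows)  # each row carries the FLOAT artifact

print(sum_as_double)                         # close to 10000, tiny double artifact
print(sum_via_float32)                       # drifts further: per-row FLOAT error accumulates
print(abs(sum_via_float32 - sum_as_double))  # the aggregate-level difference you observed
```

Each row's FLOAT artifact is on the order of 1e-9, invisible on its own, but 100,000 rows push the total off by roughly 1e-4 -- exactly the kind of small SUM discrepancy described above.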


SOLUTION 1: USE DECIMAL TYPE (RECOMMENDED FOR EXACT PRECISION)

If your use case requires exact decimal precision (financial data, measurements that must match source systems exactly, audit-sensitive data), DECIMAL(p,s) is the right choice. DECIMAL stores numbers in base-10, so 12.345 is stored as exactly 12.345 with no binary approximation.

You specify the precision (total digits) and scale (digits after decimal point). The maximum precision supported in Databricks is 38.

The critical technique when migrating from FLOAT: cast to STRING first, then to DECIMAL. This preserves the displayed decimal value rather than carrying over the binary approximation. If you cast FLOAT directly to DECIMAL, you may still see the artifact digits.

Example:

-- Recommended: FLOAT -> STRING -> DECIMAL to preserve displayed values
CREATE TABLE my_catalog.my_schema.my_table_new AS
SELECT
CAST(CAST(float_column AS STRING) AS DECIMAL(10,4)) AS float_column,
other_columns
FROM my_catalog.my_schema.my_table_old;

Or in PySpark:

from pyspark.sql.functions import col

# Cast through STRING first so the DECIMAL keeps the displayed value,
# then overwrite the table (overwriteSchema permits the type change)
spark.read.table("my_catalog.my_schema.my_table") \
    .withColumn("amount", col("amount").cast("string").cast("decimal(10,4)")) \
    .write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("my_catalog.my_schema.my_table")
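Why the string-first cast matters can also be sketched in plain Python, using the standard decimal module as a stand-in for Spark's DECIMAL type. Note the '%.7g' format here is my rough approximation of rendering a FLOAT at its ~7 significant digits of display precision, not the exact Spark FLOAT-to-STRING algorithm:

```python
import struct
from decimal import Decimal

# A FLOAT value as it arrives after widening to DOUBLE
widened = struct.unpack('f', struct.pack('f', 12.345))[0]  # 12.345000267028809

# Direct conversion: the binary artifact survives into the DECIMAL digits
direct = Decimal(widened).quantize(Decimal('1.0000000000'))

# String-first: render at roughly FLOAT display precision, then parse.
# '%.7g' (~7 significant digits) is an assumption standing in for
# Spark's shortest-round-trip FLOAT -> STRING rendering.
via_string = Decimal('%.7g' % widened).quantize(Decimal('1.0000000000'))

print(direct)      # 12.3450002670
print(via_string)  # 12.3450000000
```

At a small scale such as DECIMAL(10,4) both paths round to the same value; the difference shows up when the target scale is large enough to expose the artifact digits, as in the DECIMAL(20,10) analogue above.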


SOLUTION 2: ROUND AFTER WIDENING TO DOUBLE

If you prefer to stay with DOUBLE (for performance or compatibility reasons), you can round values after the conversion to remove the artifact digits:

SELECT ROUND(CAST(float_column AS DOUBLE), 4) AS float_column
FROM my_table;

This works well when you know the intended scale of your data. Note that ROUND on a DOUBLE still returns a DOUBLE, so extremely precise equality comparisons may still show tiny artifacts. For aggregations and display, this is usually sufficient.


SOLUTION 3: DELTA TYPE WIDENING WITH POST-PROCESSING

If you are using Delta type widening (available in Databricks Runtime 15.4 LTS and above, Public Preview), you can widen FLOAT to DOUBLE in-place without rewriting data files:

ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
ALTER TABLE my_table ALTER COLUMN float_column TYPE DOUBLE;

This is efficient because it stores the type change as metadata rather than rewriting data files. However, the same precision artifacts will appear. You would still need to apply ROUND or convert to DECIMAL in downstream queries or views if exact decimal representation matters.

More details: https://docs.databricks.com/delta/type-widening.html


BEST PRACTICES FOR NUMERIC PRECISION IN DATA MIGRATIONS

1. Choose the right type upfront. If you need exact decimal values (money, prices, rates), use DECIMAL(p,s) from the start. FLOAT and DOUBLE are appropriate for scientific measurements and calculations where small approximation errors are acceptable.

2. When migrating FLOAT to exact types, always go FLOAT -> STRING -> DECIMAL to avoid carrying binary artifacts into your DECIMAL column.

3. Document the intended precision. If a column represents currency with 2 decimal places, make it DECIMAL(18,2). If it represents a percentage to 4 places, use DECIMAL(10,4).

4. Test with known values before a full migration. Use values you know are problematic (0.1, 0.3, numbers with many decimal places) to verify your conversion approach.

5. Use ROUND strategically. If staying with DOUBLE, wrap conversions in ROUND with the appropriate scale for your data.

6. Be cautious with equality comparisons on floating-point columns. Instead of direct = comparisons, use a tolerance range such as ABS(a - b) < 0.0001, or compare ROUNDed values.
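The tolerance-comparison practice in point 6 can be illustrated in plain Python (the legacy/migrated totals here are made-up stand-ins for values on either side of the FLOAT -> DOUBLE migration):

```python
import struct

legacy_total   = 12.345  # hypothetical total from the legacy platform
migrated_total = struct.unpack('f', struct.pack('f', 12.345))[0]  # after FLOAT -> DOUBLE

# Exact equality fails even though the values are "the same" to a human
print(migrated_total == legacy_total)                      # False

# Tolerance-based comparison succeeds
print(abs(migrated_total - legacy_total) < 0.0001)         # True

# Comparing ROUNDed values at the intended scale also succeeds
print(round(migrated_total, 4) == round(legacy_total, 4))  # True
```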


DOCUMENTATION REFERENCES

- FLOAT data type: https://docs.databricks.com/sql/language-manual/data-types/float-type.html
- DOUBLE data type: https://docs.databricks.com/sql/language-manual/data-types/double-type.html
- DECIMAL data type: https://docs.databricks.com/sql/language-manual/data-types/decimal-type.html
- CAST function behavior: https://docs.databricks.com/sql/language-manual/functions/cast.html
- SQL data type rules (type promotion): https://docs.databricks.com/sql/language-manual/sql-ref-datatype-rules.html
- Delta type widening (Public Preview): https://docs.databricks.com/delta/type-widening.html
- ROUND function: https://docs.databricks.com/sql/language-manual/functions/round.html

The key takeaway is that your data is not being corrupted -- the DOUBLE type is simply revealing more digits of the same binary approximation that FLOAT was hiding. For exact decimal precision that matches your legacy system, DECIMAL(p,s) with the FLOAT -> STRING -> DECIMAL conversion path is the way to go.

Hope this helps!

* This reply used an agent system I built to research and draft a response based on the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the reply when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.