Hi @bunny_9090,
Your analysis of the root cause is spot on. Let me expand on what is happening and walk through the recommended approaches to address it.
WHY THIS HAPPENS -- IEEE 754 FLOATING-POINT REPRESENTATION
Both FLOAT and DOUBLE are IEEE 754 binary floating-point types. They store numbers in base-2 (binary), not base-10 (decimal). Many base-10 values that look "clean" to us -- like 0.1, 0.3, or 12.345 -- cannot be represented exactly in binary. They become infinitely repeating binary fractions, similar to how 1/3 = 0.3333... repeats forever in decimal.
FLOAT uses 32 bits (approximately 7 decimal digits of precision).
DOUBLE uses 64 bits (approximately 15-16 decimal digits of precision).
When your original FLOAT value stores 12.345, it actually stores the closest 32-bit binary approximation, which might be something like 12.345000267028809. At FLOAT precision (7 digits), this displays as 12.345 because the display rounds to the available precision. But when you widen to DOUBLE (16 digits), the extra bits now reveal that underlying binary approximation. The number did not change -- you are simply seeing more digits of what was always an approximation.
As the Databricks documentation states for both FLOAT and DOUBLE: "FLOAT/DOUBLE is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact."
When Spark widens a FLOAT to a DOUBLE (whether through explicit CAST, Parquet schema evolution, or Delta type widening), it performs a direct binary promotion of the IEEE 754 representation. It does NOT re-interpret the decimal string "12.345" -- it promotes the raw 32-bit binary pattern to a 64-bit binary pattern. The new low-order mantissa bits are zero-filled and the exponent is re-biased, but the existing imprecision from the 32-bit representation is now visible at 64-bit display precision.
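You can reproduce this promotion outside Spark with Python's struct module (a sketch of the principle: struct simply round-trips the bit patterns here, it is not Spark's cast path):

```python
import struct

def widen_float_to_double(x):
    # Store x in 32 bits, then read it back as a Python float (64-bit):
    # the same binary promotion Spark performs when widening FLOAT to DOUBLE.
    return struct.unpack('f', struct.pack('f', x))[0]

widened = widen_float_to_double(12.345)
print(widened)            # the 32-bit approximation, now visible at 64-bit precision
print(widened == 12.345)  # False: the FLOAT never held exactly 12.345
```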
This is why you see results like:
FLOAT value: 12.345
DOUBLE value: 12.345000267028809
And why SUM aggregations show small differences -- those tiny per-row artifacts accumulate across many rows.
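Those per-row artifacts are easy to quantify in plain Python (a sketch assuming every row held the FLOAT value 12.345; one million rows is an arbitrary illustrative count):

```python
import struct

# The 32-bit approximation of 12.345, seen at 64-bit precision
f32 = struct.unpack('f', struct.pack('f', 12.345))[0]

rows = 1_000_000
approx_total = f32 * rows    # stand-in for SUM over the widened DOUBLE column
exact_total = 12.345 * rows  # what the source system would report

# A sub-microscopic per-row error becomes visible in the aggregate
print(approx_total - exact_total)
```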
SOLUTION 1: USE DECIMAL TYPE (RECOMMENDED FOR EXACT PRECISION)
If your use case requires exact decimal precision (financial data, measurements that must match source systems exactly, audit-sensitive data), DECIMAL(p,s) is the right choice. DECIMAL stores numbers in base-10, so 12.345 is stored as exactly 12.345 with no binary approximation.
You specify the precision (total digits) and scale (digits after decimal point). The maximum precision supported in Databricks is 38.
The critical technique when migrating from FLOAT: cast to STRING first, then to DECIMAL. This preserves the displayed decimal value rather than carrying over the binary approximation. If you cast FLOAT directly to DECIMAL, you may still see the artifact digits.
Example:
-- Recommended: FLOAT -> STRING -> DECIMAL to preserve displayed values
CREATE TABLE my_catalog.my_schema.my_table_new AS
SELECT
  CAST(CAST(float_column AS STRING) AS DECIMAL(10,4)) AS float_column,
  other_columns
FROM my_catalog.my_schema.my_table_old;
Or in PySpark:
from pyspark.sql.functions import col
spark.read.table("my_catalog.my_schema.my_table") \
    .withColumn("amount", col("amount").cast("string").cast("decimal(10,4)")) \
    .write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("my_catalog.my_schema.my_table")
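Python's decimal module illustrates why the string hop matters (a sketch of the principle, not of Spark's cast internals): a decimal built from the binary value inherits the artifact, while one built from the decimal string does not.

```python
from decimal import Decimal
import struct

# The widened FLOAT value: the 32-bit approximation of 12.345, held as a double
f32 = struct.unpack('f', struct.pack('f', 12.345))[0]

# Building the decimal from the binary value drags the artifact along...
print(Decimal(f32))        # 12.34500026702880859375
# ...while building it from the decimal string preserves the intended value
print(Decimal("12.345"))   # 12.345
```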
SOLUTION 2: ROUND AFTER WIDENING TO DOUBLE
If you prefer to stay with DOUBLE (for performance or compatibility reasons), you can round values after the conversion to remove the artifact digits:
SELECT ROUND(CAST(float_column AS DOUBLE), 4) AS float_column
FROM my_table;
This works well when you know the intended scale of your data. Note that ROUND on a DOUBLE still returns a DOUBLE, so extremely precise equality comparisons may still show tiny artifacts. For aggregations and display, this is usually sufficient.
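In plain Python the same cleanup step looks like this (round standing in for SQL ROUND; as noted above, the result is still a binary double):

```python
import struct

# The widened FLOAT value, carrying the 32-bit artifact
f32 = struct.unpack('f', struct.pack('f', 12.345))[0]

cleaned = round(f32, 4)   # snap back to the double closest to 12.345
print(cleaned)            # 12.345
print(cleaned == 12.345)  # True: equal as doubles, though neither is exactly 12.345 in base 10
```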
SOLUTION 3: DELTA TYPE WIDENING WITH POST-PROCESSING
If you are using Delta type widening (available in Databricks Runtime 15.4 LTS and above, Public Preview), you can widen FLOAT to DOUBLE in-place without rewriting data files:
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
ALTER TABLE my_table ALTER COLUMN float_column TYPE DOUBLE;
This is efficient because it stores the type change as metadata rather than rewriting data files. However, the same precision artifacts will appear. You would still need to apply ROUND or convert to DECIMAL in downstream queries or views if exact decimal representation matters.
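It may help to confirm that the widening itself adds no new error: every 32-bit float is exactly representable in 64 bits, so the promotion is lossless with respect to the stored binary value (sketched here with struct; the artifacts you see predate the widening).

```python
import struct

for x in (0.1, 0.3, 12.345):
    f32 = struct.unpack('f', struct.pack('f', x))[0]  # the 32-bit value, held as a double
    # Round-tripping through 64 bits changes nothing: the promotion is exact...
    assert struct.unpack('d', struct.pack('d', f32))[0] == f32
    # ...but the 32-bit value already differed from the decimal literal
    assert f32 != x
```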
More details: https://docs.databricks.com/delta/type-widening.html
BEST PRACTICES FOR NUMERIC PRECISION IN DATA MIGRATIONS
1. Choose the right type upfront. If you need exact decimal values (money, prices, rates), use DECIMAL(p,s) from the start. FLOAT and DOUBLE are appropriate for scientific measurements and calculations where small approximation errors are acceptable.
2. When migrating FLOAT to exact types, always go FLOAT -> STRING -> DECIMAL to avoid carrying binary artifacts into your DECIMAL column.
3. Document the intended precision. If a column represents currency with 2 decimal places, make it DECIMAL(18,2). If it represents a percentage to 4 places, use DECIMAL(10,4).
4. Test with known values before a full migration. Try values you know are problematic (0.1, 0.3, numbers with many decimal places) to verify your conversion approach.
5. Use ROUND strategically. If staying with DOUBLE, wrap conversions in ROUND with the appropriate scale for your data.
6. Be cautious with equality comparisons on floating-point columns. Instead of direct = comparisons, use a tolerance range such as ABS(a - b) < 0.0001, or compare ROUNDed values.
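The tolerance comparison in item 6 can be sketched in plain Python (math.isclose is the stdlib equivalent; the 1e-4 tolerance is an assumption you should match to your own data's scale):

```python
import math
import struct

a = struct.unpack('f', struct.pack('f', 12.345))[0]  # widened FLOAT value
b = 12.345                                           # the intended DOUBLE literal

print(a == b)                            # False: direct equality fails
print(abs(a - b) < 1e-4)                 # True: tolerance comparison succeeds
print(math.isclose(a, b, abs_tol=1e-4))  # True: stdlib helper, same idea
print(round(a, 4) == round(b, 4))        # True: compare rounded values
```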
DOCUMENTATION REFERENCES
- FLOAT data type: https://docs.databricks.com/sql/language-manual/data-types/float-type.html
- DOUBLE data type: https://docs.databricks.com/sql/language-manual/data-types/double-type.html
- DECIMAL data type: https://docs.databricks.com/sql/language-manual/data-types/decimal-type.html
- CAST function behavior: https://docs.databricks.com/sql/language-manual/functions/cast.html
- SQL data type rules (type promotion): https://docs.databricks.com/sql/language-manual/sql-ref-datatype-rules.html
- Delta type widening (Public Preview): https://docs.databricks.com/delta/type-widening.html
- ROUND function: https://docs.databricks.com/sql/language-manual/functions/round.html
The key takeaway is that your data is not being corrupted -- the DOUBLE type is simply revealing more digits of the same binary approximation that FLOAT was hiding. For exact decimal precision that matches your legacy system, DECIMAL(p,s) with the FLOAT -> STRING -> DECIMAL conversion path is the way to go.
Hope this helps!
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.