Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cannot cast Decimal to Double

Splush_
New Contributor II

Hey,

I'm trying to save the contents of a database table to a Databricks Delta table. The schema read straight from the database returns the number fields as decimal(38, 10), and at least one of the values is too large for this data type. So I try to convert it using:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

for field in df.schema.fields:
    if isinstance(field.dataType, DecimalType):
        df = df.withColumn(field.name, col(field.name).cast('double'))

When I print the schema, every decimal is indeed cast to a double. Yet I always seem to get this error:

[NUMERIC_VALUE_OUT_OF_RANGE.WITHOUT_SUGGESTION] The 339600000000000000000000000000000000000.0000000000 rounded half up from 339600000000000000000000000000000000000 cannot be represented as Decimal(38, 10). SQLSTATE: 22003

which does not make sense at all - is there anything I'm missing?

Best regards

Samuel

5 REPLIES

VZLA
Databricks Employee

Hello, thank you for your question.

The issue you’re facing stems from a value exceeding the range allowed by Decimal(38,10) before it can be successfully cast to Double. This happens because the value is already invalid for the Decimal type, causing an overflow during Spark's processing.

Suggested Solution

  1. Convert to String First: By converting the Decimal column to String before casting to Double, you can avoid the overflow issue. Once converted to a string, Spark can then interpret the value as a Double.

  2. Use Conditional Replacement: Safeguard against uncastable values by using a condition to replace problematic values with null or a default fallback. This avoids runtime errors while maintaining data integrity.

  3. Filter Out Extreme Values: Before performing the cast, filter rows with values that exceed the allowable range for Decimal(38,10). This ensures only valid values are processed.

  4. Inspect the Data: Examine the range of values in the source table to confirm the scope of the issue (a quick inspection sketch follows the examples below). This can guide decisions on whether filtering or another approach is most appropriate.

Hope it helps!

Example:

 

from pyspark.sql.functions import col, when
from pyspark.sql.types import DecimalType

# Iterate through schema fields and process DecimalType columns
for field in df.schema.fields:
    if isinstance(field.dataType, DecimalType):
        # Filter out extreme values
        df = df.filter(col(field.name) <= 10**38)
        # Handle invalid values by first casting to String and then to Double
        df = df.withColumn(
            field.name,
            when(
                col(field.name).cast("string").cast("double").isNotNull(),
                col(field.name).cast("string").cast("double")
            ).otherwise(None)
        )

We can still combine the filtering and transformation into a single pass to reduce computation:

from pyspark.sql.functions import col, when
from pyspark.sql.types import DecimalType

# Process DecimalType columns in one step
for field in df.schema.fields:
    if isinstance(field.dataType, DecimalType):
        df = df.withColumn(
            field.name,
            when(
                (col(field.name) <= 10**38) & col(field.name).cast("string").cast("double").isNotNull(),
                col(field.name).cast("string").cast("double")
            ).otherwise(None)  # Replace invalid or out-of-range values with null
        )
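
For step 4 above, a minimal inspection sketch might look like the one below. It assumes the DataFrame can at least be scanned, routes values through a string-to-double cast so the out-of-range decimals themselves do not trigger the overflow, and uses 1e28 as the threshold because a Decimal(38,10) leaves room for 28 integer digits:

from pyspark.sql.functions import col, count, when, min as spark_min, max as spark_max
from pyspark.sql.types import DecimalType

# Report the value range of each DecimalType column and count rows that
# exceed what Decimal(38,10) can hold (28 integer digits, i.e. |x| >= 1e28)
for field in df.schema.fields:
    if isinstance(field.dataType, DecimalType):
        as_double = col(field.name).cast("string").cast("double")
        df.select(
            spark_min(as_double).alias(f"{field.name}_min"),
            spark_max(as_double).alias(f"{field.name}_max"),
            count(when((as_double >= 1e28) | (as_double <= -1e28), True)).alias(f"{field.name}_overflow_rows"),
        ).show(truncate=False)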

 

Splush_
New Contributor II

Hey VZLA,

thank you so much for your answer and help!

Sadly, this code did not resolve my issue. First, I had to change your code to wrap the 10**38 in a Decimal to get it to run.

Executing this did indeed change the schema. But after running

df.where(col('PART_NR') == '6016-HP SIEGEL      ').show()

I get the exact same error after around 6 minutes of this thing running 😞

filipniziol

Hi @Splush_ ,

What is the source of the data? How did you end up with a DataFrame whose decimal(38,10) field exceeds that range?
Is there any chance you could keep this field as a string?

The code below starts with string values and successfully converts them to double:

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col, when, isnan

# Define schema with String type for large values
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value_str", StringType(), True)  # Store as string to avoid immediate precision errors
])

# Test data with values that exceed Decimal(38,10)
data = [
    ("valid_1", "123.4567890123"),  # Normal
    ("valid_2", "9876543210.1234567890"),  # Large but valid
    ("invalid_1", "9" * 38 + ".9999999999"),  # 99999999999999999999999999999999999999.9999999999 (Exceeds Decimal(38,10))
    ("invalid_2", "1E40"),  # Exceeds max precision
    ("invalid_3", "-1E40"),  # Negative overflow
    ("null_case", None)  # Null case
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Show initial data
display(df)

df_fixed = df.withColumn(
    "decimal_value",
    when(
        ~isnan(col("value_str").cast("double")),  # Ensure numeric format
        col("value_str").cast("double")
    ).otherwise(None)  # Replace invalid values with null
)

# Show processed data
display(df_fixed)

The result:

[Screenshot: the resulting df_fixed DataFrame with decimal_value shown as double]

 




Splush_
New Contributor II

Hey @filipniziol,

Thank you so much for your response and the code snippet you provided! You are right - it should not be possible for this number to be inside this DataFrame! The problem seems to be the Spark JDBC connector. The data source is an Oracle database. The column in question is of type FLOAT with a scale of 127 and a precision of 126.

However, Spark translates this column directly to a Decimal(38, 10) type, automatically inferring it from the source. Once it tries to cast the value, it apparently fails because the number is obviously too large for the type it is supposed to fit into.

I have found out that this FLOAT type is stored in the database as type "NUMBER" (Data Types). So it is apparently default behavior for Spark to translate this into a Decimal(38,10) (Solved: Re: Spark error - Decimal precision exceeds max pr... - Cloudera Community - 392840).

Sadly this info doesn't really help to solve my problem - I'm just unable to change it to anything else. Maybe there is an option t...
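
One documented knob that might address this (not tested in this thread) is the Spark JDBC reader's customSchema option, which overrides the inferred type for individual columns at read time. A minimal sketch, with placeholder connection details and a placeholder column name (WEIGHT_G):

# Sketch only: the URL, table, credentials and the WEIGHT_G column name are
# placeholders, not values from this thread.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
    .option("dbtable", "SCHEMA.PRODUCTS")
    .option("user", "username")
    .option("password", "password")
    # Override the inferred Decimal(38,10) for this column; "WEIGHT_G STRING"
    # followed by a later cast to double is the more defensive variant.
    .option("customSchema", "WEIGHT_G DOUBLE")
    .load()
)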

Best regards

Samuel

Splush_
New Contributor II

Hey guys,

Thank you a lot for your help. Since this has been taking days already, I have asked the application owners of the database to delete these values for me. Apparently they are weights in grams for some products - so the problematic rows are heavier than the whole solar system 😄

Since the rows are now removed I cannot test any further.

Best regards Samuel
