Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Dangerous implicit type conversions on 17.3 LTS.

Jarno
New Contributor

Starting with DBR 17 running Spark 4.0, spark.sql.ansi.enabled is set to true by default. With the flag enabled, strings are implicitly converted to numbers in a very dangerous manner. Consider

SELECT 123='123';
SELECT 123='123X';

The first one is successful, but the second fails with a hard error when trying to convert '123X' to an INT. This poses a serious danger in setups where data might come from different sources and might unintentionally have numbers in string-typed columns. If a non-number is introduced into the data at some point, data pipelines that previously worked suddenly crash. When spark.sql.ansi.enabled is true, all potentially failing implicit type conversions should be disallowed, such as implicit casts from STRING to INT.
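For illustration, assuming a hypothetical table events with a STRING column user_id, a query like the following keeps working only as long as every value in the column happens to be numeric:

SELECT * FROM events WHERE user_id = 42;
-- works until a non-numeric user_id value appears, then fails at runtime under ANSI mode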
 
Also, it seems that on DBR 17.3 (Spark 4.0.0), this fails, but on Spark 4.0.1 it results in NULL:
SELECT '123X'::int;
The Spark 4.0.1 behaviour is definitely more desirable as it makes the '::' cast operator a convenient way to sanitize column types.
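For comparison, Spark's try_cast function gives that NULL-on-error behaviour regardless of how '::' behaves, so it can serve the same sanitizing purpose; a minimal sketch:

SELECT try_cast('123' AS INT);   -- 123
SELECT try_cast('123X' AS INT);  -- NULL instead of a runtime error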

4 REPLIES

Hubert-Dudek
Esteemed Contributor III

Under ANSI rules, INT + STRING resolves to BIGINT (Long), which is why it crashes: https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html. There are some cases where it works, such as the typed literals 1Y or 1L.
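For reference, those suffixes are Spark SQL's typed numeric literals:

SELECT 1Y;  -- TINYINT literal
SELECT 1L;  -- BIGINT literal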


Regarding 4.0.1, can you double-check spark.sql.ansi.enabled there? NULL on a failed cast is the behavior you get with ANSI disabled.
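One quick way to verify in a SQL session is to query the config directly:

SET spark.sql.ansi.enabled;
-- returns the current value of the flag for the session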


My blog: https://databrickster.medium.com/

Jarno
New Contributor

The flag is enabled in Spark 4.0.1. The Spark documentation also states that it's enabled by default starting with Spark 4, so presumably it's going to be enabled in future releases too. The Databricks documentation says the same for DBR 17 and later, which run Spark 4.

I could not find any indication that the ANSI standard says that the string '123' should be implicitly convertible to an integer. If there are cases where a string isn't implicitly convertible, then no string should be implicitly convertible. It shouldn't be the case that some strings are implicitly converted and others are not, with no way of detecting code where potentially unsafe conversions occur. As it is, spark.sql.ansi.enabled is not safe to enable in any production environment.

iyashk-DB
Databricks Employee

In DBR 17 (Spark 4.0), spark.sql.ansi.enabled defaults to true, so implicit coercions (like comparing INT to STRING) will parse the string and raise a runtime error if it's malformed. To keep pipelines resilient, avoid relying on implicit casts and instead use safe casts such as try_cast or the ?:: operator to return NULL on bad input, or pre-filter/normalize strings before casting. Note that :: is a strict alias of CAST and will error under ANSI, so use ?::/try_cast for NULL-on-error semantics. Only consider disabling ANSI (SET ANSI_MODE = false or SET spark.sql.ansi.enabled = false) if you explicitly want legacy leniency at the cost of data-quality protections.
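A minimal sketch of the safe-cast approach, using try_cast (table and column names are hypothetical; the ?:: operator is Databricks SQL syntax and may not be available in open-source Spark):

SELECT try_cast(user_id AS BIGINT) AS user_id
FROM events
WHERE try_cast(user_id AS BIGINT) IS NOT NULL;  -- drop rows that a strict cast would fail on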

Jarno
New Contributor

FYI, it seems I was mistaken about the behaviour of '::' on Spark 4.0.1. It does indeed work like CAST on both DBR 17.3 and Spark 4.0.1 and raises an exception on '123X'::int. The '?::' operator seems to be a Databricks-only extension at the moment (like '::' used to be).

In any case, it is not feasible to protect every site where an implicit cast occurs with try_cast or by some other means, because there is no way of knowing where such conversions might occur; you only find out after you get an unexpected exception. And since there is no exception handling mechanism in SQL, there's no way to recover automatically, so manual intervention is required to fix it. It can also be extremely difficult to find where the exception occurred, since the error message doesn't say, and since a LIMIT clause might make a failing query run just fine because the offending value might not occur in the limited dataset. Disabling spark.sql.ansi.enabled seems to be the only viable option.
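For completeness, if one does go that route, disabling ANSI mode is a single session-level setting (it can also be set in the cluster's Spark config):

SET spark.sql.ansi.enabled = false;
-- restores the lenient legacy coercions, at the cost of the data-quality checks discussed above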
