Dangerous implicit type conversions on 17.3 LTS.

Jarno
Visitor

Starting with DBR 17, which runs Spark 4.0, spark.sql.ansi.enabled is set to true by default. With the flag enabled, strings are implicitly converted to numbers in a very dangerous manner. Consider:

SELECT 123='123';
SELECT 123='123X';

The first one succeeds, but the second fails with a hard error when trying to convert '123X' to an INT. This poses a serious danger in setups where data arrives from different sources and string-typed columns may unintentionally contain numbers. If a non-number is then introduced into the data at some point, pipelines that previously worked suddenly crash. When spark.sql.ansi.enabled is true, all potentially failing implicit type conversions should be disallowed, such as implicit casts from STRING to INT.
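If the strict ANSI default is kept, one way to protect such pipelines is to make the cast explicit and failure-tolerant with TRY_CAST, which returns NULL instead of raising an error. A minimal sketch, assuming a hypothetical table events with a string column user_id:

-- Hypothetical names; TRY_CAST turns '123X' into NULL instead of failing the query,
-- so bad rows are simply filtered out rather than crashing the pipeline.
SELECT *
FROM events
WHERE TRY_CAST(user_id AS INT) = 123;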
 
Also, it seems that on DBR 17.3 (Spark 4.0.0), this fails, but on Spark 4.0.1 it results in NULL:
SELECT '123X'::int;
The Spark 4.0.1 behaviour is definitely more desirable as it makes the '::' cast operator a convenient way to sanitize column types.
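For code that has to behave the same on both patch levels, an explicit TRY_CAST gives the NULL-on-failure behaviour regardless of version (a suggested workaround, not something the post relies on):

SELECT TRY_CAST('123X' AS INT);  -- NULL
SELECT TRY_CAST('123' AS INT);   -- 123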

1 REPLY

Hubert-Dudek
Esteemed Contributor III

Under ANSI rules, INT + STRING resolves to BIGINT (Long), which is why it crashes: https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html. There are some cases where it works, such as the typed literals 1Y or 1L.
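For reference, 1Y and 1L are typed numeric literals, so no STRING cast is involved:

SELECT 1Y;  -- TINYINT literal
SELECT 1L;  -- BIGINT literal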


Regarding 4.0.1, can you double-check spark.sql.ansi.enabled there? Returning NULL is the behavior you get with ANSI disabled.
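A quick way to confirm the flag on the cluster where 4.0.1 was tested (in a SQL session):

SET spark.sql.ansi.enabled;           -- shows the current value
SET spark.sql.ansi.enabled = false;   -- with ANSI off, SELECT '123X'::int returns NULL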


My blog: https://databrickster.medium.com/
