cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Understanding least common type in databricks

Dhruv-22
New Contributor III
I was reading the data type rules and found about least common type.

I have a doubt. What is the least common type of STRING and INT? The referred link gives the following example saying the least common type is BIGINT.

-- The least common type between an INT and STRING is BIGINT
> SELECT typeof(coalesce(5, '6'));
  BIGINT

However, running this on Databricks 13.3, Spark 3.4.1 and Scala 2.12 configurations gives the following output.

> SELECT typeof(coalesce(5, '6'));
  STRING

If I run explain on the query, then I get the following output.

> EXPLAIN EXTENDED SELECT typeof(coalesce(5, '6'));
== Analyzed Logical Plan ==
typeof(coalesce(5, 6)): string
Project [typeof(coalesce(cast(5 as string), 6)) AS typeof(coalesce(5, 6))#239637]
+- OneRowRelation

It shows that INT is cast to STRING so the least common type should be STRING.

However, if I run an equal to operation between STRING and INT, then the explain query gives different output.

> EXPLAIN EXTENDED SELECT '1.00' = 1;
== Analyzed Logical Plan ==
(1.00 = 1): boolean
Project [(cast(1.00 as int) = 1) AS (1.00 = 1)#239661]
+- OneRowRelation

Here STRING is cast to INT.

Both coalesce and = rely on the least common type and give different results. So, what is exactly the least common type of STRING and INT? Also, which is more narrower STRING or INT? Any simple data type can be converted to STRING without any data loss, like an INT can be converted to STRING without any loss. However, the reverse is not true. If a STRING containing a decimal value is converted to INT, there would be a value loss. So shouldn't STRING be the wider type?

2 REPLIES 2

Kaniz_Fatma
Community Manager
Community Manager

Hi @Dhruv-22The concept of the least common type can indeed be a bit tricky, especially when dealing with different data types like STRING and INT.

Let’s dive into this and clarify the behaviour in Apache Spark™ and Databricks.

  1. Coalesce Function:

    • The coalesce function in Spark is used to return the first non-null value from a list of expressions.
    • When using coalesce, the least common type is determined by the order of the arguments.
    • In your example, SELECT typeof(coalesce(5, '6'));, the order of arguments is INT followed by STRING.
    • Since BIGINT is a wider type that can accommodate both INT and STRING, it is chosen as the result type.
    • However, in your specific environment (Databricks 13.3, Spark 3.4.1, Scala 2.12), the result is STRING. This might be due to specific configuration or version differences.
  2. Equal To (=) Operation:

    • When performing an equal to operation (=), Spark performs implicit type casting.
    • In your second example, SELECT '1.00' = 1;, the STRING '1.00' is successfully cast to an INT.
    • As a result, both operands have the same type (INT), and the comparison works as expected.
  3. Wider Type:

    • In general, STRING is considered wider than INT because any simple data type can be safely converted to STRING without any loss of data.
    • For example:
      • An INT can be converted to STRING without any loss.
      • However, the reverse is not true: if a STRING containing a decimal value is converted to INT, there would be a value loss (truncation of decimal places).
  4. Summary:

    • The least common type between STRING and INT is STRING.
    • Both coalesce and = rely on the least common type, but they might produce different result types depending on the operation and the order of the arguments.

Remember that data type behavior can vary based on the specific environment and configuration. If you encounter unexpected results, it’s essential to verify the behavior in your specific context. 🤓🔍

1: Understanding least common type in Databricks - Stack Overflow

 

Dhruv-22
New Contributor III

Hey, if STRING is the wider type then why did you say that the least common type between STRING and INT is STRING?

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group