Data Engineering

Saving Number field as String in Databricks

Manju1202
New Contributor II

Is there any risk in saving a Number field as a String?

Will we lose any functionality/feature if we save it as a String?

Will it have any impact on performance?

3 REPLIES

pvignesh92
Honored Contributor

Hi @Manju Chugani. Yes. In short, it is not recommended to store columns as strings if all the values are expected to be numbers.

Here are some of the risks:

  1. Storage Space: Storing numbers as strings can take up more storage space than storing them as numbers. A string needs roughly one byte per digit plus length overhead, whereas an INT is a fixed 4 bytes and a BIGINT a fixed 8 bytes no matter how many digits the value has. Columnar compression can narrow the gap, but it rarely removes it.
  2. Performance: Calculations on string columns are slower than on numeric columns, because every value has to be cast to a number before the operation can run. That per-row conversion adds overhead.
  3. Sorting and Filtering: Sorting and filtering can be slower with strings than with numbers, because strings are compared character by character rather than as fixed-width values. The order is also lexicographic, so "10" sorts before "9" unless you cast the column first.
  4. Type Checking: With a string column, nothing enforces that the data is actually numeric. Non-numeric values can be written without any error, which leads to inconsistencies that only surface later.
  5. Data Integrity: Storing values as strings increases the risk of data integrity issues, such as input errors or unexpected formats (for example "N/A" or "1,000"). This makes the data harder to analyze and can lead to inaccurate results.
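
A few of the points above can be seen in a small sketch. This is plain Python rather than Spark, and the sample values and the helper `is_valid_int` are made up for illustration, so treat it as a rough picture of the behaviour, not as a measurement of what Databricks does internally.

```python
import struct

# Hypothetical sample: the same values held as numbers and as text.
values = [9, 10, 100, 25]
as_text = [str(v) for v in values]

# Numeric sort follows magnitude; text sort compares character by character.
sorted_numeric = sorted(values)
sorted_text = sorted(as_text)
print(sorted_numeric)  # [9, 10, 25, 100]
print(sorted_text)     # ['10', '100', '25', '9']

# Raw size of one value: fixed-width 32-bit integer vs. its UTF-8 digits.
int_size = struct.calcsize("<i")
text_size = len("1234567890".encode("utf-8"))
print(int_size, text_size)  # 4 10


def is_valid_int(text: str) -> bool:
    """Return True if the text parses as an integer (illustrative helper)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


# A text column accepts anything, so bad values only show up when parsed.
raw_column = ["42", "17", "N/A", "3O"]  # "3O" ends with the letter O
bad_values = [v for v in raw_column if not is_valid_int(v)]
print(bad_values)  # ['N/A', '3O']
```

With a numeric column type, values like "N/A" would be rejected or turned into nulls when the data is written, instead of sitting unnoticed in the table.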

Thank you for the response, the info is very helpful.

Do you see any issue with any mathematical function - other than performance? Will the outcome of any mathematical functions be different for string vs number?

pvignesh92
Honored Contributor

Hi @Manju Chugani, mathematical functions will definitely be a concern. In my experience, we sometimes store dates as strings, and greater-than or less-than comparisons work fine (this holds for formats such as yyyy-MM-dd, where text order matches chronological order). But with min and max, integers stored as strings can misbehave, because the values are compared character by character: the max of "9" and "10" is "9".

You can try this yourself by storing some integer values as strings in a dataframe and running sum, min, max and a few more functions to see the differences.
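
As a rough sketch of that experiment, here is the same comparison in plain Python instead of a Spark dataframe. The sample values are made up, and Python is only standing in for the engine here; the point is the difference between text comparison and numeric comparison.

```python
# Hypothetical sample: integer values that arrived as text.
amounts_text = ["9", "10", "100"]
amounts_int = [int(a) for a in amounts_text]

# min/max on text compare character by character, so "9" beats "100".
min_text, max_text = min(amounts_text), max(amounts_text)
min_int, max_int = min(amounts_int), max(amounts_int)
print(min_text, max_text)  # 10 9
print(min_int, max_int)    # 9 100

# The same mismatch shows up in simple comparisons.
print("9" > "10")  # True  (text comparison)
print(9 > 10)      # False (numeric comparison)

# Summing needs real numbers; Python refuses to add text values.
total = sum(amounts_int)
try:
    sum(amounts_text)
    text_sum_error = None
except TypeError as exc:
    text_sum_error = type(exc).__name__
print(total, text_sum_error)  # 119 TypeError
```

A string column in a dataframe is compared as text in the same way, so casting to a numeric type before aggregating is the safer habit.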
