Wednesday
Hi,
I am trying to read a CSV file located in an S3 bucket folder. The CSV file contains around 50 columns, one of which is "litre_val", containing values like "60211.952" and "59164.608", i.e. up to 3 decimal places.
To read this CSV we are using the Spark API like below:
spark_df = spark.read.format("csv").option("header", "true").load(f"s3://{s3_bucket}/{folder}/{filename}.csv")
After reading the file with all columns as strings (precisely to retain the precision), when we try to display the data, it reads only up to 2 decimal places, like below.
All the values are being read up to only 2 decimal places. The data type is string, so it should retain the precision. There is also another column which contains up to 4 decimal places; that one is likewise being read up to two decimal places, as a string.
Can someone suggest how to retain precision in this case?
Any help would be highly appreciated.
Thursday
Can you please try with something like:
from pyspark.sql.types import StructType, StructField, StringType, DecimalType
# Define the schema with the appropriate precision and scale for decimal columns
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
    # Add other columns as needed
    StructField("litre_val", DecimalType(precision=10, scale=3), True),
    StructField("another_decimal_column", DecimalType(precision=10, scale=4), True)
    # Add other columns as needed
])
# Read the CSV file using the defined schema
spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load(f"s3://{s3_bucket}/{folder}/{filename}.csv")
# Display the DataFrame to verify the precision
spark_df.show()
In this example, replace "column1", "column2", etc., with the actual column names from your CSV file. The DecimalType(precision=10, scale=3) specifies that the litre_val column should be read as a decimal with a precision of 10 and a scale of 3. Adjust the precision and scale values as needed for your specific use case.
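As a quick illustration of what precision and scale mean here (a minimal sketch, not part of the original reply): DecimalType(10, 3) allows up to 10 digits in total, 3 of them after the decimal point, so a value like 60211.952 fits without loss.
from pyspark.sql.functions import lit
from pyspark.sql.types import DecimalType
# Cast a sample string value to DecimalType(10, 3) and confirm all three
# decimal places are kept.
spark.range(1).select(
    lit("60211.952").cast(DecimalType(10, 3)).alias("litre_val")
).show()
# +---------+
# |litre_val|
# +---------+
# |60211.952|
# +---------+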
Thursday
Hi @Walter_C
Thanks for the reply. I can specify the schema, but that would not be ideal in our case, for two reasons:
1. We have around 20 different sources to read data from. Each source has different columns with such values, and the precision can differ as well: for some columns it is up to 3 decimal places, while for others it can be up to 5.
2. The schema can change over time.
So I need a generic solution where I can read the data, retaining the precision, in the bronze layer and then apply the schema in the silver layer. That's why I thought of reading it as a string.
Is there any other way we can retain precision?
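To make the idea concrete, here is a rough sketch of the pattern I have in mind (the per-source mapping and the precision/scale values are made-up examples): land everything as string in bronze, keep a small per-source config of decimal columns, and cast in silver.
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
# Bronze: no schema and no inference, so every column stays a string and no
# precision can be lost at ingestion time.
bronze_df = (
    spark.read.format("csv")
    .option("header", "true")
    .load(f"s3://{s3_bucket}/{folder}/{filename}.csv")
)
# Silver: hypothetical per-source mapping of decimal columns to (precision, scale).
decimal_columns = {
    "litre_val": (18, 3),
    "another_decimal_column": (18, 4),
}
silver_df = bronze_df
for column_name, (precision, scale) in decimal_columns.items():
    silver_df = silver_df.withColumn(
        column_name, col(column_name).cast(DecimalType(precision, scale))
    )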
Thursday
E.g.:
from pyspark.sql.functions import format_number, col
df = spark.read.parquet("<your-parquet-file-path>")
df = df.withColumn("<formatted-column>", format_number(col("<your-decimal-column>"), 2))
Doc - https://docs.databricks.com/en/sql/language-manual/functions/format_number.html
Thursday
But I agree with @Walter_C on specifying the schema and making sure String type does not cause any truncation.
Thursday
I wonder if this isn't just strange display behavior. Can you use the show method on the dataframe instead of display and see the result? Or save the dataframe as a parquet file and see what the column looks like after saving.
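For example, the parquet round-trip check could look like this (the path is just a scratch location, pick any temp path you like):
# Persist the dataframe, read it back, and inspect the column to rule out a
# display-only issue.
df.write.mode("overwrite").parquet("/tmp/litre_val_precision_check")
spark.read.parquet("/tmp/litre_val_precision_check") \
    .select("litre_val") \
    .show(truncate=False)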
Thursday - last edited Thursday
Agree with @szymon_dybczak. If the columns are read as strings, Spark doesn't lose any precision or decimal places; it might simply be how the data is shown in the console or the Databricks UI. To see all decimals, disable truncation in .show() or select the column directly in the UI. As szymon suggested, an easy check is to store the data somewhere else and then inspect the persisted data directly.
E.g.: df.select("litre_val").show(truncate=False) or display(df.select("litre_val"))
You'll then see the full value. If you later need proper numeric types with guaranteed precision, apply a DecimalType in the silver layer or cast the columns at that stage.
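For instance, a cast at that stage might look like this (the precision and scale of 18 and 3 are only assumptions; choose values wide enough for your data):
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
# Assumed precision/scale; adjust to the widest values the source can produce.
silver_df = df.withColumn("litre_val", col("litre_val").cast(DecimalType(18, 3)))
silver_df.printSchema()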
Thursday
Hi @VZLA and @szymon_dybczak
Yes, I did that already. I stored the dataframe as a table in Databricks and then displayed the content like below:
And the data type is string:
yesterday
@nikhil_kumawat can you provide more details to reproduce this so we can better help you? E.g. sample data set, DBR version, reproducer code, etc.
I have this sample data:
csv_content = """column1,column2,litre_val,another_decimal_column
1,TypeA,60211.952,12.3459
2,TypeB,59164.608,45.6789
3,TypeC,12345.678,78.9012
"""
I'm then storing this as a CSV file in my DBFS temp location, and reading it back without specifying a schema or any explicit inference, so every column comes back as a string.
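For the write step, something like this works (using the same placeholder path as in the read below):
# Write the sample content to a temp DBFS path (illustrative path).
dbutils.fs.put("/some/path/to/test_data.csv", csv_content, overwrite=True)
And the read: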
# Reading the CSV file without using any schema
df = spark.read.format("csv").option("header", "true").load("/some/path/to/test_data.csv")
And when displaying it using:
df.show(truncate=False)
df.printSchema()
I'm seeing the results as:
+-------+-------+---------+----------------------+
|column1|column2|litre_val|another_decimal_column|
+-------+-------+---------+----------------------+
|1 |TypeA |60211.952|12.3459 |
|2 |TypeB |59164.608|45.6789 |
|3 |TypeC |12345.678|78.9012 |
+-------+-------+---------+----------------------+
root
|-- column1: string (nullable = true)
|-- column2: string (nullable = true)
|-- litre_val: string (nullable = true)
|-- another_decimal_column: string (nullable = true)
Using display() does not alter the results: