<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Not able to retain precision while reading data from source file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103909#M41602</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;Thanks for the reply. I can specify the schema, but that would not be ideal in our case, for two reasons:&lt;/P&gt;&lt;P&gt;1. We have around 20 different sources to read data from. Each source has different columns holding such values, and the precision can differ as well: some columns have up to 3 decimal places, while others have up to 5.&lt;/P&gt;&lt;P&gt;2. The schema can change over time.&lt;/P&gt;&lt;P&gt;So I need a generic solution where I can read the data in the bronze layer, retaining the precision, and then apply the schema in the silver layer. That's why I thought of reading everything as a string.&lt;/P&gt;&lt;P&gt;Is there any other way to retain the precision?&lt;/P&gt;</description>
    <pubDate>Thu, 02 Jan 2025 12:29:44 GMT</pubDate>
    <dc:creator>nikhil_kumawat</dc:creator>
    <dc:date>2025-01-02T12:29:44Z</dc:date>
    <item>
      <title>Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103859#M41577</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am trying to read a CSV file located in an S3 bucket folder. The file contains around 50 columns, one of which is "litre_val", with values such as "60211.952" and "59164.608", i.e. up to 3 decimal places.&lt;/P&gt;&lt;P&gt;To read this CSV we are using the Spark API as below:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark_df = spark.read.format("csv").option("header", "true").load(f"s3://{s3_bucket}/{folder}/{filename}.csv")&lt;/LI-CODE&gt;&lt;P&gt;All the columns are read as &lt;STRONG&gt;string&lt;/STRONG&gt;, precisely to retain the precision, but when we display the data it shows only up to 2 decimal places, as below:&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="precision.png" style="width: 742px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13808i8FFC1C08CB93B2CB/image-size/large?v=v2&amp;amp;px=999" role="button" title="precision.png" alt="precision.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;All the values are shown with only 2 decimal places. The data type is string, so the precision should be retained. There is also another column that contains up to 4 decimal places, and it too is shown with only 2 decimal places even though it is a string.&lt;/P&gt;&lt;P&gt;Can someone suggest how to retain the precision in this case?&lt;/P&gt;&lt;P&gt;Any help would be highly appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 04:17:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103859#M41577</guid>
      <dc:creator>nikhil_kumawat</dc:creator>
      <dc:date>2025-01-02T04:17:07Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103905#M41600</link>
      <description>&lt;P&gt;Can you please try something like this:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Define the schema with the appropriate precision and scale for decimal columns
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
    # Add other columns as needed
    StructField("litre_val", DecimalType(precision=10, scale=3), True),
    StructField("another_decimal_column", DecimalType(precision=10, scale=4), True)
    # Add other columns as needed
])

# Read the CSV file using the defined schema
spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load(f"s3://{s3_bucket}/{folder}/{filename}.csv")

# Display the DataFrame to verify the precision
spark_df.show()&lt;/LI-CODE&gt;
&lt;P&gt;In this example, replace &lt;CODE&gt;"column1"&lt;/CODE&gt;, &lt;CODE&gt;"column2"&lt;/CODE&gt;, etc., with the actual column names from your CSV file. The &lt;CODE&gt;DecimalType(precision=10, scale=3)&lt;/CODE&gt; specifies that the &lt;CODE&gt;litre_val&lt;/CODE&gt; column should be read as a decimal with a precision of 10 and a scale of 3. Adjust the precision and scale values as needed for your specific use case.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 12:13:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103905#M41600</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-02T12:13:38Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103909#M41602</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;Thanks for the reply. I can specify the schema, but that would not be ideal in our case, for two reasons:&lt;/P&gt;&lt;P&gt;1. We have around 20 different sources to read data from. Each source has different columns holding such values, and the precision can differ as well: some columns have up to 3 decimal places, while others have up to 5.&lt;/P&gt;&lt;P&gt;2. The schema can change over time.&lt;/P&gt;&lt;P&gt;So I need a generic solution where I can read the data in the bronze layer, retaining the precision, and then apply the schema in the silver layer. That's why I thought of reading everything as a string.&lt;/P&gt;&lt;P&gt;Is there any other way to retain the precision?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 12:29:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103909#M41602</guid>
      <dc:creator>nikhil_kumawat</dc:creator>
      <dc:date>2025-01-02T12:29:44Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103937#M41607</link>
      <description>&lt;P&gt;E.g.:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import format_number, col

df = spark.read.parquet("&amp;lt;your-parquet-file-path&amp;gt;")
# Format the decimal column as a string rounded to 2 decimal places
df = df.withColumn("&amp;lt;formatted-column&amp;gt;", format_number(col("&amp;lt;your-decimal-column&amp;gt;"), 2))&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN&gt;Doc -&amp;nbsp;&lt;A href="https://docs.databricks.com/en/sql/language-manual/functions/format_number.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/functions/format_number.html&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 13:48:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103937#M41607</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-01-02T13:48:32Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103941#M41608</link>
      <description>&lt;P&gt;But I agree with&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt;&amp;nbsp;on specifying the schema and making sure String type does not cause any truncation.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 13:52:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103941#M41608</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-01-02T13:52:56Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103983#M41618</link>
      <description>&lt;P&gt;I wonder whether this is just odd display behavior. Can you use the show method on the DataFrame instead of display and check the result? Or save the DataFrame as Parquet and see what the column looks like after saving.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 16:44:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103983#M41618</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-01-02T16:44:51Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103989#M41623</link>
      <description>&lt;P&gt;Agree with&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;. If the columns are read as strings, Spark doesn’t lose any precision or decimal places. It might simply be how the data is shown in the console or the Databricks UI. To see all decimals, disable truncation in &lt;CODE&gt;.show()&lt;/CODE&gt; or select the column directly in the UI. As Szymon suggested, an easy check is to store the data somewhere else and then inspect the persisted data directly.&lt;/P&gt;
&lt;P&gt;E.g.:&amp;nbsp;&lt;CODE&gt;df.select("litre_val").show(truncate=False)&lt;/CODE&gt; or&amp;nbsp;&lt;CODE&gt;display(df.select("litre_val"))&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;You’ll then see the full value. If you later need proper numeric types with guaranteed precision, apply a DecimalType in the silver layer or cast the columns at that stage.&lt;/P&gt;
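&lt;P&gt;As a rough illustration of that silver-layer step, the sketch below derives a per-column precision and scale from values that were kept as strings, so nothing has to be hard-coded per source. It is plain Python rather than Spark, and the helper name &lt;CODE&gt;infer_decimal_spec&lt;/CODE&gt; is hypothetical. It assumes every non-blank value is a finite number in decimal notation such as "60211.952" (no thousands separators, no NaN or Infinity).&lt;/P&gt;

```python
from decimal import Decimal


def infer_decimal_spec(values):
    """Return (precision, scale) wide enough for every string in values.

    Hypothetical helper: None and blank strings are skipped, and an
    input with no usable values yields (0, 0).
    """
    max_int_digits = 0
    max_scale = 0
    for raw in values:
        if raw is None or not raw.strip():
            continue
        # as_tuple() exposes the digits and exponent exactly as written,
        # so trailing zeros such as "1.500" still count towards the scale.
        _sign, digits, exponent = Decimal(raw.strip()).as_tuple()
        scale = max(-exponent, 0)
        int_digits = max(len(digits) + exponent, 1)
        max_scale = max(max_scale, scale)
        max_int_digits = max(max_int_digits, int_digits)
    return max_int_digits + max_scale, max_scale


# Sample values taken from the question
print(infer_decimal_spec(["60211.952", "59164.608"]))  # (8, 3)
```

&lt;P&gt;The resulting pair could then feed a cast such as &lt;CODE&gt;col("litre_val").cast(DecimalType(8, 3))&lt;/CODE&gt; in the silver layer, ideally with some headroom added to the precision for values that arrive later.&lt;/P&gt;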
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 17:17:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/103989#M41623</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2025-01-02T17:17:23Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/104016#M41636</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34618"&gt;@VZLA&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Yes, I did that already. I stored the DataFrame as a table in Databricks and then displayed the content, as below:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="nikhil_kumawat_0-1735870355437.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13835i304ADF184A4D1613/image-size/medium?v=v2&amp;amp;px=400" role="button" title="nikhil_kumawat_0-1735870355437.png" alt="nikhil_kumawat_0-1735870355437.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;And the data type is string:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="nikhil_kumawat_1-1735870409146.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13836i7D0917B77C55EFDD/image-size/medium?v=v2&amp;amp;px=400" role="button" title="nikhil_kumawat_1-1735870409146.png" alt="nikhil_kumawat_1-1735870409146.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 02:13:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/104016#M41636</guid>
      <dc:creator>nikhil_kumawat</dc:creator>
      <dc:date>2025-01-03T02:13:52Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to retain precision while reading data from source file</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/104048#M41649</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/139677"&gt;@nikhil_kumawat&lt;/a&gt;&amp;nbsp;can you provide more details so that we can reproduce this and help you better, e.g. a sample data set, the DBR version, reproducer code, etc.?&lt;/P&gt;
&lt;P&gt;I'm using this sample data:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;csv_content = """column1,column2,litre_val,another_decimal_column
1,TypeA,60211.952,12.3459
2,TypeB,59164.608,45.6789
3,TypeC,12345.678,78.9012
"""&lt;/LI-CODE&gt;
&lt;P&gt;I then store it as a CSV file in my DBFS temp location and read it back without a schema. Schema inference is not enabled either, so every column defaults to string:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Read the CSV file without specifying a schema
df = spark.read.format("csv").option("header", "true").load("/some/path/to/test_data.csv")&lt;/LI-CODE&gt;
&lt;P&gt;And when displaying it using:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;df.show(truncate=False)
df.printSchema()&lt;/LI-CODE&gt;
&lt;P&gt;I'm seeing the results as:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;+-------+-------+---------+----------------------+
|column1|column2|litre_val|another_decimal_column|
+-------+-------+---------+----------------------+
|1      |TypeA  |60211.952|12.3459               |
|2      |TypeB  |59164.608|45.6789               |
|3      |TypeC  |12345.678|78.9012               |
+-------+-------+---------+----------------------+

root
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- litre_val: string (nullable = true)
 |-- another_decimal_column: string (nullable = true)&lt;/LI-CODE&gt;
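&lt;P&gt;To rule out the source file itself, the raw text can also be inspected outside Spark. The sketch below uses only the Python standard library on the same sample rows. The helper name &lt;CODE&gt;max_decimal_places&lt;/CODE&gt; is hypothetical, and for a real file you would pass an open file handle instead of the in-memory string.&lt;/P&gt;

```python
import csv
import io

csv_content = """column1,column2,litre_val,another_decimal_column
1,TypeA,60211.952,12.3459
2,TypeB,59164.608,45.6789
3,TypeC,12345.678,78.9012
"""


def max_decimal_places(handle, column):
    """Return the largest number of digits after the decimal point in a column."""
    longest = 0
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        # partition() yields an empty fraction when there is no decimal point
        _whole, _dot, fraction = value.partition(".")
        longest = max(longest, len(fraction))
    return longest


print(max_decimal_places(io.StringIO(csv_content), "litre_val"))  # 3
print(max_decimal_places(io.StringIO(csv_content), "another_decimal_column"))  # 4
```

&lt;P&gt;If the raw file already shows fewer decimal places than expected, the values were shortened before Spark ever read them.&lt;/P&gt;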
&lt;P&gt;Using display() does not alter the results:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-01-03 at 11.20.27.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13841i17F5415D4E64AEC5/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2025-01-03 at 11.20.27.png" alt="Screenshot 2025-01-03 at 11.20.27.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jan 2025 10:22:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-retain-precision-while-reading-data-from-source-file/m-p/104048#M41649</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2025-01-03T10:22:01Z</dc:date>
    </item>
  </channel>
</rss>

