<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic When reading a csv file with Spark.read, the data is not loading in the appropriate column while pas in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66090#M33014</link>
    <description>&lt;P&gt;I am trying to read a CSV file from a storage location using the spark.read function, and I am explicitly passing the schema to the reader. However, the data is not loading into the proper columns of the DataFrame. Here are the code details:&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType&lt;/P&gt;&lt;P&gt;# Define the schema&lt;BR /&gt;schema = StructType([&lt;BR /&gt;StructField('TRANSACTION', StringType(), True),&lt;BR /&gt;StructField('FROM', StringType(), True),&lt;BR /&gt;StructField('TO', StringType(), True),&lt;BR /&gt;StructField('DA_RATE', DateType(), True),&lt;BR /&gt;StructField('CURNCY_F', StringType(), True),&lt;BR /&gt;StructField('CURNCY_T', StringType(), True)&lt;BR /&gt;])&lt;/P&gt;&lt;P&gt;# Read the CSV file with the specified schema&lt;BR /&gt;df = spark.read.format("csv") \&lt;BR /&gt;.option("header", "true") \&lt;BR /&gt;.option("delimiter", "|") \&lt;BR /&gt;.schema(schema) \&lt;BR /&gt;.load("abfss://xyz@abc.dfs.core.windows.net/my/2024-04-10/abc_2*.csv")&lt;/P&gt;&lt;P&gt;**Data in the CSV file**&lt;/P&gt;&lt;P&gt;DA_RATE|CURNCY_F|CURNCY_T&lt;BR /&gt;2024-02-26|AAA|MMM&lt;BR /&gt;2024-02-26|AAA|NNN&lt;BR /&gt;2024-02-26|BBB|YYY&lt;BR /&gt;2024-02-26|CCC|KKK&lt;BR /&gt;2024-02-27|DDD|SSS&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;**Output I am getting**&lt;/P&gt;&lt;P&gt;TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T&lt;BR /&gt;2024-02-26 AAA MMM null null null&lt;BR /&gt;2024-02-26 AAA NNN null null null&lt;BR /&gt;2024-02-26 BBB YYY null null null&lt;BR /&gt;2024-02-26 CCC KKK null null null&lt;/P&gt;&lt;P&gt;**Output I expected**&lt;/P&gt;&lt;P&gt;TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T&lt;BR /&gt;null null null 2024-02-26 AAA MMM&lt;BR /&gt;null null null 2024-02-26 AAA NNN&lt;BR /&gt;null null null 2024-02-26 BBB YYY&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 11 Apr 2024 19:21:45 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2024-04-11T19:21:45Z</dc:date>
    <item>
      <title>When reading a csv file with Spark.read, the data is not loading in the appropriate column while pas</title>
      <link>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66090#M33014</link>
      <description>&lt;P&gt;I am trying to read a CSV file from a storage location using the spark.read function, and I am explicitly passing the schema to the reader. However, the data is not loading into the proper columns of the DataFrame. Here are the code details:&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType&lt;/P&gt;&lt;P&gt;# Define the schema&lt;BR /&gt;schema = StructType([&lt;BR /&gt;StructField('TRANSACTION', StringType(), True),&lt;BR /&gt;StructField('FROM', StringType(), True),&lt;BR /&gt;StructField('TO', StringType(), True),&lt;BR /&gt;StructField('DA_RATE', DateType(), True),&lt;BR /&gt;StructField('CURNCY_F', StringType(), True),&lt;BR /&gt;StructField('CURNCY_T', StringType(), True)&lt;BR /&gt;])&lt;/P&gt;&lt;P&gt;# Read the CSV file with the specified schema&lt;BR /&gt;df = spark.read.format("csv") \&lt;BR /&gt;.option("header", "true") \&lt;BR /&gt;.option("delimiter", "|") \&lt;BR /&gt;.schema(schema) \&lt;BR /&gt;.load("abfss://xyz@abc.dfs.core.windows.net/my/2024-04-10/abc_2*.csv")&lt;/P&gt;&lt;P&gt;**Data in the CSV file**&lt;/P&gt;&lt;P&gt;DA_RATE|CURNCY_F|CURNCY_T&lt;BR /&gt;2024-02-26|AAA|MMM&lt;BR /&gt;2024-02-26|AAA|NNN&lt;BR /&gt;2024-02-26|BBB|YYY&lt;BR /&gt;2024-02-26|CCC|KKK&lt;BR /&gt;2024-02-27|DDD|SSS&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;**Output I am getting**&lt;/P&gt;&lt;P&gt;TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T&lt;BR /&gt;2024-02-26 AAA MMM null null null&lt;BR /&gt;2024-02-26 AAA NNN null null null&lt;BR /&gt;2024-02-26 BBB YYY null null null&lt;BR /&gt;2024-02-26 CCC KKK null null null&lt;/P&gt;&lt;P&gt;**Output I expected**&lt;/P&gt;&lt;P&gt;TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T&lt;BR /&gt;null null null 2024-02-26 AAA MMM&lt;BR /&gt;null null null 2024-02-26 AAA NNN&lt;BR /&gt;null null null 2024-02-26 BBB YYY&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Apr 2024 19:21:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66090#M33014</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-04-11T19:21:45Z</dc:date>
    </item>
    <item>
      <title>Re: When reading a csv file with Spark.read, the data is not loading in the appropriate column while</title>
      <link>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66208#M33053</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I noticed that the schema you are creating has more columns than your CSV file; I understood that the final result needs to include all 6 columns.&lt;/P&gt;&lt;P&gt;I would use withColumn for the 3 columns that do not exist in the file. Below is an example. Note that lit(None), rather than the string 'null', produces real null values, and that lit must be imported from pyspark.sql.functions:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.functions import lit&lt;BR /&gt;from pyspark.sql.types import StringType&lt;BR /&gt;&lt;BR /&gt;df = (&lt;BR /&gt;    spark.read&lt;BR /&gt;    .format("csv")&lt;BR /&gt;    .option("header", "true")&lt;BR /&gt;    .option("delimiter", "|")&lt;BR /&gt;    .load("pathFile")&lt;BR /&gt;    .withColumn("TRANSACTION", lit(None).cast(StringType()))&lt;BR /&gt;    .withColumn("FROM", lit(None).cast(StringType()))&lt;BR /&gt;    .withColumn("TO", lit(None).cast(StringType()))&lt;BR /&gt;)&lt;/PRE&gt;&lt;P&gt;Hope this helps&lt;/P&gt;</description>
      <pubDate>Sun, 14 Apr 2024 12:54:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66208#M33053</guid>
      <dc:creator>ThomazRossito</dc:creator>
      <dc:date>2024-04-14T12:54:50Z</dc:date>
    </item>
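    <!-- An illustrative, runnable sketch of the withColumn approach from the reply above. It assumes a local SparkSession, and uses a small in-memory DataFrame as a stand-in for the CSV read; lit(None) produces genuine nulls rather than the string "null".

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[1]").appName("add_null_cols").getOrCreate()

    # Stand-in for the CSV read: only the three columns that actually exist in the file.
    df = spark.createDataFrame(
        [("2024-02-26", "AAA", "MMM"), ("2024-02-26", "AAA", "NNN")],
        ["DA_RATE", "CURNCY_F", "CURNCY_T"],
    )

    # Add the three missing columns as real nulls, then reorder to the schema layout.
    out = (
        df.withColumn("TRANSACTION", lit(None).cast(StringType()))
          .withColumn("FROM", lit(None).cast(StringType()))
          .withColumn("TO", lit(None).cast(StringType()))
          .select("TRANSACTION", "FROM", "TO", "DA_RATE", "CURNCY_F", "CURNCY_T")
    )
    ```

    The .select() at the end restores the column order the original 6-field schema intended. -->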
    <item>
      <title>Re: When reading a csv file with Spark.read, the data is not loading in the appropriate column while</title>
      <link>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66215#M33054</link>
      <description>&lt;P&gt;Hi, I would suggest the approach recommended by Thomaz Rossito, but you could also try swapping the StructField order so that it matches the file, like the following:&lt;/P&gt;&lt;P&gt;schema = StructType([&lt;BR /&gt;StructField('DA_RATE', DateType(), True),&lt;BR /&gt;StructField('CURNCY_F', StringType(), True),&lt;BR /&gt;StructField('CURNCY_T', StringType(), True),&lt;BR /&gt;StructField('TRANSACTION', StringType(), True),&lt;BR /&gt;StructField('FROM', StringType(), True),&lt;BR /&gt;StructField('TO', StringType(), True)])&lt;/P&gt;&lt;P&gt;Spark applies a user-supplied CSV schema by position, not by header name, so the fields must be listed in the same order as the columns in the file. If you then want the column order defined in your original schema, you can adjust the DataFrame afterwards by using .select() with your preferred column order, giving you a new DataFrame.&lt;/P&gt;</description>
      <pubDate>Sun, 14 Apr 2024 16:31:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/when-reading-a-csv-file-with-spark-read-the-data-is-not-loading/m-p/66215#M33054</guid>
      <dc:creator>sai_sathya</dc:creator>
      <dc:date>2024-04-14T16:31:46Z</dc:date>
    </item>
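    <!-- An illustrative, runnable sketch of the schema-reordering approach from the reply above. It assumes a local SparkSession and writes a small pipe-delimited temp file mirroring the data in the question; because Spark's default PERMISSIVE mode fills fields that have no corresponding column with null, the three trailing fields come back as nulls, and .select() restores the originally intended column order.

    ```python
    import os
    import tempfile

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DateType

    spark = SparkSession.builder.master("local[1]").appName("schema_order").getOrCreate()

    # Write a small pipe-delimited file mirroring the data in the question.
    path = os.path.join(tempfile.mkdtemp(), "rates.csv")
    with open(path, "w") as f:
        f.write("DA_RATE|CURNCY_F|CURNCY_T\n"
                "2024-02-26|AAA|MMM\n"
                "2024-02-26|AAA|NNN\n")

    # Fields reordered to match the file; the last three have no corresponding
    # column in the file, so PERMISSIVE mode (the default) fills them with null.
    schema = StructType([
        StructField("DA_RATE", DateType(), True),
        StructField("CURNCY_F", StringType(), True),
        StructField("CURNCY_T", StringType(), True),
        StructField("TRANSACTION", StringType(), True),
        StructField("FROM", StringType(), True),
        StructField("TO", StringType(), True),
    ])

    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("delimiter", "|")
        .schema(schema)
        .load(path)
        .select("TRANSACTION", "FROM", "TO", "DA_RATE", "CURNCY_F", "CURNCY_T")
    )
    ```

    The same idea applies to the abfss:// path in the question; only the local temp file is a stand-in here. -->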
  </channel>
</rss>

