<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to set the timestamp format when reading CSV in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28052#M19890</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format....&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 13 Aug 2019 06:46:52 GMT</pubDate>
    <dc:creator>wellington72019</dc:creator>
    <dc:date>2019-08-13T06:46:52Z</dc:date>
    <item>
      <title>How to set the timestamp format when reading CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28048#M19886</link>
      <description>&lt;P&gt;I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11.&lt;/P&gt;
&lt;P&gt;I'm trying to parse a CSV file with a custom timestamp format, but I don't know which datetime pattern format Spark uses.&lt;/P&gt;
&lt;P&gt;My CSV looks like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;Timestamp, Name, Value  
02/07/2019 14:51:32.869-08:00, BatteryA, 0.25  
02/07/2019 14:55:45.343-08:00, BatteryB, 0.50  
02/07/2019 14:58:25.845-08:00, BatteryC, 0.34
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I'm executing the following to read it:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val csvDataFrame = sqlContext.read.format("csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSZZ")
  .load("path/to/file.csv")

csvDataFrame.printSchema()
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;But no matter what timestamp pattern I use, the first column is always inferred as string.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;csvDataFrame:org.apache.spark.sql.DataFrame
  Timestamp:string
  Name:string
  Value:double
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I'm not a Java/Scala developer and I'm new to Spark/Databricks. I can't find any documentation of which datetime formatter Spark uses to parse the values.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 May 2019 18:24:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28048#M19886</guid>
      <dc:creator>EmilianoParizz1</dc:creator>
      <dc:date>2019-05-09T18:24:29Z</dc:date>
    </item>
    <item>
      <title>Re: How to set the timestamp format when reading CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28049#M19887</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;At least based on the &lt;B&gt;PySpark&lt;/B&gt; documentation (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader), it is stated that:&lt;/P&gt;
&lt;P&gt;"&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;&lt;B&gt;dateFormat&lt;/B&gt; – sets the string that indicates a date format. Custom date formats follow the formats at &lt;CODE&gt;java.text.SimpleDateFormat&lt;/CODE&gt;. This applies to date type. If None is set, it uses the default value, &lt;CODE&gt;yyyy-MM-dd&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;B&gt;timestampFormat&lt;/B&gt; – sets the string that indicates a timestamp format. Custom date formats follow the formats at &lt;CODE&gt;java.text.SimpleDateFormat&lt;/CODE&gt;. This applies to timestamp type. If None is set, it uses the default value, &lt;CODE&gt;yyyy-MM-dd'T'HH:mm:ss.SSSXXX&lt;/CODE&gt;.&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;"&lt;/P&gt;
&lt;P&gt;I would imagine that these are the same when writing Scala.&lt;/P&gt;
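&lt;P&gt;As a rough cross-check outside Spark, the sample value from the question can be parsed with Python's standard library. This is only a sketch: the strptime directive string and the helper name parse_to_utc are assumptions made for illustration, chosen to mirror the Java-style pattern MM/dd/yyyy HH:mm:ss.SSSXXX, and they are not Spark options.&lt;/P&gt;

```python
from datetime import datetime, timezone

# Sample value copied from the question's CSV.
raw = "02/07/2019 14:51:32.869-08:00"

# Python strptime directives standing in for the Java-style pattern
# "MM/dd/yyyy HH:mm:ss.SSSXXX". This mapping is an assumption for
# illustration only; Spark does not use strptime.
PY_FORMAT = "%m/%d/%Y %H:%M:%S.%f%z"


def parse_to_utc(value):
    """Parse an offset-aware timestamp string and normalise it to UTC."""
    # %z accepts offsets written with a colon (such as -08:00) on Python 3.7+.
    return datetime.strptime(value.strip(), PY_FORMAT).astimezone(timezone.utc)


parsed = parse_to_utc(raw)
print(parsed.isoformat())  # 2019-02-07T22:51:32.869000+00:00
```

&lt;P&gt;The -08:00 offset moves 14:51:32 local time to 22:51:32 UTC, which is a useful value to compare against whatever Spark returns once the column is read as a timestamp.&lt;/P&gt;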
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jul 2019 07:14:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28049#M19887</guid>
      <dc:creator>mekkinen</dc:creator>
      <dc:date>2019-07-16T07:14:52Z</dc:date>
    </item>
    <item>
      <title>Re: How to set the timestamp format when reading CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28050#M19888</link>
      <description>&lt;P&gt;Hi @Emiliano Parizzi,&lt;/P&gt;
&lt;P&gt;You could parse the timestamp after loading the file by using withColumn (cf. &lt;A href="https://stackoverflow.com/questions/39088473/pyspark-dataframe-convert-unusual-string-format-to-timestamp" target="test_blank"&gt;https://stackoverflow.com/questions/39088473/pyspark-dataframe-convert-unusual-string-format-to-timestamp&lt;/A&gt;).&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='02/07/2019 14:51:32.869-08:00')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "MM/dd/yyyy HH:mm:ss.SSSXXX"))
    .show(1, False))

+-----------------------------+-------------------+
|dt                           |parsed             |
+-----------------------------+-------------------+
|02/07/2019 14:51:32.869-08:00|2019-02-07 22:51:32|
+-----------------------------+-------------------+
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 16 Jul 2019 12:20:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28050#M19888</guid>
      <dc:creator>DonatienTessier</dc:creator>
      <dc:date>2019-07-16T12:20:29Z</dc:date>
    </item>
    <item>
      <title>Re: How to set the timestamp format when reading CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28051#M19889</link>
      <description>&lt;P&gt;In Python: explicitly define the schema, then read in the CSV data using the schema and a defined timestamp format (and extra columns to be used for partitioning; this part is optional).&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import col, date_format
from pyspark.sql.types import (StructType, StructField, TimestampType,
                               StringType, DoubleType, IntegerType)

# Declare the column types up front instead of relying on inferSchema.
csvSchema = StructType([
    StructField("Timestamp", TimestampType(), True),
    StructField("Name", StringType(), True),
    StructField("Value", DoubleType(), True)
])

# file_path is the location of the CSV file.
df = spark.read \
    .csv(file_path, header = True, multiLine = True, escape = "\"", schema = csvSchema, timestampFormat = "MM/dd/yyyy HH:mm:ss.SSSZZ") \
    .withColumn("year", date_format(col("Timestamp"), "yyyy").cast(IntegerType())) \
    .withColumn("month", date_format(col("Timestamp"), "MM").cast(IntegerType()))

display(df)
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 26 Jul 2019 19:53:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28051#M19889</guid>
      <dc:creator>SteveDocherty</dc:creator>
      <dc:date>2019-07-26T19:53:55Z</dc:date>
    </item>
    <item>
      <title>Re: How to set the timestamp format when reading CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28052#M19890</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format....&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Aug 2019 06:46:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-set-the-timestamp-format-when-reading-csv/m-p/28052#M19890</guid>
      <dc:creator>wellington72019</dc:creator>
      <dc:date>2019-08-13T06:46:52Z</dc:date>
    </item>
  </channel>
</rss>

