topic Re: How to set the timestamp format when reading CSV in Data Engineering

How to set the timestamp format when reading CSV

EmilianoParizz1 — Thu, 09 May 2019 18:24:29 GMT

I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11.

I'm trying to parse a CSV file with a custom timestamp format but I don't know which datetime pattern format Spark uses.

My CSV looks like this:

Timestamp, Name, Value  
02/07/2019 14:51:32.869-08:00, BatteryA, 0.25  
02/07/2019 14:55:45.343-08:00, BatteryB, 0.50  
02/07/2019 14:58:25.845-08:00, BatteryC, 0.34

I'm executing the following to read it:val csvDataFrame = sqlContext.read.format("csv") .option("header", "true") .option("treatEmptyValuesAsNulls", "true") .option("inferSchema", "true") .option("mode","DROPMALFORMED") .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSZZ") .load("path/to/file.csv")

csvDataFrame.printSchema()

But no matter what timestamp pattern I use, the first column is always inferred as string.

csvDataFrame:org.apache.spark.sql.DataFrame
  Timestamp:string
  Name:string
  Value:double

I'm not a Java/Scala developer and I'm new to Spark/Databricks. I can't find anywhere which datetime formatter does Spark use to parse the values.

Re: How to set the timestamp format when reading CSV

mekkinen — Tue, 16 Jul 2019 07:14:52 GMT

At least based on the Pyspark documentation: (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) it it stated that:

dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at
```
java.text.SimpleDateFormat
```
. This applies to date type. If None is set, it uses the default value,
```
yyyy-MM-dd
```
.
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at
```
java.text.SimpleDateFormat
```
. This applies to timestamp type. If None is set, it uses the default value,
```
yyyy-MM-dd'T'HH:mm:ss.SSSXXX
```
.

I would imagine that these were the same in the case of writing scala.

Re: How to set the timestamp format when reading CSV

DonatienTessier — Tue, 16 Jul 2019 12:20:29 GMT

Hi @Emiliano Parizzi,

You could parsed the timestamp after loading the file with using the withColumn (cf. https://stackoverflow.com/questions/39088473/pyspark-dataframe-convert-unusual-string-format-to-timestamp).

from pyspark.sql import Row from pyspark.sql.functions import to_timestamp

(sc .parallelize([Row(dt='02/07/2019 14:51:32.869-08:00')]) .toDF() .withColumn("parsed", to_timestamp("dt", "MM/dd/yyyy HH:mm:ss.SSSXXX")) .show(1, False))

+-----------------------------+-------------------+ |dt |parsed | +-----------------------------+-------------------+ |02/07/2019 14:51:32.869-08:00|2019-02-07 22:51:32| +-----------------------------+-------------------+

Re: How to set the timestamp format when reading CSV

SteveDocherty — Fri, 26 Jul 2019 19:53:55 GMT

# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format [and an extra column to be used for partitioning; this part is optional] csvSchema = StructType([ StructField("Timestamp",TimestampType(),True), StructField("Name",StringType(),True), StructField("Value",DoubleType(),True) ])

df = spark.read \ .csv(file_path, header = True, multiLine = True, escape = "\"", schema = csvSchema, timestampFormat = "MM/dd/yyyy HH:mm:ss.SSSZZ" ) \ .withColumn("year", date_format(col("Timestamp"), "yyyy").cast(IntegerType())) \ .withColumn("month", date_format(col("Timestamp"), "MM").cast(IntegerType()))

display(df)

Re: How to set the timestamp format when reading CSV

wellington72019 — Tue, 13 Aug 2019 06:46:52 GMT

# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format....

<a href="http://thestoreguide.co.nz/auckland/orewa/mcdonalds-orewa-akl-0931/">McDonald’s in Orewa</a>