I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11.
I'm trying to parse a CSV file with a custom timestamp format, but I don't know which datetime pattern syntax Spark expects.
My CSV looks like this:
Timestamp, Name, Value
02/07/2019 14:51:32.869-08:00, BatteryA, 0.25
02/07/2019 14:55:45.343-08:00, BatteryB, 0.50
02/07/2019 14:58:25.845-08:00, BatteryC, 0.34
I'm executing the following to read it:

val csvDataFrame = sqlContext.read.format("csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSZZ")
  .load("path/to/file.csv")
csvDataFrame.printSchema()
But no matter what timestamp pattern I use, the first column is always inferred as string.
csvDataFrame:org.apache.spark.sql.DataFrame
Timestamp:string
Name:string
Value:double
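For what it's worth, this is the kind of explicit conversion I would fall back to after reading the column as a string, but it only underlines my question: I'm guessing at the pattern syntax here, and the XXX for the -08:00 offset is purely an assumption on my part:

import org.apache.spark.sql.functions.{col, to_timestamp}

// Sketch only: cast the string column explicitly after the read.
// "MM/dd/yyyy HH:mm:ss.SSSXXX" is my guess at the pattern; XXX is meant to
// match an offset written as -08:00, but that's exactly what I'm unsure about.
val withTimestamp = csvDataFrame.withColumn(
  "Timestamp",
  to_timestamp(col("Timestamp"), "MM/dd/yyyy HH:mm:ss.SSSXXX")
)
withTimestamp.printSchema()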
I'm not a Java/Scala developer and I'm new to Spark/Databricks. I can't find documentation anywhere on which datetime formatter Spark uses to parse these values.