How to set the timestamp format when reading CSV
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-09-2019 11:24 AM
I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11.
I'm trying to parse a CSV file with a custom timestamp format but I don't know which datetime pattern format Spark uses.My CSV looks like this:Timestamp, Name, Value
02/07/2019 14:51:32.869-08:00, BatteryA, 0.25
02/07/2019 14:55:45.343-08:00, BatteryB, 0.50
02/07/2019 14:58:25.845-08:00, BatteryC, 0.34
I'm executing the following to read it:val csvDataFrame = sqlContext.read.format("csv") .option("header", "true") .option("treatEmptyValuesAsNulls", "true") .option("inferSchema", "true") .option("mode","DROPMALFORMED") .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSZZ") .load("path/to/file.csv")
csvDataFrame.printSchema()
But no matter what timestamp pattern I use, the first column is always inferred as string.csvDataFrame:org.apache.spark.sql.DataFrame
Timestamp:string
Name:string
Value:double
I'm not a Java/Scala developer and I'm new to Spark/Databricks. I can't find anywhere which datetime formatter does Spark use to parse the values.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-16-2019 12:14 AM
At least based on the Pyspark documentation: (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) it it stated that:
"
- dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at
. This applies to date type. If None is set, it uses the default value,java.text.SimpleDateFormat
.yyyy-MM-dd
- timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at
. This applies to timestamp type. If None is set, it uses the default value,java.text.SimpleDateFormat
.yyyy-MM-dd'T'HH:mm:ss.SSSXXX
"
I would imagine that these were the same in the case of writing scala.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-16-2019 05:20 AM
Hi @Emiliano Parizzi,
You could parsed the timestamp after loading the file with using the withColumn (cf. https://stackoverflow.com/questions/39088473/pyspark-dataframe-convert-unusual-string-format-to-time....
from pyspark.sql import Row from pyspark.sql.functions import to_timestamp
(sc .parallelize([Row(dt='02/07/2019 14:51:32.869-08:00')]) .toDF() .withColumn("parsed", to_timestamp("dt", "MM/dd/yyyy HH:mm:ss.SSSXXX")) .show(1, False))
+-----------------------------+-------------------+ |dt |parsed | +-----------------------------+-------------------+ |02/07/2019 14:51:32.869-08:00|2019-02-07 22:51:32| +-----------------------------+-------------------+
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-26-2019 12:53 PM
# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format [and an extra column to be used for partitioning; this part is optional] csvSchema = StructType([ StructField("Timestamp",TimestampType(),True), StructField("Name",StringType(),True), StructField("Value",DoubleType(),True) ])
df = spark.read \ .csv(file_path, header = True, multiLine = True, escape = "\"", schema = csvSchema, timestampFormat = "MM/dd/yyyy HH:mm:ss.SSSZZ" ) \ .withColumn("year", date_format(col("Timestamp"), "yyyy").cast(IntegerType())) \ .withColumn("month", date_format(col("Timestamp"), "MM").cast(IntegerType()))
display(df)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-12-2019 11:46 PM
# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format....
<a href="http://thestoreguide.co.nz/auckland/orewa/mcdonalds-orewa-akl-0931/">McDonald’s in Orewa</a>
![](/skins/images/582998B45490C7019731A5B3A872C751/responsive_peak/images/icon_anonymous_message.png)
![](/skins/images/582998B45490C7019731A5B3A872C751/responsive_peak/images/icon_anonymous_message.png)