Dataframe Write Append to Parquet Table - Partition Issue

RobertWalsh
New Contributor II

Hello,

I am attempting to append new JSON files to an existing Parquet table defined in Databricks.

Using a dataset defined by this command (the raw DataFrame was first registered as a temp table):

val output = sql("""
  select
    headers.event_name,
    to_date(from_unixtime(headers.received_timestamp)) as dt,
    from_unixtime(headers.received_timestamp) as login_datetime,
    headers.ip_address,
    headers.acting_user_id
  from usersloggedInRaw_tmp
""")

I create the initial table with the following:

output.write.format("parquet").partitionBy("dt").saveAsTable("dev_sessions")

The output of this table looks like the following:

[screenshot: table output with values in the expected columns]
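One diagnostic I plan to add at this point is printing the stored schema, since my understanding (which may be wrong) is that partitionBy can move the partition column to the end of the table's stored column order:

```scala
// Inspect the column order Spark stored for the table; with partitionBy("dt")
// the partition column is typically placed last in the stored schema,
// which may differ from the order in the original select.
spark.table("dev_sessions").printSchema()
spark.table("dev_sessions").columns.foreach(println)
```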

If I try to append a new JSON file to the now-existing 'dev_sessions' table, using the following:

output.write.mode("append").format("parquet").partitionBy("dt").saveAsTable("dev_sessions")

Here is what I see:

[screenshot: table output after the append, with shifted column values]

The data seems to 'shift'. For example, the acting_user_id values now populate the 'dt' column, the column used in the append command to partition the data.

I have tried this flow multiple times and can reproduce the same result. Is this a bug in dataframe.write(), or am I making a mistake somewhere? Note that before appending to the table, I inspect the 'output' DataFrame in Databricks via the display() command and there are no issues - the values are in their expected columns. It is only after appending to the table with the write command that the problem appears.
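For reference, one workaround I am considering (just a sketch, based on my assumption that the append is resolving columns by position against the table's stored schema rather than by name) is to explicitly reorder the DataFrame's columns to match the existing table before appending:

```scala
import org.apache.spark.sql.functions.col

// Sketch of a possible workaround: align the DataFrame's column order
// with the existing table's stored schema before appending, in case the
// append resolves columns by position rather than by name.
// Assumes `spark` is the active SparkSession and `output` is the DataFrame above.
val tableCols = spark.table("dev_sessions").columns   // column order of the existing table
val aligned   = output.select(tableCols.map(col): _*) // reorder `output` to match

aligned.write
  .mode("append")
  .format("parquet")
  .partitionBy("dt")
  .saveAsTable("dev_sessions")
```

I have also seen suggestions to use insertInto("dev_sessions") instead of saveAsTable for appends to an existing table, though insertInto is position-based as well, so the column reordering above would still apply.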

Any help that can be provided would be sincerely appreciated.