I have the following code, which reads a stream of data, processes each batch in foreachBatch, and writes the result to the provided path, as shown below.
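
    import java.util.concurrent.TimeoutException;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.OutputMode;
    import org.apache.spark.sql.streaming.StreamingQueryException;
    import org.apache.spark.sql.streaming.Trigger;

    // Note: the `profile` parameter was cut off in the original paste; its type (String) is assumed.
    public static void writeToDatalake(SparkSession session, Configuration config, Dataset<Row> data, Entity entity, String profile) throws TimeoutException, StreamingQueryException {
        String writePath = getWritePath(config, entity, profile);
        log.info(writePath);
        data.writeStream()
            .outputMode(OutputMode.Update())
            .format("delta")
            .foreachBatch(mergeData(session, config, entity, writePath))
            .option("checkpointLocation", writePath + CHECKPOINT_PATH)
            .trigger(Trigger.Once())
            .start()
            .awaitTermination();
    }
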
In the merge function, I check whether a Delta table exists at the path. If it does not, I create a new one and write the data. If it does exist, I write to it directly, as shown below.
I use foreachBatch to register a custom function (mergeData) that runs for each micro-batch produced by the streaming query. The mergeData function is responsible for merging the current batch with the existing data in the Delta table.
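
    import java.util.ArrayList;
    import io.delta.tables.DeltaTable;
    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;
    import scala.collection.JavaConverters;

    public static VoidFunction2<Dataset<Row>, Long> mergeData(SparkSession session, Configuration config, Entity entity, String writePath) {
        return (data, batchid) -> {
            // Check whether a Delta table (i.e. a transaction log) already exists at writePath.
            boolean exists = DeltaTable.isDeltaTable(writePath);
            if (!exists) {
                // Create an empty, partitioned Delta table with the same schema as the batch.
                Dataset<Row> emptyDF = session.createDataFrame(new ArrayList<>(), data.schema());
                emptyDF.write()
                    .format("delta")
                    .mode(SaveMode.Overwrite)
                    .partitionBy(
                        JavaConverters.asScalaBuffer(config.getReadBlobStorage().get(entity.getSource()).getPartition()))
                    .save(writePath);
                log.info("created Empty df");
                emptyDF.unpersist();
                writeData(writePath, entity, data, config);
                data.unpersist();
            }
        };
    }
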
This should create the _delta_log folder in the blob store if it doesn't exist, but that does not happen when I run this on Databricks. I end up with the error shown below:
> com.databricks.sql.transaction.tahoe.DeltaAnalysisException:
> Incompatible format detected. You are trying to write to
> `abfss://<path>/`
> using Delta, but there is no transaction log present. Check the
> upstream job to make sure that it is writing using format("delta") and
> that you are trying to write to the table base path.
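
For debugging, one way to confirm whether the transaction log mentioned in the error is actually present is to look for a `_delta_log` directory directly under the table base path. The following is only a minimal sketch for a local filesystem path using `java.nio`; the class `DeltaLogCheck` and method `hasDeltaLog` are hypothetical names, not part of Delta Lake or Spark, and an `abfss://` path would need a different filesystem client, which is not shown here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not a Delta Lake or Spark API.
final class DeltaLogCheck {
    private DeltaLogCheck() {}

    // A Delta table keeps its transaction log in a "_delta_log" directory
    // directly under the table base path.
    static boolean hasDeltaLog(Path tableBasePath) {
        return Files.isDirectory(tableBasePath.resolve("_delta_log"));
    }

    public static void main(String[] args) {
        // Assumption: the path is on the local filesystem (defaults to the current directory).
        Path base = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(base + " has _delta_log: " + hasDeltaLog(base));
    }
}
```

If this kind of check reports no `_delta_log` at the base path after the batch has run, the empty-table write in mergeData most likely never executed or targeted a different path than the one the stream writes to.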