Index a DataFrame from a CSV file based on the file's original order (not based on any specific column, based on the entire row) using Spark

andrew0117
Contributor

How can I guarantee that the index always follows the file's original order, no matter what? Currently, I'm using:

val df = spark.read.options(Map("header" -> "true", "inferSchema" -> "true")).csv("filePath").withColumn("index", monotonically_increasing_id())

Thanks!

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

monotonically_increasing_id will not work, because it only guarantees that every partition gets a separate range of ids. What is the whole code? Do you load a directory with many CSVs? What does "original order" mean? Are the CSVs ordered by file creation date or by file name, or is it just the row order within a single file? It is doable, but you need to provide more details.
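To illustrate the partition behaviour described above, here is a small sketch of the values monotonically_increasing_id() produces (assuming a SparkSession named spark, as in a Databricks notebook):

import org.apache.spark.sql.functions.monotonically_increasing_id

// spark.range gives a column named "id"; repartition(3) spreads the rows
// over three partitions, so the generated values jump between partitions.
val sample = spark.range(0, 6).repartition(3)
sample.withColumn("mono_id", monotonically_increasing_id()).show()
// mono_id is consecutive within each partition, but every partition starts at
// partitionIndex << 33 (0, 8589934592, 17179869184, ...), so it is not a
// row number for the whole DataFrame.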

Thank you for the reply.

It is just a single CSV file with thousands or millions of rows, but there is no timestamp, row number, or anything else to tell which row has the newest data. The situation is that if the primary key (a combination of two columns; the file has more than 20 columns) happens to have duplicates by mistake, I need to keep only the newest record. The original order here means the order in which the rows appear when the file is opened with any app. The last row in that original order is considered the newest data.
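As an aside, once a trustworthy ordering column exists, the dedup itself is a standard window-function pattern. A minimal sketch, with placeholder key columns keyA and keyB and an assumed index column that reflects the original file order:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// "keyA" and "keyB" stand in for the real two-column primary key;
// dfWithIndex is assumed to already carry an "index" column where a
// larger index means a newer row.
def keepNewest(dfWithIndex: DataFrame): DataFrame = {
  val w = Window.partitionBy("keyA", "keyB").orderBy(col("index").desc)
  dfWithIndex
    .withColumn("rn", row_number().over(w)) // rn = 1 marks the newest row per key
    .filter(col("rn") === 1)
    .drop("rn")
}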

Hubert-Dudek
Esteemed Contributor III

I cannot find my code, but I remember using spark.read().text("file_name") and then manipulating the file (explode, etc.) to get the lines in the correct order. Of course, it will be slower, and since the whole file goes into one cell, it has memory limits, because it passes through a single worker. So the file has to be smaller than the RAM on the worker.
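A minimal sketch of that read-the-file-as-text idea (not the original code), assuming Spark 2.2+ for the wholetext option, a single CSV file small enough to fit on one worker, and a placeholder path; posexplode supplies each line's position within the file:

import org.apache.spark.sql.functions._

// Read the whole file as a single row (one cell), then split it into lines.
// posexplode returns (position, line), so "index" is the line's position in the file.
val raw = spark.read
  .option("wholetext", "true") // one row per file; the file must fit in memory
  .text("filePath")            // placeholder path

val lines = raw
  .select(posexplode(split(col("value"), "\r?\n")).as(Seq("index", "line")))
  .filter(length(trim(col("line"))) > 0) // drop trailing empty lines

// The header is the line with index 0; the remaining lines can then be parsed,
// e.g. with from_csv (Spark 3.0+) or by splitting on the delimiter.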

There is no Spark function that exposes a row's position in the source file (Spark splits everything into partitions and works on chunks), so other solutions are not 100% guaranteed.

If the file is really big, or as an alternative, you need to add an ID inside the file itself.

This file is dropped by an end user into Azure Blob Storage on a weekly basis, and the size might vary dramatically. I will process it through an Azure Databricks notebook called by an Azure Data Factory pipeline, in which I can set up the cluster configuration for ADB. So, if I set the number of worker nodes to 1, would that guarantee that the index I add with the monotonically_increasing_id() function aligns with the file's original order, leaving performance aside? Thanks!
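For reference, a single worker node does not by itself mean a single partition. A commonly suggested (but not API-guaranteed) workaround is to coalesce to one partition before assigning the id, so the values come out consecutive; with a single input file this usually matches the file order in practice. A hedged sketch:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Not a guarantee from the Spark API: coalesce(1) avoids a shuffle and, for a
// single input file, typically keeps rows in the order they were read.
// With exactly one partition, monotonically_increasing_id() yields 0, 1, 2, ...
val dfOrdered = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .csv("filePath")             // placeholder path
  .coalesce(1)
  .withColumn("index", monotonically_increasing_id())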

T-Rex
New Contributor II

Hi @andrew li

Were you able to solve this one? I have the exact same scenario.

Thx

Not really. Eventually we decided to fail the whole process and send a notification to the end user to have them do the dedup.
