01-05-2023 12:03 PM
How can I guarantee that the index always follows the file's original order, no matter what? Currently I'm using val df = spark.read.options(Map("header" -> "true", "inferSchema" -> "true")).csv("filePath").withColumn("index", monotonically_increasing_id()) .
Thanks!
01-05-2023 01:39 PM
monotonically_increasing_id will not guarantee that; it only guarantees that every partition gets its own distinct, increasing ids. What is the whole code? Do you load a directory with a lot of CSVs? What does "original order" mean: CSVs ordered by file creation date, by file name, or just by row number within a single file? It is doable, but you need to provide more details.
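For illustration, a minimal sketch (assuming the usual spark session available in a notebook) showing why the generated ids are unique but say nothing about row order: each partition starts its own id range.

```scala
import org.apache.spark.sql.functions.{monotonically_increasing_id, spark_partition_id}

// Spread a handful of rows over several partitions and inspect the ids
// that monotonically_increasing_id assigns to each of them.
val demo = spark.range(0, 8).repartition(4)
  .withColumn("partition", spark_partition_id())
  .withColumn("id", monotonically_increasing_id())

demo.orderBy("id").show()
// Ids are consecutive within a partition, but each partition starts at
// partitionId << 33, so the global sequence has large gaps and does not
// reflect the original order of the rows.
```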
01-05-2023 03:42 PM
Thank you for the reply.
It is just a single CSV file with thousands or millions of rows, but there is no timestamp, row number, or anything else to tell which row has the newest data. The situation is that if the primary key (a combination of two columns; the file has more than 20 columns) happens to have duplicates by mistake, I need to keep only the newest record. The original order here means the order in which the rows appear when the file is opened with any app; the last row in that order is considered the newest data.
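If an index column that really reflects the original order can be produced, the dedup itself is straightforward with a window. A sketch, assuming "index" follows the file order and with key1/key2 standing in for the two key columns:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep only the last occurrence of each (key1, key2) pair,
// where "last" is defined by the largest value of "index".
val w = Window.partitionBy("key1", "key2").orderBy(col("index").desc)

val deduped = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
```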
01-06-2023 01:29 AM
I cannot find my code, but I remember using spark.read().text("file_name") and then manipulating the file (explode etc.) to get the lines in the correct order. Of course it will be slower, and since the whole file goes into one cell it has memory limits, as it is processed by a single worker, so the file has to be smaller than the RAM on the worker.
There is no Spark function that exposes a row's position in the source file (Spark splits everything into partitions and works on chunks), so other solutions are not 100% guaranteed.
If the file is really big, or as an alternative, you need to add an ID inside the file itself.
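A rough sketch of that "whole file in one cell" approach, assuming the wholetext option of the text source so the file lands in a single row and posexplode yields the line number; the path and column names are placeholders, and the CSV fields would still have to be parsed out of each line afterwards (e.g. with split or from_csv):

```scala
import org.apache.spark.sql.functions.{col, posexplode, split}

// "wholetext" loads the entire file into a single row on one worker, so the
// position produced by posexplode matches the line number in the source file.
val lines = spark.read
  .option("wholetext", "true")
  .text("dbfs:/path/to/file.csv")
  .select(posexplode(split(col("value"), "\n")).as(Seq("line_no", "line")))
```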
01-06-2023 10:09 AM
This file is dropped by an end user into Azure Blob Storage on a weekly basis, and its size might vary dramatically. I will process it through an Azure Databricks notebook called by an Azure Data Factory pipeline, in which I can set up the cluster configuration for ADB. So, if I set the number of worker nodes to 1, would that guarantee that the index I add with the monotonically_increasing_id() function aligns with the file's original order, performance aside? Thanks!
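A single worker node does not by itself control how many partitions the read produces (one worker can still split the file across several cores), so it is probably safer not to rely on cluster size. One alternative sketch, under the assumption that for a single CSV file the input partitions come back in file order, numbers the rows with zipWithIndex on the underlying RDD; the path is a placeholder, and this ordering assumption should be validated on your data rather than taken as a documented guarantee:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .csv("dbfs:/path/to/file.csv")

// zipWithIndex numbers rows by partition index and by position within each
// partition, without any shuffle of the data.
val withIndex = spark.createDataFrame(
  df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
)
```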
04-04-2023 08:18 AM
Hi @andrew li
Were you able to solve this one? I have the exact same scenario.
Thx
04-09-2023 07:38 PM
Not really. Eventually we decided to fail the whole process and send a notification to the end user to have them do the dedup.