Index a DataFrame from a CSV file based on the file's original order (not based on any specific column, based on the entire row) using Spark

andrew0117
Contributor

How can I guarantee that the index always follows the file's original order, no matter what? Currently, I'm using:

val df = spark.read.options(Map("header" -> "true", "inferSchema" -> "true")).csv("filePath").withColumn("index", monotonically_increasing_id())

Thanks!

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

monotonically_increasing_id will not work, because it only guarantees that every partition gets a separate range of ids. What is the whole code? Do you load a directory with many CSVs? What does "original order" mean? Are the CSVs ordered by file creation date or by file name, or is it just the row order within a single file? It is doable, but you need to provide more details.
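To illustrate the partition behaviour described above, here is a small sketch of the values monotonically_increasing_id() produces (assuming a SparkSession named spark, as in a Databricks notebook):

import org.apache.spark.sql.functions.monotonically_increasing_id

// spark.range gives a column named "id"; repartition(3) spreads the rows
// over three partitions, so the generated values jump between partitions.
val sample = spark.range(0, 6).repartition(3)
sample.withColumn("mono_id", monotonically_increasing_id()).show()
// mono_id is consecutive within each partition, but every partition starts at
// partitionIndex << 33 (0, 8589934592, 17179869184, ...), so it is not a
// row number for the whole DataFrame.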

Thank you for the reply.

It is just a single CSV file with thousands or millions of rows, but there is no timestamp, row number, or anything else to tell which row has the newest data. The situation is that if the primary key (a combination of two columns; the file has more than 20 columns) happens to have duplicates by mistake, I need to keep only the newest record. The original order here means the order in which the rows appear when the file is opened with any app. The last row in that original order is considered the newest data.
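As an aside, once a trustworthy ordering column exists, the dedup itself is a standard window-function pattern. A minimal sketch, with placeholder key columns keyA and keyB and an assumed index column that reflects the original file order:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// "keyA" and "keyB" stand in for the real two-column primary key;
// dfWithIndex is assumed to already carry an "index" column where a
// larger index means a newer row.
def keepNewest(dfWithIndex: DataFrame): DataFrame = {
  val w = Window.partitionBy("keyA", "keyB").orderBy(col("index").desc)
  dfWithIndex
    .withColumn("rn", row_number().over(w)) // rn = 1 marks the newest row per key
    .filter(col("rn") === 1)
    .drop("rn")
}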

Hubert-Dudek
Esteemed Contributor III

I cannot find my code, but I remember using spark.read().text("file_name") and then manipulating the file (explode, etc.) to get the lines in the correct order. Of course, it will be slower, and since the whole file goes into one cell, it has memory limits, because it passes through a single worker. So the file has to be smaller than the RAM on the worker.
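A minimal sketch of that read-the-file-as-text idea (not the original code), assuming Spark 2.2+ for the wholetext option, a single CSV file small enough to fit on one worker, and a placeholder path; posexplode supplies each line's position within the file:

import org.apache.spark.sql.functions._

// Read the whole file as a single row (one cell), then split it into lines.
// posexplode returns (position, line), so "index" is the line's position in the file.
val raw = spark.read
  .option("wholetext", "true") // one row per file; the file must fit in memory
  .text("filePath")            // placeholder path

val lines = raw
  .select(posexplode(split(col("value"), "\r?\n")).as(Seq("index", "line")))
  .filter(length(trim(col("line"))) > 0) // drop trailing empty lines

// The header is the line with index 0; the remaining lines can then be parsed,
// e.g. with from_csv (Spark 3.0+) or by splitting on the delimiter.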

There is no Spark function that exposes a row's position in the source file (Spark splits everything into partitions and works on chunks), so other solutions are not 100% guaranteed.

If the file is really big, or as an alternative, you need to add an ID inside the file itself.

This file is dropped by an end user into Azure Blob Storage on a weekly basis, and the size might vary dramatically. I will process it through an Azure Databricks notebook called by an Azure Data Factory pipeline, in which I can set up the cluster configuration for ADB. So, if I set the number of worker nodes to 1, would that guarantee that the index I add with the monotonically_increasing_id() function aligns with the file's original order, leaving performance aside? Thanks!
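For reference, a single worker node does not by itself mean a single partition. A commonly suggested (but not API-guaranteed) workaround is to coalesce to one partition before assigning the id, so the values come out consecutive; with a single input file this usually matches the file order in practice. A hedged sketch:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Not a guarantee from the Spark API: coalesce(1) avoids a shuffle and, for a
// single input file, typically keeps rows in the order they were read.
// With exactly one partition, monotonically_increasing_id() yields 0, 1, 2, ...
val dfOrdered = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .csv("filePath")             // placeholder path
  .coalesce(1)
  .withColumn("index", monotonically_increasing_id())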

T-Rex
New Contributor II

Hi @andrew li

Were you able to solve this one? I have the exact same scenario.

Thx

Not really. Eventually we decided to fail the whole process and send a notification to the end user to have them do the dedup.
