Data Engineering

by andrew0117 • Contributor

01-05-2023 12:03:56 PM

5639 Views
6 replies
2 kudos

Index a DataFrame from a CSV file based on the file's original order (based on the entire row, not on any specific column) using Spark

How can I guarantee that the index always follows the file's original order, no matter what? Currently, I'm using val df = spark.read.options(Map("header" -> "true", "inferSchema" -> "true")).csv("filePath").withColumn("index", monotonically_increasing...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-05-2023 1:39:33 PM

2 kudos

monotonically_increasing_id will not guarantee that, as its purpose is to guarantee that every partition gets separate IDs. What is the whole code? Are you loading a directory with a lot of CSVs? What does "original order" mean? Is it CSVs ordered by file creation date, or by file name? o...
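
To illustrate the point about per-partition IDs: the Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The snippet below is a plain-Python simulation of that layout, not Spark code. The function name simulated_ids and the partition sizes are made up for illustration.

```python
def simulated_ids(partition_sizes):
    """Mimic the documented ID layout: partition index in the upper bits,
    record number within the partition in the lower 33 bits."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_index << 33) + record_number)
    return ids


# Hypothetical example: a file split into two partitions of 3 and 2 rows.
ids = simulated_ids([3, 2])
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

The IDs are unique and increasing, but they jump at each partition boundary, so they are not a consecutive row index. Whether they follow the file's order depends on how the rows were assigned to partitions, which this sketch does not model. If a consecutive index is needed, RDD.zipWithIndex is one option to look at; it orders by partition index first and then by position within each partition, so the same caveat about partitioning applies.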

by Callum • New Contributor II

12-01-2022 7:05:53 AM

13072 Views
3 replies
2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

I have this code for merging DataFrames with PySpark pandas, and I want the index of the left DataFrame to persist throughout the joins. Following suggestions from others who wanted to keep the index after merging, I set the index to a column bef...

Latest Reply

Serlal
New Contributor III

01-31-2023 3:01:12 AM

2 kudos

Hi! I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function, and after executing that...

Databricks Community
