cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

andrew0117
by Contributor
  • 3830 Views
  • 6 replies
  • 2 kudos

index a dataframe from a csv file based on the file's original order (not based on any specific column, based on the entire row) using spark

how to guarantee the index is always following the file's original order no matter what. Currently, I'm using val df = spark.read.options(Map("header"-> "true", "inferSchema" -> "true")).csv("filePath").withColumn("index", monotonically_increasing...

  • 3830 Views
  • 6 replies
  • 2 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

monotonically_increasing_id will not as it is to guarantee that every partition has separate ids. What is the whole code? Do you load directory with a lot of CSVs? What "original order" means? Is it csvs ordered by file creation date, by file name? o...

  • 2 kudos
5 More Replies
Callum
by New Contributor II
  • 9880 Views
  • 3 replies
  • 2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

So, I have this code for merging dataframes with pyspark pandas. And I want the index of the left dataframe to persist throughout the joins. So following suggestions from others wanting to keep the index after merging, I set the index to a column bef...

  • 9880 Views
  • 3 replies
  • 2 kudos
Latest Reply
Serlal
New Contributor III
  • 2 kudos

Hi!I tried debugging your code and I think that the error you get is simply because the column exists in two instances of your dataframe within your loop.I tried adding some extra debug lines in your merge_dataframes function:and after executing that...

  • 2 kudos
2 More Replies
Labels