Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by KKo, Contributor III
  • 11424 Views
  • 4 replies
  • 2 kudos

Resolved! Union Multiple dataframes in loop, with different schema

Within a loop I have a few dataframes created. I can union them without an issue if they have the same schema, using df_unioned = reduce(DataFrame.unionAll, df_list). Now my problem is how to union them if one of the dataframes in df_list has a different nu...

Latest Reply
anoopunni
New Contributor II
  • 2 kudos

Hi, I have come across the same scenario. Using reduce() and unionByName we can implement the solution as below:
val lstDF: List[DataFrame] = List(df1, df2, df3, df4, df5)
val combinedDF = lstDF.reduce((df1, df2) => df1.unionByName(df2, allowMissingColumns = true))

3 More Replies
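
For reference, a PySpark equivalent of the accepted Scala approach might look like this (a sketch assuming df_list holds the dataframes and Spark 3.1+ for allowMissingColumns):

from functools import reduce
from pyspark.sql import DataFrame

# unionByName with allowMissingColumns=True fills columns missing from
# either side with nulls, so the schemas do not need to match exactly
df_unioned = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    df_list,
)
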
by lmcglone, New Contributor II
  • 4384 Views
  • 2 replies
  • 3 kudos

Comparing 2 dataframes and create columns from values within a dataframe

Hi, I have a dataframe that has name and company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["company","name"]
data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

You need to join and pivot:
(df.join(df2, on=[df.company == df2.job_company])
   .groupBy("company", "name")
   .pivot("job_company")
   .count())

1 More Replies
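
A runnable sketch of that join-and-pivot pattern (df2 and job_company follow the reply; the sample rows are made up for illustration, since the post truncates its data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.createDataFrame(
    [("company1", "Jon"), ("company2", "Steve")], ["company", "name"])
df2 = spark.createDataFrame(
    [("company1",), ("company2",)], ["job_company"])

# one output column per distinct job_company value, counting the matches
result = (df.join(df2, on=df.company == df2.job_company)
            .groupBy("company", "name")
            .pivot("job_company")
            .count())
result.show()
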
by Ancil, Contributor II
  • 14099 Views
  • 11 replies
  • 1 kudos

Anyone please suggest how we can effectively loop through a PySpark DataFrame.

Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write files to the file path, with data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your dataframe is small, as you said, only about 1000 rows, you may consider using Pandas. Thanks.

10 More Replies
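
Following that advice, a minimal sketch of the Pandas route (assuming the columns are named file_path and result, and that ~1000 rows fit comfortably on the driver):

# pull the small dataframe to the driver and write one file per row
pdf = df.select("file_path", "result").toPandas()
for row in pdf.itertuples(index=False):
    with open(row.file_path, "w") as f:
        f.write(row.result)
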
by elgeo, Valued Contributor II
  • 3886 Views
  • 0 replies
  • 2 kudos

SQL While do loops

Hello. Could you please suggest a workaround for a WHILE ... DO loop in Databricks SQL?
WHILE LSTART > 0 DO SET LSTRING = CONCAT(LSTRING, VSTRING2)
Thank you in advance.

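The thread has no replies, but one common workaround is to drive the loop from a Python cell in the same notebook; a sketch of the logic from the post (the starting values and the decrement are assumptions, since the post truncates before them):

lstart = 3          # illustrative starting value
lstring = ""
vstring2 = "abc"    # illustrative value
# emulates: WHILE LSTART > 0 DO SET LSTRING = CONCAT(LSTRING, VSTRING2)
while lstart > 0:
    lstring = lstring + vstring2
    lstart -= 1     # assumed decrement so the loop terminates
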
by Jackie, New Contributor II
  • 5584 Views
  • 4 replies
  • 6 kudos

Resolved! speed up a for loop in python (azure databrick)

Code example:
# a list of file paths
list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]
# copy all files above to this folder
dest_path = "/dbfs/mnt/..."
for file_path in list_files_path:
    # copy function
    copy_file(file_path, dest_path)
I am runni...

Latest Reply
Hemant
Valued Contributor II
  • 6 kudos

@Jackie Chan, what's the size of the data you want to copy? If it's large, then use ADF.

3 More Replies
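
Besides ADF for large volumes, the sequential loop itself can often be sped up by copying concurrently, since file copies are I/O-bound; a sketch using a thread pool (copy_file, list_files_path, and dest_path come from the question; the worker count is arbitrary):

import shutil
from concurrent.futures import ThreadPoolExecutor

def copy_file(file_path, dest_path):
    # /dbfs/... paths behave like local files, so shutil works here
    shutil.copy(file_path, dest_path)

with ThreadPoolExecutor(max_workers=16) as pool:
    pool.map(lambda p: copy_file(p, dest_path), list_files_path)
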
by Braxx, Contributor II
  • 2560 Views
  • 2 replies
  • 3 kudos

Resolved! issue with rounding selected column in "for in" loop

This must be trivial, but I must have missed something. I have a dataframe (test1) and want to round all the columns listed in a list of columns (col_list). Here is the code I am running:
col_list = ['measure1', 'measure2', 'measure3']
for i in col_list:...

Latest Reply
Braxx
Contributor II
  • 3 kudos

You're absolutely right. Thanks!

1 More Replies
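
The fix isn't quoted in the thread, but the usual pitfall in this pattern is that withColumn returns a new dataframe rather than mutating test1 in place; a sketch of the reassignment version (assuming rounding to two decimals was intended):

from pyspark.sql import functions as F

col_list = ['measure1', 'measure2', 'measure3']
for i in col_list:
    # reassign, otherwise the rounded column is silently discarded
    test1 = test1.withColumn(i, F.round(F.col(i), 2))
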
by bdc, New Contributor III
  • 6811 Views
  • 4 replies
  • 5 kudos

Resolved! Is it possible to show multiple cmd output in a dashboard?

I have a loop that outputs a dataframe for each value in a list. I can create a dashboard if there is only one df, but with the loop I'm only able to see the charts in the notebook if I switch the view to charts, not in the dashboard. In t...

Latest Reply
Wanda11
New Contributor II
  • 5 kudos

If you want to be able to easily run and kill multiple processes with Ctrl-C, this is my favorite method: spawn multiple background processes in a (…) subshell, and trap SIGINT to execute kill 0, which will kill everything spawned in the subshell group...

3 More Replies
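
On the dashboard question itself, a common pattern is to render each dataframe explicitly with display() inside the loop (a sketch; value_list and build_df are hypothetical, and whether each output can be pinned to a dashboard may depend on the notebook version):

for value in value_list:     # hypothetical list driving the loop
    df = build_df(value)     # hypothetical helper returning a dataframe
    display(df)              # Databricks renders one result per display() call
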
by pine, New Contributor III
  • 3305 Views
  • 5 replies
  • 4 kudos

Resolved! Databricks fails writing after writing ~30 files

Good day. Copy of https://stackoverflow.com/questions/69974301/looping-through-files-in-databricks-fails
I got 100 files of CSV data on an ADLS Gen1 store. I want to do some processing on them and save the results to the same drive, different directory. def look...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

Was anything actually created by the script in directory <my_output_dir>? The best would be to permanently mount the ADLS storage and use an Azure app for that. In Azure, please go to App registrations and register an app with a name, for example "databricks_mount". Ad...

4 More Replies
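
The mount the reply describes, sketched for ADLS Gen2 with a service principal (Gen1 uses different adl:// configuration keys; every ID and name below is a placeholder, and dbutils is the Databricks notebook utility):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/databricks_mount",
    extra_configs=configs,
)
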
by FernandoBenedet, New Contributor
  • 5161 Views
  • 2 replies
  • 0 kudos

Loop through Dataframe in Python

Hello. Imagine you have a dataframe with cols: A, B, C. I want to add a column D based on some calculations of columns B and C of the previous record of the df. What is the best way of doing this? I am trying to avoid looping through the df. I am u...

Latest Reply
quincybatten
New Contributor II
  • 0 kudos

Iterating through pandas DataFrame objects is generally slow. Iteration defeats the whole purpose of using a DataFrame. It is an anti-pattern and something you should only do when you have exhausted every other option. It is better to look for a...

1 More Replies
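
The vectorized alternative the reply points toward is a window function; a sketch using lag() to reach the previous record (the ordering column id and the B + C formula are placeholders, since the post truncates before the actual calculation):

from pyspark.sql import Window, functions as F

# rows need a deterministic order for "previous record" to mean anything
w = Window.orderBy("id")
df = df.withColumn("D", F.lag("B").over(w) + F.lag("C").over(w))
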