Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

joakon
by New Contributor III
  • 2600 Views
  • 5 replies
  • 1 kudos

Resolved! slow running query

Hi All, I would like to get some ideas on how to improve performance on a data frame with around 10M rows. adls-gen2: df1 = source1, format parquet (10M); df2 = source2, format parquet (10M); df = join of df1 and df2, type = inner join; df.count() is ...
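A minimal PySpark sketch of the setup described above, with placeholder ADLS Gen2 paths and an assumed join key (none of these names come from the post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder ADLS Gen2 paths; substitute the real source locations.
df1 = spark.read.parquet("abfss://container@account.dfs.core.windows.net/source1")  # ~10M rows
df2 = spark.read.parquet("abfss://container@account.dfs.core.windows.net/source2")  # ~10M rows

# Inner join on an assumed key column, then the count the post says is slow.
df = df1.join(df2, on="id", how="inner")
print(df.count())

With two sides of similar size, the usual levers are checking that the join key is not heavily skewed and letting Adaptive Query Execution manage the shuffle partitioning; neither 10M-row side is normally small enough to broadcast.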

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 1 kudos

Hey @raghu maremanda, did you get any answer? If yes, please update here so that other people can also get the solution.

  • 1 kudos
4 More Replies
AK98
by New Contributor II
  • 3781 Views
  • 4 replies
  • 0 kudos

Py4JJavaError when trying to write dataframe to delta table

I'm trying to write a dataframe to a delta table and am getting this error. I'm not sure what the issue is, as I had no problem successfully writing other dataframes to delta tables. I attached a snippet of the data as well, along with the schema:

[attached: image.png]
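The failing code isn't shown in the preview, but for reference a minimal Delta write, the operation the title describes, looks roughly like this (the target path is a placeholder, not from the post):

# Minimal Delta append; replace the path with the actual table location.
(df.write
   .format("delta")
   .mode("append")
   .save("/mnt/bronze/my_table"))

If the error appears only for this particular dataframe, a schema or type mismatch with the existing table is a common culprit, which is what the attached schema snippet would help confirm.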
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Ravi Teja, we haven't heard from you since the last response from @Rishabh Pandey, and I was checking back to see if his suggestions helped you. Otherwise, if you have any solution, please share it with the community, as it can be helpful to oth...

  • 0 kudos
3 More Replies
ratnakarsinha
by New Contributor II
  • 18853 Views
  • 3 replies
  • 0 kudos

How to get full result using DataFrame.Display method

Hi, the Dataframe.Display method in a Databricks notebook fetches only 1000 rows by default. Is there a way to change this default to display and download the full result (more than 1000 rows) in Python? Thanks, Ratnakar.

Latest Reply
ramravi
Contributor II
  • 0 kudos

The display method doesn't have an option to choose the number of rows; use the show method instead. It is not as neat, and you can't do visualizations or downloads.
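A small sketch of the workaround from the reply, assuming df is a PySpark DataFrame; the row limit of 5,000 is only an example:

# show() lets you choose how many rows to print, unlike display()'s 1000-row default.
df.show(n=5000, truncate=False)

# Alternatively, pull a bounded slice to pandas for download or export
# (only safe when the slice comfortably fits in driver memory).
pdf = df.limit(5000).toPandas()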

  • 0 kudos
2 More Replies
Mado
by Valued Contributor II
  • 6644 Views
  • 6 replies
  • 2 kudos

Resolved! How to see if condition is True / False for all rows in a DataFrame?

Assume that I have a Spark DataFrame, and I want to see if records satisfy a condition. Example dataset: # Prepare Data data = [('A', 1), ('A', 2), ('B', 3)] # Create DataFrame columns = ['col_1', 'col_2'] df = spark.createDataF...
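A hedged sketch of one common way to test whether every row satisfies a condition, reusing the example columns from the post; the condition itself (col_2 > 0) is illustrative and not taken from the thread:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Prepare data (completing the truncated snippet above).
data = [('A', 1), ('A', 2), ('B', 3)]
columns = ['col_1', 'col_2']
df = spark.createDataFrame(data, columns)

# All rows satisfy the condition exactly when no row violates it.
condition = F.col('col_2') > 0
all_rows_match = df.filter(~condition).count() == 0
print(all_rows_match)  # True for this dataset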

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 2 kudos

Hi, you can use the display() or show() function; that will give you the expected results.

  • 2 kudos
5 More Replies
joakon
by New Contributor III
  • 8800 Views
  • 7 replies
  • 6 kudos
Latest Reply
huyd
New Contributor III
  • 6 kudos

Check your read cell's "delimiter" option.
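The thread's original post isn't shown above; if the issue was a CSV read picking up the wrong separator, a minimal sketch of setting the delimiter explicitly looks like this (path and separator are placeholders):

df = (spark.read
      .option("header", "true")
      .option("delimiter", ";")        # set to the file's actual separator
      .csv("/mnt/raw/my_file.csv"))    # placeholder path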

  • 6 kudos
6 More Replies
Sujitha
by Community Manager
  • 1725 Views
  • 6 replies
  • 5 kudos

KB Feedback Discussion

In addition to the Databricks Community, we have a Support team that maintains a Knowledge Base (KB). The KB contains answers to common questions about Databricks, as well as information on optimisation and troubleshooting. Thes...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 5 kudos

Thanks for sharing @Sujitha Ramamoorthy​ 

  • 5 kudos
5 More Replies
James_209101
by New Contributor II
  • 4894 Views
  • 2 replies
  • 5 kudos

Using large dataframe in-memory (data not allowed to be "at rest") results in driver crash and/or out of memory

I'm having trouble working on Databricks with data that we are not allowed to save off or persist in any way. The data comes from an API (which returns a JSON response). We have a scala package on our cluster that makes the queries (almost 6k queries...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

Hi @James Held, hope all is well! Just wanted to check in to see whether you were able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

  • 5 kudos
1 More Replies
tassiodahora
by New Contributor III
  • 53366 Views
  • 3 replies
  • 9 kudos

Resolved! Failed to merge incompatible data types LongType and StringType

Guys, good morning! I am writing the results of a JSON into a delta table, but the JSON structure is not always the same; if the field is not present in the JSON, it generates a type incompatibility when I append (dfbrzagend.write .format("delta") .mode("ap...
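One common workaround for this kind of append-time type conflict, not necessarily the accepted answer in this thread, is to cast the incoming column to the type the Delta table already uses before appending. A hedged sketch, in which the column and table names are placeholders (only dfbrzagend comes from the post):

from pyspark.sql import functions as F

# Assume the target table stores agent_id as STRING, but this batch's JSON
# parsed it as a long; casting up front avoids the LongType/StringType merge error.
df_fixed = dfbrzagend.withColumn("agent_id", F.col("agent_id").cast("string"))

(df_fixed.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.agend"))   # placeholder table name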

Latest Reply
Kaniz_Fatma
Community Manager
  • 9 kudos

Hi @Tássio Santos, we haven't heard from you since the last response from @Chetan Kardekar, and I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community, as it can be helpful to others. Oth...

  • 9 kudos
2 More Replies
Manjusha
by New Contributor II
  • 1850 Views
  • 1 reply
  • 1 kudos

SocketTimeout exception when running a display command on spark dataframe

I am using runtime 9.1 LTS. I have an R notebook that reads a CSV into an R dataframe, does some transformations, and finally converts it to a Spark dataframe using the createDataFrame function. After that, when I call the display function on this Spark da...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Manjusha Unnikrishnan, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, Bricksters will get back to you soon. Thanks.

  • 1 kudos
rajat1
by New Contributor
  • 12389 Views
  • 3 replies
  • 2 kudos

How to convert a dataframe (df) to an Excel file that I can share with my colleagues?

I am working on Microsoft Azure Databricks. I have a final dataframe of shape (3276 x 23) that I want to share as an Excel file. How can I do it? (I am using df.to_excel('fileOutput.xlsx', sheet_name='Sheet1', index=False); the command is runn...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

You could try it this way: convert the PySpark DataFrame to a pandas DataFrame, then export it to an Excel file.
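A minimal sketch of the suggestion above, assuming df is the PySpark DataFrame of shape (3276 x 23) and that the openpyxl package is available on the cluster; the output paths are placeholders:

# Pull the (small) result to the driver as a pandas DataFrame.
pdf = df.toPandas()

# Write locally on the driver first (random writes straight to /dbfs can be
# unreliable), then copy the file into DBFS so it can be downloaded or shared.
local_path = "/tmp/fileOutput.xlsx"
pdf.to_excel(local_path, sheet_name="Sheet1", index=False)

dbutils.fs.cp(f"file:{local_path}", "dbfs:/FileStore/fileOutput.xlsx")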

  • 2 kudos
2 More Replies
wyzer
by Contributor II
  • 3928 Views
  • 2 replies
  • 12 kudos

Resolved! Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet files with this code: df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*") (Inside the Voucher folder there is one folder per date, each containing one parquet file.) How can I add a column into this DataFrame that...
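A hedged sketch of one way to approach this, assuming the per-date folder name encodes the date (a pattern like /Voucher/2023-01-31/..., which is an assumption, not stated in the thread): the source file path of each row can be exposed with input_file_name() and the date parsed out of it.

from pyspark.sql import functions as F

df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")

# Record each row's source file path, then extract the date folder from it.
df_with_date = (df
    .withColumn("source_file", F.input_file_name())
    .withColumn("folder_date",
                F.to_date(F.regexp_extract("source_file", r"/Voucher/(\d{4}-\d{2}-\d{2})/", 1))))

On newer Databricks runtimes, the hidden _metadata column also exposes per-file details such as file_modification_time, which may be closer to an actual creation timestamp than a folder name.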

Latest Reply
wyzer
Contributor II
  • 12 kudos

Thanks @Michail Karamanos​ 

  • 12 kudos
1 More Replies
markdias
by New Contributor II
  • 1307 Views
  • 3 replies
  • 2 kudos

Which is quicker: grouping a table that is a join of several others or querying data?

This may be a tricky question, so please bear with me. In a real-life scenario, I have a dataframe (I'm using PySpark) called age, which is a groupBy of 4 other dataframes. I join these 4, so at the end I have a few million rows, but after the groupBy th...

Latest Reply
NhatHoang
Valued Contributor II
  • 2 kudos

Hi @Marcos Dias, frankly, I think we need more detail to answer your question: Do these 4 dataframes have their data updated? How often do you use the groupBy dataframe?

  • 2 kudos
2 More Replies
Anonymous
by Not applicable
  • 7687 Views
  • 9 replies
  • 7 kudos

Resolved! data frame takes unusually long time to write for small data sets

We have configured the workspace with our own VPC. We need to extract data from DB2 and write it in delta format. We tried it for 550k records with 230 columns and it took 50 minutes to complete the task; 15M records takes more than 18 hours. Not sure why this takes suc...

Latest Reply
elgeo
Valued Contributor II
  • 7 kudos

Hello. We face exactly the same issue: reading is quick but writing takes a long time. Just to clarify, it is about a table with only 700k rows. Any suggestions please? Thank you. remote_table = spark.read.format("jdbc") \ .option("driver", "com...
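For context on the truncated snippet in the reply, here is a hedged sketch of a partitioned JDBC read followed by a Delta write. The driver class, URL, table, credentials, and partitioning bounds are all placeholders; the partitioning options are standard Spark JDBC settings that usually help when a single-connection read or an unpartitioned write is the real bottleneck.

remote_table = (spark.read.format("jdbc")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")          # placeholder driver class
    .option("url", "jdbc:db2://host:50000/MYDB")            # placeholder URL
    .option("dbtable", "SCHEMA.MY_TABLE")                   # placeholder table
    .option("user", "user").option("password", "secret")    # use a secret scope in practice
    # Spread the read across executors instead of a single connection:
    .option("partitionColumn", "ID")                        # numeric or date column, placeholder
    .option("lowerBound", "1")
    .option("upperBound", "700000")
    .option("numPartitions", "8")
    .option("fetchsize", "10000")
    .load())

remote_table.write.format("delta").mode("overwrite").saveAsTable("bronze.my_table")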

  • 7 kudos
8 More Replies