cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

jerry-xu-sa
by New Contributor II
  • 2170 Views
  • 2 replies
  • 1 kudos

Order of a dataframe is not perserved after calling cache() and limit()

Here are the simple steps to reproduce it. Note that col "foo" and "bar" are just redundant cols to make sure the dataframe doesn't fit into a single partition. // generate a random df val rand = new scala.util.Random val df = (1 to 3000).map(i => (r...

  • 2170 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Jerry Xu​ Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.Please help us select the best solution by clicking on "Select As Best" if it does.Your feedback wil...

  • 1 kudos
1 More Replies
Gk
by New Contributor III
  • 3065 Views
  • 2 replies
  • 1 kudos

DataFrame

How can we create empty dataframe in databricks and how many ways we can create dataframe?

  • 3065 Views
  • 2 replies
  • 1 kudos
Latest Reply
Vartika
Moderator
  • 1 kudos

Hi @Govardhana Reddy​ Hope everything is going great.Does @Suteja Kanuri​'s answer help? If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. Cheers!

  • 1 kudos
1 More Replies
andrew0117
by Contributor
  • 3159 Views
  • 4 replies
  • 0 kudos

Resolved! Can merge() function be applied to dataframe?

if I have two dataframes df_target and df_source, can I do df_target.as("t).merge(df_source.as("s"), "s.id=t.id").whenMatched().updateAll().whenNotMatched.insertAll.execute(). when I tried the code above, I got the error "merge is not a member of the...

  • 3159 Views
  • 4 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @andrew li​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you.Thanks!

  • 0 kudos
3 More Replies
nirajtanwar
by New Contributor
  • 1553 Views
  • 2 replies
  • 2 kudos

To collect the elements of a SparkDataFrame and coerces them into an R dataframe.

Hello Everyone,I am facing the challenge while collecting a spark dataframe into an R dataframe, this I need to do as I am using TraMineR algorithm whih is implemented in R only and the data pre-processing I have done in pysparkI am trying this:event...

  • 1553 Views
  • 2 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Niraj Tanwar​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you.Thank...

  • 2 kudos
1 More Replies
Dale_Ware
by New Contributor III
  • 2229 Views
  • 2 replies
  • 3 kudos

Resolved! How to query a table with backslashes in the name.

I am trying to query a snowflake table from a databricks data frame similar to the following example.sql_query = "select * from Database.Schema.Table_/Name_/V"sqlContext.sql(f"{sql_query}" ) And I get an error like this.ParseException: [PARSE_SYNTAX_...

  • 2229 Views
  • 2 replies
  • 3 kudos
Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 3 kudos

You can use Double Quotes to get the plan. Using quotes it is important to write the table names in capital letters.SELECT * FROM "/TABLE/NAME"

  • 3 kudos
1 More Replies
Merchiv
by New Contributor III
  • 11950 Views
  • 4 replies
  • 3 kudos

Resolved! How can I add a duration in milliseconds to a timestamp?

Let's say I have a DataFrame with a timestamp and an offset column in milliseconds respectively in the timestamp and long format. E.g.from datetime import datetime df = spark.createDataFrame( [ (datetime(2021, 1, 1), 1500, ), (dat...

  • 11950 Views
  • 4 replies
  • 3 kudos
Latest Reply
Merchiv
New Contributor III
  • 3 kudos

Although @Lakshay Goel​'s solution works, we've been using an alternative approach, that we found to be a bit more readable:from pyspark.sql import Column, functions as f     def make_dt_interval_sec(col: Column): return f.expr(f"make_dt_interval...

  • 3 kudos
3 More Replies
tinendra
by New Contributor III
  • 3388 Views
  • 7 replies
  • 8 kudos

Can we run pandas dataframe inside databricks?

Hi, I want to run df=pd.read_csv('/dbfs/FileStore/airlines1.csv') while trying to run getting error likeFileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/airlines1.csv'Could you please help me out how to run pandas dataframe in...

  • 3388 Views
  • 7 replies
  • 8 kudos
Latest Reply
Anonymous
Not applicable
  • 8 kudos

Hi @Tinendra Kumar​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you.Tha...

  • 8 kudos
6 More Replies
cristianc
by Contributor
  • 1291 Views
  • 2 replies
  • 0 kudos

Issue with visualizing dataframe from a job

Greetings,I have the following data set:```sqlSELECT * FROM ( VALUES ('2023-02',113.81::decimal(27,2),'A','X'), ('2023-02',112.66::decimal(27,2),'A','Y'), ('2023-02',1223.8::decimal(27,2),'B','X'), ('2023-02',1234.56::decimal(27,2),'B',...

  • 1291 Views
  • 2 replies
  • 0 kudos
Latest Reply
cristianc
Contributor
  • 0 kudos

Attaching some more screenshots to add more details.This seems to be a bug in the bar chart visualization widget when displaying from job run.

  • 0 kudos
1 More Replies
jonathan-dufaul
by Valued Contributor
  • 1327 Views
  • 2 replies
  • 0 kudos

Is there a function similar to display that downloads a dataframe?

I find myself constantly having to do display(df), and then "recompute with <5g records and download). I was just hoping I could skip the middleman and download from get go. ideally it'd be a function like download(df,num_rows="max") where num_rows i...

  • 1327 Views
  • 2 replies
  • 0 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

Question where do you want to download it to? If to cloud location, use regular DataFrameWriter. You can install, for example, Azure Storage Explorer on your computer. Some cloud storage you can even mount in your system as a folder or network share.

  • 0 kudos
1 More Replies
KrishZ
by Contributor
  • 13871 Views
  • 4 replies
  • 3 kudos

[Pyspark.Pandas] PicklingError: Could not serialize object (this error is happening only for large datasets)

Context: I am using pyspark.pandas in a Databricks jupyter notebook and doing some text manipulation within the dataframe..pyspark.pandas is the Pandas API on Spark and can be used exactly the same as usual PandasError: PicklingError: Could not seria...

  • 13871 Views
  • 4 replies
  • 3 kudos
Latest Reply
ryojikn
New Contributor III
  • 3 kudos

@Krishna Zanwar​ , i'm receiving the same error.​For me, the behavior is when trying to broadcast a random forest (sklearn 1.2.0) recently loaded from mlflow, and using Pandas UDF to predict a model.​However, the same code works perfectly on Spark 2....

  • 3 kudos
3 More Replies
jm99
by New Contributor III
  • 2819 Views
  • 1 replies
  • 1 kudos

Resolved! ForeachBatch() - Get results from batchDF._jdf.sparkSession().sql('merge stmt')

Most python examples show the structure of the foreachBatch method as:def foreachBatchFunc(batchDF, batchId): batchDF.createOrReplaceTempView('viewName') ( batchDF ._jdf.sparkSession() .sql( ...

  • 2819 Views
  • 1 replies
  • 1 kudos
Latest Reply
jm99
New Contributor III
  • 1 kudos

Just found a solution...Need to convert the Java Dataframe (jdf) to a DataFramefrom pyspark import sql   def batchFunc(batchDF, batchId): batchDF.createOrReplaceTempView('viewName') sparkSession = batchDF._jdf.sparkSession()   resJdf = sparkSes...

  • 1 kudos
lmcglone
by New Contributor II
  • 4233 Views
  • 2 replies
  • 3 kudos

Comparing 2 dataframes and create columns from values within a dataframe

Hi,I have a dataframe that has name and companyfrom pyspark.sql import SparkSessionspark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()columns = ["company","name"]data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

image
  • 4233 Views
  • 2 replies
  • 3 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

You need to join and pivotdf .join(df2, on=[df.company == df2.job_company])) .groupBy("company", "name") .pivot("job_company") .count()

  • 3 kudos
1 More Replies
Mado
by Valued Contributor II
  • 16182 Views
  • 1 replies
  • 0 kudos

Resolved! How to show all rows by "DataFrame.show()"?

Hi,DataFrame.show() has a parameter n to set "Number of rows to show".Is there any way to show all rows?

  • 16182 Views
  • 1 replies
  • 0 kudos
Latest Reply
sher
Valued Contributor II
  • 0 kudos

Hi Medothis method will work fine df.show(df.count())

  • 0 kudos
SIRIGIRI
by Contributor
  • 1609 Views
  • 3 replies
  • 2 kudos

sharikrishna26.medium.com

Spark Dataframes SchemaSchema inference is not reliable.We have the following problems in schema inference:Automatic inferring of schema is often incorrectInferring schema is additional work for Spark, and it takes some extra timeSchema inference is ...

  • 1609 Views
  • 3 replies
  • 2 kudos
Latest Reply
Varshith
New Contributor III
  • 2 kudos

one other difference between those 2 approaches is that In Schema DDL String approach we use STRING, INT etc.. But In Struct Type Object approach we can only use Spark datatypes such as StringType(), IntegerType(), etc..

  • 2 kudos
2 More Replies
SIRIGIRI
by Contributor
  • 704 Views
  • 2 replies
  • 2 kudos

sharikrishna26.medium.com

Spark Dataframe MetadataSpark Dataframe is structurally the same as the table. However, it does not store any schema information in the metadata store. Instead, we have a runtime metadata catalog to store the Dataframe schema information. It is simil...

  • 704 Views
  • 2 replies
  • 2 kudos
Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 2 kudos

this is awesome thanks

  • 2 kudos
1 More Replies
Labels