Topics with Label: Dataframe

Forum Posts


by Gk • New Contributor III

03-03-2023 3:18:10 AM

1551 Views
2 replies
1 kudos

DataFrame

How can we create an empty DataFrame in Databricks, and how many ways are there to create a DataFrame?
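For readers landing on this question, here is a minimal sketch of one way to get an empty DataFrame: pass an empty list of rows together with an explicit schema. The helper name `make_empty_df` and the example columns are assumptions made for illustration; `spark` is expected to be the SparkSession that Databricks notebooks predefine.

```python
# Schema written as a DDL string; the column names are illustrative only.
EXAMPLE_SCHEMA = "id INT, name STRING"


def make_empty_df(spark, schema: str = EXAMPLE_SCHEMA):
    """Return a DataFrame that has the given columns and zero rows.

    `spark` is assumed to be a SparkSession (predefined in Databricks notebooks).
    """
    # With no rows there is nothing to infer types from,
    # so the schema has to be supplied explicitly.
    return spark.createDataFrame([], schema=schema)
```

As for the "how many ways" part, commonly used routes include `spark.createDataFrame` (from local rows or a pandas DataFrame), `spark.read` (from files), `spark.sql`, `spark.table` and `spark.range`; this is not an exhaustive list.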

Data Engineering


Latest Reply

Vartika
Moderator

03-31-2023 12:09:26 AM

1 kudos

Hi @Govardhana Reddy, hope everything is going great. Does @Suteja Kanuri's answer help? If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. Cheers!


by andrew0117 • Contributor

03-26-2023 9:04:50 PM

1642 Views
4 replies
0 kudos

Resolved! Can merge() function be applied to dataframe?

If I have two dataframes, df_target and df_source, can I do df_target.as("t").merge(df_source.as("s"), "s.id=t.id").whenMatched().updateAll().whenNotMatched.insertAll.execute()? When I tried the code above, I got the error "merge is not a member of the...
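Some context on the error: `merge` is a method of Delta Lake's `DeltaTable`, not of a Spark DataFrame, which is consistent with the "not a member" message. One way to get the same upsert is a SQL `MERGE INTO` statement against a Delta target table. The sketch below only builds that statement; the helper name `build_merge_sql` and the table names in the usage note are hypothetical.

```python
def build_merge_sql(target: str, source: str, key: str = "id") -> str:
    """Build a Delta Lake MERGE statement that upserts `source` into `target`.

    `target` must name a Delta table; `source` can be a table or a temp view.
    """
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON s.{key} = t.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )
```

In a notebook this could be used by registering the source DataFrame with `df_source.createOrReplaceTempView("source_view")` and then running `spark.sql(build_merge_sql("target_table", "source_view"))`. The target has to be a Delta table rather than an in-memory DataFrame.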

Data Engineering


Latest Reply

Anonymous
Not applicable

03-27-2023 9:10:57 PM

0 kudos

Hi @andrew li, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Thanks!


by nirajtanwar • New Contributor

03-16-2023 4:29:00 AM

920 Views
2 replies
2 kudos

Collecting the elements of a SparkDataFrame and coercing them into an R data frame

Hello everyone, I am facing a challenge while collecting a Spark DataFrame into an R data frame. I need to do this because I am using the TraMineR algorithm, which is implemented in R only, and I have done the data pre-processing in PySpark. I am trying this: event...

Data Engineering


Latest Reply

Anonymous
Not applicable

03-17-2023 11:20:04 PM

2 kudos

Hi @Niraj Tanwar, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Thank...


by Dale_Ware • New Contributor III

03-14-2023 11:31:35 AM

1348 Views
2 replies
3 kudos

Resolved! How to query a table with backslashes in the name.

I am trying to query a Snowflake table from a Databricks data frame, similar to the following example.
sql_query = "select * from Database.Schema.Table_/Name_/V"
sqlContext.sql(f"{sql_query}")
And I get an error like this.
ParseException: [PARSE_SYNTAX_...

Data Engineering


Latest Reply

Aviral-Bhardwaj
Esteemed Contributor III

03-14-2023 9:20:17 PM

3 kudos

You can use double quotes to get the plan. When using quotes, it is important to write the table names in capital letters. SELECT * FROM "/TABLE/NAME"
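The quoting advice above can be sketched as a small helper that builds the query text. This assumes the SQL is executed by Snowflake (for example when passed through the Snowflake connector's `query` option); Spark's own SQL parser quotes identifiers with backticks instead. The function names are hypothetical.

```python
def quote_identifier(name: str) -> str:
    """Wrap a Snowflake identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(database: str, schema: str, table: str) -> str:
    """Build a SELECT that quotes only the table name, which holds the special characters."""
    return f"SELECT * FROM {database}.{schema}.{quote_identifier(table)}"
```

Quoted identifiers are matched case-sensitively in Snowflake, so the quoted name has to match the stored case exactly. That is why the reply recommends capital letters: names created without quotes are stored in upper case.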


by Merchiv • New Contributor III

02-03-2023 7:34:30 AM

7891 Views
4 replies
3 kudos

Resolved! How can I add a duration in milliseconds to a timestamp?

Let's say I have a DataFrame with a timestamp column and an offset column in milliseconds, in the timestamp and long formats respectively. E.g.
from datetime import datetime
df = spark.createDataFrame( [ (datetime(2021, 1, 1), 1500, ), (dat...

Data Engineering


Latest Reply

Merchiv
New Contributor III

03-01-2023 11:41:35 PM

3 kudos

Although @Lakshay Goel's solution works, we've been using an alternative approach that we found to be a bit more readable:
from pyspark.sql import Column, functions as f
def make_dt_interval_sec(col: Column): return f.expr(f"make_dt_interval...
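Since both solutions are cut off in this preview, here is a different, self-contained sketch of the same idea. It is not the approach from either reply: it converts the timestamp to epoch milliseconds with the Spark SQL function `unix_millis`, adds the offset, and converts back with `timestamp_millis` (both available from Spark 3.1). The helper names and the column names in the usage note are assumptions.

```python
from datetime import datetime, timedelta


def add_millis_sql(ts_col: str, ms_col: str) -> str:
    """Build a Spark SQL expression that adds `ms_col` milliseconds to `ts_col`."""
    # unix_millis() drops any precision finer than a millisecond.
    return f"timestamp_millis(unix_millis({ts_col}) + {ms_col})"


def add_millis_python(ts: datetime, ms: int) -> datetime:
    """Plain-Python reference for the same arithmetic, handy for checking results."""
    return ts + timedelta(milliseconds=ms)
```

With PySpark this might be applied as `df.withColumn("shifted", f.expr(add_millis_sql("timestamp", "offset")))`, where `timestamp` and `offset` stand in for the real column names.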


by tinendra • New Contributor III

10-28-2022 6:47:42 AM

1885 Views
7 replies
8 kudos

Can we run pandas dataframe inside databricks?

Hi, I want to run df=pd.read_csv('/dbfs/FileStore/airlines1.csv'), but while trying to run it I get an error like
FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/airlines1.csv'
Could you please help me out how to run pandas dataframe in...

Data Engineering


Latest Reply

Anonymous
Not applicable

01-08-2023 9:10:36 PM

8 kudos

Hi @Tinendra Kumar, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...


by cristianc • Contributor

02-01-2023 2:14:30 AM

829 Views
2 replies
0 kudos

Issue with visualizing dataframe from a job

Greetings, I have the following data set: SELECT * FROM ( VALUES ('2023-02',113.81::decimal(27,2),'A','X'), ('2023-02',112.66::decimal(27,2),'A','Y'), ('2023-02',1223.8::decimal(27,2),'B','X'), ('2023-02',1234.56::decimal(27,2),'B',...

Data Engineering


Latest Reply

cristianc
Contributor

02-02-2023 1:15:40 AM

0 kudos

Attaching some more screenshots to add more details. This seems to be a bug in the bar chart visualization widget when displaying from a job run.


by jonathan-dufaul • Valued Contributor

01-24-2023 7:28:41 AM

850 Views
2 replies
0 kudos

Is there a function similar to display that downloads a dataframe?

I find myself constantly having to do display(df), and then "recompute with <5g records and download". I was just hoping I could skip the middleman and download from the get-go. Ideally it'd be a function like download(df,num_rows="max") where num_rows i...

Data Engineering


Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-24-2023 10:26:35 AM

0 kudos

Question: where do you want to download it to? If to a cloud location, use the regular DataFrameWriter. You can install, for example, Azure Storage Explorer on your computer. Some cloud storage can even be mounted in your system as a folder or network share.


by KrishZ • Contributor

09-11-2022 7:49:10 AM

10385 Views
4 replies
3 kudos

[Pyspark.Pandas] PicklingError: Could not serialize object (this error is happening only for large datasets)

Context: I am using pyspark.pandas in a Databricks Jupyter notebook and doing some text manipulation within the dataframe. pyspark.pandas is the Pandas API on Spark and can be used exactly the same as usual Pandas.
Error: PicklingError: Could not seria...

Data Engineering


Latest Reply

ryojikn
New Contributor III

01-14-2023 9:06:21 PM

3 kudos

@Krishna Zanwar, I'm receiving the same error. For me, it happens when trying to broadcast a random forest (sklearn 1.2.0) recently loaded from mlflow, and using a Pandas UDF to predict with the model. However, the same code works perfectly on Spark 2....


by jm99 • New Contributor III

01-13-2023 1:20:49 AM

1909 Views
1 replies
1 kudos

Resolved! ForeachBatch() - Get results from batchDF._jdf.sparkSession().sql('merge stmt')

Most Python examples show the structure of the foreachBatch method as:
def foreachBatchFunc(batchDF, batchId):
    batchDF.createOrReplaceTempView('viewName')
    ( batchDF ._jdf.sparkSession() .sql( ...

Data Engineering


Latest Reply

jm99
New Contributor III

01-13-2023 4:14:57 AM

1 kudos

Just found a solution... Need to convert the Java DataFrame (jdf) to a DataFrame:
from pyspark import sql
def batchFunc(batchDF, batchId):
    batchDF.createOrReplaceTempView('viewName')
    sparkSession = batchDF._jdf.sparkSession()
    resJdf = sparkSes...
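As a possible alternative to wrapping the Java DataFrame by hand: from PySpark 3.3 onward, a DataFrame exposes a public `sparkSession` property, so the MERGE result can come back as a regular Python DataFrame directly. The sketch below assumes that version, a Delta target named `target_table` and a join key `id`; all of these names are hypothetical.

```python
# Hypothetical table, view and key names; adjust to the real ones.
MERGE_SQL = """
MERGE INTO target_table AS t
USING updates AS s
ON s.id = t.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""


def merge_batch(batch_df, batch_id):
    """foreachBatch handler that runs a MERGE and returns its result rows."""
    # The temp view lives in the micro-batch's own session,
    # so the SQL has to run through that same session.
    batch_df.createOrReplaceTempView("updates")
    result_df = batch_df.sparkSession.sql(MERGE_SQL)
    rows = result_df.collect()
    print(f"batch {batch_id}: {rows}")
    return rows
```

The handler would be wired up with `stream_df.writeStream.foreachBatch(merge_batch).start()`. Spark ignores the return value; it is returned here only so the function can be exercised outside a stream. On older PySpark versions the `_jdf` route described in the reply still applies.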


by lmcglone • New Contributor II

01-11-2023 8:08:37 AM

2558 Views
2 replies
3 kudos

Comparing 2 dataframes and create columns from values within a dataframe

Hi, I have a dataframe that has name and company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["company","name"]
data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

Data Engineering


Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-11-2023 8:59:13 AM

3 kudos

You need to join and pivot: df.join(df2, on=[df.company == df2.job_company]).groupBy("company", "name").pivot("job_company").count()


by databicky • Contributor II

01-02-2023 1:08:45 AM

9514 Views
12 replies
4 kudos

How can we write a pandas dataframe into Azure ADLS as an Excel file? When trying to write, it shows an error like "protocol not known 'abfss'".

Data Engineering


Latest Reply

FerArribas
Contributor

01-02-2023 1:25:43 PM

4 kudos

Hi @Hubert Dudek,
The pandas API doesn't support the abfss protocol. You have three options:
1. If you need to use pandas, you can write the Excel file to the local file system (dbfs) and then move it to ABFSS (for example with dbutils).
2. Write as CSV directly in abfss...
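The first option (write locally, then move) can be sketched as a small generic helper. This is only an outline of the pattern: `write_via_local` is a hypothetical name, and the writer and mover are passed in as callables so that the helper itself has no Databricks dependency.

```python
import os
import shutil
import tempfile


def write_via_local(write_fn, move_fn, remote_path, suffix=".xlsx"):
    """Write a file to local temp storage, then hand it to `move_fn` for upload."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the writer opens the file itself
    try:
        write_fn(local_path)
        move_fn(local_path, remote_path)
    finally:
        # Clean up the staging file if the mover copied rather than moved it.
        if os.path.exists(local_path):
            os.remove(local_path)
    return remote_path
```

On Databricks the callables might be `lambda p: pdf.to_excel(p, index=False)` for the writer (with `pdf` being the pandas DataFrame and an Excel engine such as openpyxl installed) and `lambda src, dst: dbutils.fs.cp(f"file:{src}", dst)` for the mover, with `remote_path` set to the `abfss://` destination.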


by Mado • Valued Contributor II

12-15-2022 3:02:41 AM

10320 Views
1 replies
0 kudos

Resolved! How to show all rows by "DataFrame.show()"?

Hi, DataFrame.show() has a parameter n to set "Number of rows to show". Is there any way to show all rows?

Data Engineering


Latest Reply

sher
Valued Contributor II

01-03-2023 8:43:29 PM

0 kudos

Hi Medo, this method will work fine: df.show(df.count())
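The suggestion above can be wrapped in a small helper. The name `show_all` is hypothetical; it relies only on the `count()` and `show(n, truncate)` methods that PySpark DataFrames provide.

```python
def show_all(df, truncate=False):
    """Print every row of `df` and return the number of rows shown."""
    # count() runs a full job over the data before show() runs another one.
    n = df.count()
    # truncate=False also keeps long cell values from being cut off.
    df.show(n, truncate=truncate)
    return n
```

Because every row is brought to the driver and printed, this is only practical for small DataFrames.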


by SIRIGIRI • Contributor

12-31-2022 5:38:45 AM

848 Views
3 replies
2 kudos

sharikrishna26.medium.com

Spark Dataframes Schema
Schema inference is not reliable. We have the following problems in schema inference:
- Automatic inferring of schema is often incorrect
- Inferring schema is additional work for Spark, and it takes some extra time
- Schema inference is ...

Data Engineering


Latest Reply

Varshith
New Contributor III

01-01-2023 7:05:25 PM

2 kudos

One other difference between those two approaches is that in the schema DDL string approach we use STRING, INT, etc., but in the StructType object approach we can only use Spark data types such as StringType(), IntegerType(), etc.


by SIRIGIRI • Contributor

12-26-2022 8:07:02 AM

380 Views
2 replies
2 kudos

sharikrishna26.medium.com

Spark Dataframe Metadata
Spark Dataframe is structurally the same as a table. However, it does not store any schema information in the metadata store. Instead, we have a runtime metadata catalog to store the Dataframe schema information. It is simil...

Data Engineering


Latest Reply

Aviral-Bhardwaj
Esteemed Contributor III

12-26-2022 5:00:51 PM

2 kudos

this is awesome thanks
