Data Engineering

Forum Posts

tinendra
by New Contributor III
  • 2095 Views
  • 5 replies
  • 5 kudos

How to reduce time while loading data into the azure synapse table?

Hi All, I just wanted to know whether there is any option to reduce the time taken to load a PySpark DataFrame into an Azure Synapse table using Databricks. I have a PySpark DataFrame with around 40k records and I am trying to load the data into the Azure ...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

Hi @Tinendra Kumar Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

4 More Replies
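A hedged sketch of one way to speed up this kind of load (not taken from the thread): write through the dedicated Azure Synapse ("sqldw") connector, which stages the data in cloud storage and loads it with PolyBase/COPY instead of row-by-row JDBC inserts. Every value in angle brackets below is a placeholder, not something from the original post.

# Minimal sketch, assuming the Databricks Synapse ("sqldw") connector and an
# ADLS Gen2 staging location are available; all connection values are placeholders.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
   .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tmp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "<schema>.<table>")
   .mode("append")
   .save())

For only ~40k rows, tuning the plain JDBC write (e.g. the batchsize option) can also help, but the staged COPY path above is usually the faster route.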
pramalin
by New Contributor
  • 1507 Views
  • 3 replies
  • 2 kudos
Latest Reply
shan_chandra
Honored Contributor III
  • 2 kudos

@prudhvi ramalingam - Please refer to the below example code.
import org.apache.spark.sql.functions.expr
val person = Seq(
  (0, "Bill Chambers", 0, Seq(100)),
  (1, "Matei Zaharia", 1, Seq(500, 250, 100)),
  (2, "Michael Armbrust", 1, Seq(250,...

2 More Replies
BF
by New Contributor II
  • 3253 Views
  • 3 replies
  • 2 kudos

Resolved! Pyspark - How do I convert date/timestamp of format like /Date(1593786688000+0200)/ in pyspark?

Hi all, I have a dataframe with a CreateDate column in this format:
CreateDate
/Date(1593786688000+0200)/
/Date(1446032157000+0100)/
/Date(1533904635000+0200)/
/Date(1447839805000+0100)/
/Date(1589451249000+0200)/
and I want to convert that format to date/tim...

Latest Reply
Chaitanya_Raju
Honored Contributor
  • 2 kudos

Hi @Bruno Franco, can you please try the below code, hope it might work for you.
from pyspark.sql.functions import from_unixtime
from pyspark.sql import functions as F
final_df = df_src.withColumn("Final_Timestamp", from_unixtime((F.regexp_extract(col("Cr...

2 More Replies
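The quoted reply is truncated, so here is a hedged reconstruction of the general approach rather than the author's exact code: extract the epoch milliseconds from a /Date(1593786688000+0200)/ string, convert to seconds, and cast to a timestamp. Note the timezone offset is ignored in this sketch.

from pyspark.sql import functions as F

# Pull the millisecond epoch out of strings like /Date(1593786688000+0200)/,
# convert to seconds, then to a timestamp. The "+0200" offset is not applied here.
final_df = df_src.withColumn(
    "Final_Timestamp",
    F.from_unixtime(
        F.regexp_extract(F.col("CreateDate"), r"/Date\((\d+)", 1).cast("long") / 1000
    ).cast("timestamp"),
)
final_df.select("CreateDate", "Final_Timestamp").show(truncate=False)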
rammy
by Contributor III
  • 1375 Views
  • 2 replies
  • 3 kudos

How can we save a data frame in Docx format using pyspark?

I am trying to save a data frame into a document, but it fails with the error below:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.htm
#f_d...

Latest Reply
jose_gonzalez
Moderator
  • 3 kudos

Hi, you cannot do it from PySpark, but you can try to use Pandas to save to Excel. There is no Docx data source.

1 More Replies
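As a hedged sketch of a workaround not given in the thread: since Spark has no "docx" writer, collect the (small) DataFrame to pandas on the driver and build a Word table with the python-docx package. The package choice, output path, and the assumption that the data fits on the driver are all mine.

from docx import Document   # assumes python-docx is installed, e.g. %pip install python-docx

pdf = df.toPandas()          # only safe for small DataFrames

doc = Document()
table = doc.add_table(rows=1, cols=len(pdf.columns))
for cell, name in zip(table.rows[0].cells, pdf.columns):
    cell.text = str(name)                      # header row
for _, row in pdf.iterrows():
    for cell, value in zip(table.add_row().cells, row):
        cell.text = str(value)                 # one table row per DataFrame row

doc.save("/dbfs/tmp/output.docx")              # placeholder path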
KrishZ
by Contributor
  • 10432 Views
  • 4 replies
  • 3 kudos

[Pyspark.Pandas] PicklingError: Could not serialize object (this error is happening only for large datasets)

Context: I am using pyspark.pandas in a Databricks Jupyter notebook and doing some text manipulation within the dataframe. pyspark.pandas is the Pandas API on Spark and can be used exactly the same as usual Pandas. Error: PicklingError: Could not seria...

Latest Reply
ryojikn
New Contributor III
  • 3 kudos

@Krishna Zanwar, I'm receiving the same error. For me, it happens when trying to broadcast a random forest (sklearn 1.2.0) recently loaded from MLflow, and then using a Pandas UDF to run predictions with the model. However, the same code works perfectly on Spark 2....

3 More Replies
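A hedged sketch of one workaround for this kind of PicklingError (my suggestion, not from the thread): skip broadcasting the sklearn object entirely and let MLflow build the scoring UDF, so the model is deserialized on the executors instead of being pickled by Spark. The model URI and feature column names are placeholders.

import mlflow.pyfunc

MODEL_URI = "models:/my_random_forest/Production"   # placeholder

# mlflow.pyfunc.spark_udf wraps the registered model as a Spark UDF; the model
# is loaded on each executor rather than broadcast from the driver.
predict = mlflow.pyfunc.spark_udf(spark, model_uri=MODEL_URI, result_type="double")

scored = df.withColumn("prediction", predict("feature_1", "feature_2"))  # placeholder columns
scored.select("prediction").show(5)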
SRK
by Contributor III
  • 4537 Views
  • 2 replies
  • 0 kudos

How to get the count of dataframe rows when reading through spark.readstream using batch jobs?

I am trying to read messages from a Kafka topic using spark.readStream. I am using the following code to read it.
My code:
df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")
  .option("subscr...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

You can try this approach: https://stackoverflow.com/questions/57568038/how-to-see-the-dataframe-in-the-console-equivalent-of-show-for-structured-st/62161733#62161733
readStream runs a thread in the background, so there is no easy way to do something like df.show().

1 More Replies
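A hedged sketch of the foreachBatch pattern the linked answer describes: each micro-batch is handed to the callback as a regular, non-streaming DataFrame, so count() and show() work there. The checkpoint path and trigger choice are my assumptions.

def log_batch(batch_df, batch_id):
    # batch_df is a normal DataFrame, so actions like count()/show() are fine here
    print(f"batch {batch_id}: {batch_df.count()} rows")
    batch_df.show(5, truncate=False)

(df.writeStream
   .foreachBatch(log_batch)
   .option("checkpointLocation", "/tmp/checkpoints/kafka_row_count")  # placeholder path
   .trigger(availableNow=True)   # Spark 3.3+/recent DBR; use .trigger(once=True) on older versions
   .start())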
Ancil
by Contributor II
  • 9542 Views
  • 11 replies
  • 1 kudos

Can anyone please suggest how we can effectively loop through a PySpark DataFrame?

Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your dataframe is small, as you said, only about 1000 rows, you may consider using Pandas. Thanks.

10 More Replies
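A minimal sketch of the pandas-on-the-driver route suggested above, assuming the DataFrame really is small (~1000 rows); the column names file_path and result are hypothetical stand-ins for the poster's columns.

pdf = df.select("file_path", "result").toPandas()   # collect the small DataFrame to the driver

for row in pdf.itertuples(index=False):
    # write each row's result data to its target path; dbutils.fs.put could be
    # used instead for DBFS/cloud storage paths
    with open(row.file_path, "w") as f:
        f.write(str(row.result))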
DK03
by Contributor
  • 968 Views
  • 2 replies
  • 2 kudos
Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

As @Werner Stinckens said, it would be OK. But generally, joins on decimal columns are not recommended, as other factors come into play, like the precision, length, etc. Also, when you are joining on decimal columns, be sure to check the abs value of...

1 More Replies
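The original question is not shown above, so this is only a hedged illustration of the point made in the reply: casting both join keys to the same decimal precision and scale before joining avoids surprises from mismatched precision. All DataFrame and column names are hypothetical.

from pyspark.sql import functions as F

# Normalize both sides to an agreed precision/scale before joining.
left_n = left_df.withColumn("amount", F.col("amount").cast("decimal(18,2)"))
right_n = right_df.withColumn("amount", F.col("amount").cast("decimal(18,2)"))

joined = left_n.join(right_n, on="amount", how="inner")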
Mado
by Valued Contributor II
  • 19337 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset:
# Prepare Data
data = [("A", "A", 1), \
        ("A", "A", 2), \
        ("A", "A", 3), \
        ("A", "B", 4), \
        ("A", "B", 5), \
        ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark will keep a random row. Therefore, you should define your own logic to handle the duplicated rows.

2 More Replies
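A hedged sketch of one common way to keep every occurrence of a duplicate (rather than one row per group, which dropDuplicates gives): count rows per key over a window and filter. The column names are illustrative, matching the three-column sample data in the question.

from pyspark.sql import functions as F
from pyspark.sql import Window

data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Count occurrences per (col1, col2) and keep only groups that appear more than once.
w = Window.partitionBy("col1", "col2")
duplicates = (df.withColumn("cnt", F.count("*").over(w))
                .filter(F.col("cnt") > 1)
                .drop("cnt"))
duplicates.show()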
rammy
by Contributor III
  • 1584 Views
  • 3 replies
  • 11 kudos

How would I retrieve JSON data with namespaces using Spark SQL?

File.json in the code below contains large JSON data in which each key carries a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve records if the JSON does not contain namespaces, but what could be the approach to retrieve record...

Latest Reply
SS2
Valued Contributor
  • 11 kudos

In case of a struct, you can use the dot (.) notation for extracting the value.

2 More Replies
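A hedged sketch of how namespace-prefixed keys can be addressed (not from the thread): backticks let a Spark column reference contain the ':' from the namespace prefix, combined with the dot notation mentioned above for nested structs. The file path and field names are hypothetical.

from pyspark.sql import functions as F

df = spark.read.option("multiLine", "true").json("/tmp/File.json")   # placeholder path

# Backtick-quote each name segment that carries a namespace prefix such as "ns:".
result = df.select(
    F.col("`ns:order`.`ns:orderId`").alias("order_id"),
    F.col("`ns:order`.`ns:customer`.`ns:name`").alias("customer_name"),
)
result.show(truncate=False)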
tassiodahora
by New Contributor III
  • 43497 Views
  • 3 replies
  • 8 kudos

Resolved! Failed to merge incompatible data types LongType and StringType

Guys, good morning! I am writing the results of a JSON into a Delta table, but the JSON structure is not always the same; if a field is missing from the JSON, it generates a type incompatibility when I append:
dfbrzagend.write
  .format("delta")
  .mode("ap...

Latest Reply
Kaniz
Community Manager
  • 8 kudos

Hi @Tássio Santos​ , We haven’t heard from you on the last response from @Chetan Kardekar​ , and I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community as it can be helpful to others. Oth...

2 More Replies
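A hedged sketch of one common fix (my suggestion, not confirmed in the thread): explicitly align the incoming DataFrame with the target Delta table's schema before appending, so a field that arrives as a long in one payload and a string in another is always written with one agreed type. The table path is a placeholder.

from pyspark.sql import functions as F

TARGET_PATH = "/mnt/bronze/agend"   # placeholder

target_schema = spark.read.format("delta").load(TARGET_PATH).schema

# Cast existing columns to the target type and fill missing ones with typed nulls.
aligned = dfbrzagend.select([
    F.col(f.name).cast(f.dataType) if f.name in dfbrzagend.columns
    else F.lit(None).cast(f.dataType).alias(f.name)
    for f in target_schema.fields
])

aligned.write.format("delta").mode("append").save(TARGET_PATH)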
rajat1
by New Contributor
  • 9785 Views
  • 3 replies
  • 2 kudos

How to convert a dataframe (df) to an Excel file that I can share with my colleagues?

I am working on Microsoft Azure Databricks. I have a final dataframe of shape (3276 x 23) and I want to share it as an Excel file. How can I do it? (I am using df.to_excel('fileOutput.xlsx', sheet_name='Sheet1', index=False), the command is runn...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

You could try it this way: convert the PySpark DataFrame to a Pandas DataFrame, then export it to an Excel file.

2 More Replies
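A minimal sketch of that suggestion, assuming the openpyxl package is available on the cluster and the DataFrame is small enough to collect; the /dbfs/FileStore output path is my choice so the file can be downloaded from the workspace afterwards.

pdf = df.toPandas()   # collect the (small) PySpark DataFrame to the driver
pdf.to_excel("/dbfs/FileStore/fileOutput.xlsx", sheet_name="Sheet1", index=False)
# The file can then be downloaded from the workspace's FileStore or copied out with dbutils.fs.cp.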
wyzer
by Contributor II
  • 2703 Views
  • 2 replies
  • 12 kudos

Resolved! Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet files with this code:
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Inside the Voucher folder, there is one folder per date, each one containing one parquet file.) How can I add a column into this DataFrame that...

Latest Reply
wyzer
Contributor II
  • 12 kudos

Thanks @Michail Karamanos​ 

1 More Replies
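The accepted answer is not shown above, so here is a hedged sketch of two common ways to attach a per-file date: parsing the date folder out of the file path, or (on recent runtimes) reading the hidden _metadata column. The folder-name pattern is an assumption.

from pyspark.sql import functions as F

df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")

# 1) Derive the date from the path, assuming .../Voucher/<yyyy-MM-dd>/<file>.parquet
df_with_date = df.withColumn(
    "load_date",
    F.to_date(F.regexp_extract(F.input_file_name(), r"Voucher/(\d{4}-\d{2}-\d{2})/", 1)),
)

# 2) On recent Spark/Databricks runtimes, file metadata is exposed directly:
# df_meta = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*") \
#                .select("*", "_metadata.file_modification_time")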
Mado
by Valued Contributor II
  • 2889 Views
  • 4 replies
  • 2 kudos

Resolved! Pandas API on Spark, Does it run on a multi-node cluster?

Hi, I have a few questions about "Pandas API on Spark". Thanks for taking the time to read my questions.
1) Is the input to these functions a Pandas DataFrame or a PySpark DataFrame?
2) When I use any pandas function (like isna, size, apply, where, etc.), does it ru...

Latest Reply
Debayan
Esteemed Contributor III
  • 2 kudos

Hi @Mohammad Saber, a Pandas dataset lives on a single machine and is naturally iterable locally within that machine. However, a pandas-on-Spark dataset lives across multiple machines, and computations on it run in a distributed manner. It is difficu...

3 More Replies
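A minimal sketch illustrating the point above, assuming Spark 3.2+ where pandas-on-Spark ships with PySpark: a Spark DataFrame is exposed through the pandas API with pandas_api(), pandas-style calls on it are planned as distributed Spark jobs, and only explicitly collected results land on a single node.

sdf = spark.range(1_000_000)          # regular, distributed Spark DataFrame
psdf = sdf.pandas_api()               # pandas-on-Spark view of the same distributed data

print(psdf.isna().sum())              # planned and executed as Spark jobs across the cluster
local_pdf = psdf.head(5).to_pandas()  # only this small result is collected to the driver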
Mado
by Valued Contributor II
  • 1274 Views
  • 2 replies
  • 3 kudos

How to apply Pandas functions on PySpark DataFrame?

Hi, I want to apply Pandas functions (like isna, concat, append, etc.) on a PySpark DataFrame in such a way that the computations are done on a multi-node cluster. I don't want to convert the PySpark DataFrame into a Pandas DataFrame since, I think, only one node is...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

The best option is to use pandas on Spark; it is virtually interchangeable, it is just a different API for the Spark DataFrame:
import pyspark.pandas as ps

psdf = ps.range(10)
sdf = psdf.to_spark().filter("id > 5")
sdf.show()

1 More Replies