Since DBR 10.4 LTS we have low shuffle merge, so merge is faster. But what about the MERGE INTO command that we run in a SQL notebook in Databricks? Is there any performance difference when we use the Databricks PySpark ".merge" function vs Databricks...
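Both paths compile down to the same Delta MERGE command, so low shuffle merge should apply either way. A minimal sketch (table, view, and column names are made up) showing the SQL text that corresponds to a typical `.merge` chain; the PySpark form is shown in the comment because it needs a cluster to run:

```python
# The PySpark equivalent (requires a Databricks cluster / delta package):
#   from delta.tables import DeltaTable
#   target = DeltaTable.forName(spark, "target_tbl")
#   (target.alias("t")
#          .merge(stage_df.alias("s"), "t.id = s.id")
#          .whenMatchedUpdateAll()
#          .whenNotMatchedInsertAll()
#          .execute())

def build_merge_sql(target: str, source: str, condition: str) -> str:
    """Render the SQL MERGE INTO text equivalent to the chain above.
    Hypothetical helper for illustration only."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON {condition} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

print(build_merge_sql("target_tbl", "stage_view", "t.id = s.id"))
```

`whenMatchedUpdateAll()` maps to `UPDATE SET *` and `whenNotMatchedInsertAll()` to `INSERT *`, which is why the two forms should perform the same for the same data.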
Hi @Roshan RC​, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...
After Spark finishes writing the DataFrame to S3, it seems like it checks the validity of the files it wrote with `getFileStatus`, which is a `HeadObject` call behind the scenes. What if I'm only granted write and list-objects permissions but not GetObject? I...
I have a table called MetaData, and the columns needed in the select are stored in MetaData.columns. I would like to read the columns dynamically from MetaData.columns and create a view based on that.
csv_values = "col1, col2, col3, col4"
df = spark.crea...
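One way to sketch this (view and table names are assumptions): parse the comma-separated column string coming out of MetaData and build the CREATE VIEW statement as text; on a cluster, `spark.sql(view_sql)` would then execute it.

```python
# Stand-in for the value that would be read from MetaData.columns.
csv_values = "col1, col2, col3, col4"

def build_view_sql(view_name: str, source_table: str, csv_columns: str) -> str:
    """Turn 'col1, col2, ...' into a CREATE OR REPLACE TEMP VIEW statement."""
    columns = [c.strip() for c in csv_columns.split(",") if c.strip()]
    select_list = ", ".join(columns)
    return (f"CREATE OR REPLACE TEMP VIEW {view_name} AS "
            f"SELECT {select_list} FROM {source_table}")

view_sql = build_view_sql("meta_view", "source_tbl", csv_values)
print(view_sql)  # On Databricks: spark.sql(view_sql)
```

Building the statement as text keeps the column list fully dynamic, at the cost of having to trust (or validate) the contents of MetaData.columns.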
I have a pyspark dataframe that contains information about the tables that I have on sql database (creation date, number of rows, etc)Sample data: {
"Day":"2023-04-28",
"Environment":"dev",
"DatabaseName":"default",
"TableName":"discount"...
@Bruno Simoes​: Yes, it is possible to write a PySpark DataFrame to a custom log table in a Log Analytics workspace using the Azure Log Analytics Workspace API. Here's a high-level overview of the steps you can follow: Create an Azure Log Analytics Works...
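The trickiest step is usually authenticating the POST. A hedged sketch of the signing step for the (legacy) Log Analytics HTTP Data Collector API, under the assumption that the DataFrame rows have already been serialized to a JSON payload (e.g. via `df.toJSON().collect()`); the workspace id, key, and sizes below are dummies:

```python
import base64
import hashlib
import hmac

def build_signature(customer_id: str, shared_key: str, date: str,
                    content_length: int) -> str:
    """Build the SharedKey Authorization header for the Azure Log
    Analytics HTTP Data Collector API (illustrative values only)."""
    string_to_hash = (f"POST\n{content_length}\napplication/json\n"
                      f"x-ms-date:{date}\n/api/logs")
    decoded_key = base64.b64decode(shared_key)
    hashed = hmac.new(decoded_key, string_to_hash.encode("utf-8"),
                      hashlib.sha256).digest()
    return f"SharedKey {customer_id}:{base64.b64encode(hashed).decode()}"

# The signed POST would then go to:
# https://{customer_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01
auth = build_signature("workspace-id",
                       base64.b64encode(b"dummy-key").decode(),
                       "Mon, 01 Jan 2024 00:00:00 GMT", 128)
print(auth)
```

The custom table name goes in the `Log-Type` request header; Log Analytics appends `_CL` to it on ingestion.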
I have a statement like this in PySpark:
target_tbl.alias("target") \
    .merge(stage_df.hint("broadcast").alias("source"), merge_join_expr) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .w...
Hello everyone, I am new to Databricks, so I am at the learning stage. It would be very helpful if someone could help resolve the issue or, I should say, help me fix my code. I have built a query that fetches data based on a CASE; in the CASE I have a ...
I want to apply a pivot on a DataFrame in DLT, but I'm getting the following warning:
Notebook:XXXX used `GroupedData.pivot` function that will be deprecated soon. Please fix the notebook.
I get the same warning if I use the function collect. Is it risk...
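One common workaround when `GroupedData.pivot` is flagged is to pivot with explicit conditional aggregation over a known, fixed list of pivot values, which stays in the plain grouped-aggregation API. A sketch that builds the equivalent SQL as text (table, column, and value names here are assumptions):

```python
def build_manual_pivot_sql(table: str, key_col: str, pivot_col: str,
                           value_col: str, pivot_values: list) -> str:
    """Emit a GROUP BY + CASE WHEN aggregation equivalent to pivoting
    value_col by pivot_col over a fixed list of values."""
    cases = ", ".join(
        f"SUM(CASE WHEN {pivot_col} = '{v}' THEN {value_col} END) AS `{v}`"
        for v in pivot_values
    )
    return f"SELECT {key_col}, {cases} FROM {table} GROUP BY {key_col}"

sql = build_manual_pivot_sql("sales", "region", "quarter", "amount",
                             ["Q1", "Q2"])
print(sql)
```

The trade-off vs `pivot` is that the value list must be known up front, but that is exactly what makes the query statically analyzable, which is what DLT prefers.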
I am using the Delta format and occasionally get the following error: "xx.parquet referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement." FS...
## Delta check when a file was added
%scala
(oldestVersionAvailable to newestVersionAvailable).map { version =>
  val df = spark.read.json(f"<delta-table-location>/_delta_log/$version%020d.json").where("add is not null").select("add.path")
  var ...
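The same check can be sketched in plain Python, since each `_delta_log/00...N.json` commit file is newline-delimited JSON and files added in that commit appear under the `add` key. A hypothetical parser, no Spark needed:

```python
import json

def added_paths(delta_log_lines):
    """Return the data-file paths recorded as 'add' actions in one
    _delta_log commit file (newline-delimited JSON)."""
    paths = []
    for line in delta_log_lines:
        action = json.loads(line)
        if "add" in action:
            paths.append(action["add"]["path"])
    return paths

# Two fabricated actions standing in for a real commit file:
commit = [
    '{"add": {"path": "part-00000-xx.parquet"}}',
    '{"commitInfo": {"operation": "WRITE"}}',
]
print(added_paths(commit))  # -> ['part-00000-xx.parquet']
```

Scanning every commit file this way tells you in which table version the missing parquet file was added, which helps narrow down when it was deleted from storage.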
I want to reuse the Spark session that was created in one notebook in another notebook in the same environment. For example, if some (variable) object got initialized in the first notebook, I need to use the same object in t...
I have noticed some inconsistent behavior between calling the 'split' function on Databricks and on my local installation. Running it in a Databricks notebook gives
spark.sql("SELECT split('abc', ''), size(split('abc',''))").show()
So the string is split...
@Ivo Merchiers​: The behavior you are seeing is likely due to differences in the underlying version of Apache Spark between your local installation and Databricks. split() is a function provided by Spark's SQL functions, and different versions of Spa...
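That empty-pattern splits are engine- and version-dependent is easy to see even outside Spark: Python's own `re` module changed its behavior in 3.7. This is only an illustration of the general point, not of Spark's exact semantics:

```python
import re

# Since Python 3.7, re.split with a pattern that can match the empty
# string splits between every character and keeps the empty edges.
# Before 3.7, the same call raised ValueError.
parts = re.split("", "abc")
print(parts)       # ['', 'a', 'b', 'c', '']
print(len(parts))  # 5
```

So rather than relying on a particular engine's treatment of the empty pattern, it is safer to pin the Spark version on both sides or split on an explicit per-character pattern.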
I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark.
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...
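For driver-side copies, a thread pool is usually simpler than trying to fan the work out over executors, since `dbutils` is not serializable to workers. A local-filesystem sketch with the same signature; on Databricks the body of `copy_one` would call `dbutils.fs.cp` instead of `shutil`:

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def parallel_copy_execution(src_path: str, target_path: str,
                            max_workers: int = 8) -> int:
    """Copy every regular file in src_path to target_path concurrently.
    Returns the number of files copied."""
    os.makedirs(target_path, exist_ok=True)
    files = [f for f in os.listdir(src_path)
             if os.path.isfile(os.path.join(src_path, f))]

    def copy_one(name: str) -> None:
        # On Databricks this line would be dbutils.fs.cp(src, dst).
        shutil.copy(os.path.join(src_path, name),
                    os.path.join(target_path, name))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all futures to complete and re-raises any errors.
        list(pool.map(copy_one, files))
    return len(files)
```

Threads work well here because the copies are I/O-bound; `max_workers` caps the number of concurrent requests against the storage layer.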
If you have a Spark session, you can use Spark's hidden Hadoop FileSystem API:
# Get the FileSystem from the SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get the Path class to convert a string path to an FS path
path = spark._...
The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (+ watermark logic). In the current implementation, if you stop incoming streaming data, the last window will NEVER...
No, the problem remains the same. The meaning doesn't change because you increased the timeout a little bit, since the window did not close, and will not close until a new message arrives.
When I call RDD APIs during pytest, it seems like the module "serializer.py" cannot find any other modules under pyspark. I've already looked it up on the internet, and it seems like pyspark modules are not properly importing other referenced modules. I see ot...
@hyunho lee​ : It sounds like you are encountering an issue with PySpark's serializer not being able to find the necessary modules during testing with Pytest. One solution you could try is to set the PYTHONPATH environment variable to include the pat...
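A sketch of that PYTHONPATH fix; the Spark install location and the py4j zip version vary by environment, so both paths below are assumptions to adapt:

```shell
# Assumed layout: SPARK_HOME points at a local Spark install; the py4j
# zip name depends on your Spark version, so adjust the glob/version.
export SPARK_HOME=/opt/spark
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"

# Run the tests from the project root so pytest sees both your package
# and the pyspark modules:
pytest tests/
```

An alternative is `pip install pyspark` into the test virtualenv, which puts the same modules on `sys.path` without any environment variables.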
Since I am porting some code from Oracle to Databricks, I have another specific question.In Oracle there's something called Virtual Private Database, VPD. It's a simple security feature used to generate a WHERE-clause which the system will add to a u...
@Roger Bieri​: In Databricks, you can use the UserDefinedFunction (UDF) feature to create a custom function that will be applied to a DataFrame. You can use this feature to add a WHERE clause to a DataFrame based on the user context. Here's an exampl...
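The VPD idea, a policy function that generates a predicate which the system appends to the user's query, can be sketched without Spark as plain string rewriting; in a Databricks view you would typically key off `current_user()` instead of a passed-in name. All names and predicates below are made up:

```python
# Hypothetical mapping from user to the row-level predicate a VPD
# policy function would generate.
USER_PREDICATES = {
    "alice@example.com": "region = 'EMEA'",
    "bob@example.com": "region = 'APAC'",
}

def apply_user_filter(base_query: str, user: str) -> str:
    """Append the per-user WHERE clause, like a VPD policy function.
    Unknown users get a deny-all predicate."""
    predicate = USER_PREDICATES.get(user, "1 = 0")
    return f"{base_query} WHERE {predicate}"

print(apply_user_filter("SELECT * FROM orders", "alice@example.com"))
# -> SELECT * FROM orders WHERE region = 'EMEA'
```

On Databricks the more robust equivalent is a view such as `CREATE VIEW orders_v AS SELECT * FROM orders WHERE region = (SELECT region FROM user_regions WHERE user = current_user())`, since the predicate then cannot be bypassed by querying the table directly (assuming table access control denies direct access).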
We have a Cursor in DB2 which reads in each loop data from 2 tables. At the end of each loop, after inserting the data to a target table, we update records related to each loop in these 2 tables before moving to the next loop. An indicative example i...