Data Engineering

Forum Posts

ros
by New Contributor III

merge vs MERGE INTO

From the 10.4 LTS version we have low shuffle merge, so merge is faster. But what about the MERGE INTO statement that we run in a SQL notebook in Databricks? Is there any performance difference when we use the Databricks PySpark ".merge" function vs Databricks...

  • 676 Views
  • 2 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable

Hi @Roshan RC, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

  • 2 kudos
1 More Replies
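
For readers weighing the two, here is a minimal sketch of the same upsert written both ways (the table and column names target_tbl, updates, and id are hypothetical); both forms resolve to the same Delta merge operation, so on 10.4 LTS+ low shuffle merge should apply to either:

from delta.tables import DeltaTable

updates_df = spark.table("updates")

# PySpark DeltaTable API
target = DeltaTable.forName(spark, "target_tbl")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# SQL MERGE INTO, the equivalent statement in a SQL notebook
spark.sql("""
    MERGE INTO target_tbl t
    USING updates s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")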
gdoron
by New Contributor

Using PySpark, can I write to an S3 path I don't have GetObject permission to?

After Spark finishes writing the DataFrame to S3, it seems like it checks the validity of the files it wrote with `getFileStatus`, which is `HeadObject` behind the scenes. What if I'm only granted write and list-objects permissions, but not GetObject? I...

  • 805 Views
  • 2 replies
  • 0 kudos
Latest Reply
Lakshay
Esteemed Contributor

It is not possible in my opinion.

  • 0 kudos
1 More Replies
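
To make the failure mode concrete, a small sketch (bucket and prefix are hypothetical): the write itself can succeed, but the S3A commit path calls getFileStatus, which issues a HeadObject request, and HeadObject is authorized by s3:GetObject, so a policy with only put/list permissions typically breaks here.

# Hypothetical bucket/prefix; s3:PutObject and s3:ListBucket cover the write,
# but the post-write validation (HeadObject) also needs s3:GetObject.
df = spark.range(10)
df.write.mode("overwrite").parquet("s3a://my-bucket/some/prefix/")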
Enthusiastic_Da
by New Contributor II

How to read columns dynamically using PySpark

I have a table called MetaData, and the columns needed in the select are stored in MetaData.columns. I would like to read the columns dynamically from MetaData.columns and create a view based on that.
csv_values = "col1, col2, col3, col4"
df = spark.crea...

  • 1466 Views
  • 0 replies
  • 0 kudos
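
A minimal sketch of one way to do this, assuming the metadata row holds a comma-separated column list (source_table and dynamic_view are hypothetical names):

# Read the comma-separated column list from the metadata table
csv_values = spark.table("MetaData").select("columns").first()[0]  # e.g. "col1, col2, col3, col4"
cols = [c.strip() for c in csv_values.split(",")]

# Select those columns dynamically and expose the result as a view
df = spark.table("source_table").select(*cols)
df.createOrReplaceTempView("dynamic_view")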
frank7
by New Contributor II

Resolved! Is it possible to write a pyspark dataframe to a custom log table in Log Analytics workspace?

I have a PySpark DataFrame that contains information about the tables that I have on a SQL database (creation date, number of rows, etc.). Sample data: { "Day":"2023-04-28", "Environment":"dev", "DatabaseName":"default", "TableName":"discount"...

  • 1629 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable

@Bruno Simoes: Yes, it is possible to write a PySpark DataFrame to a custom log table in a Log Analytics workspace using the Azure Log Analytics Workspace API. Here's a high-level overview of the steps you can follow: Create an Azure Log Analytics Works...

  • 1 kudos
1 More Replies
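
The reply is truncated, but the approach it names is the Log Analytics HTTP Data Collector API; a hedged sketch, with workspace_id, shared_key, and the log type as placeholders (collect() is only sensible for small DataFrames):

import base64, hashlib, hmac, json, requests
from datetime import datetime, timezone

workspace_id = "<workspace-id>"
shared_key = "<primary-key>"   # base64-encoded workspace key
log_type = "TableMetrics"      # lands as TableMetrics_CL in the workspace

# Serialize the (small) DataFrame as the JSON request body
body = json.dumps([row.asDict() for row in df.collect()])
rfc1123 = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")

# Sign the request per the Data Collector API's SharedKey scheme
string_to_sign = f"POST\n{len(body)}\napplication/json\nx-ms-date:{rfc1123}\n/api/logs"
signature = base64.b64encode(
    hmac.new(base64.b64decode(shared_key),
             string_to_sign.encode("utf-8"), hashlib.sha256).digest()).decode()

resp = requests.post(
    f"https://{workspace_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01",
    data=body,
    headers={"Content-Type": "application/json",
             "Authorization": f"SharedKey {workspace_id}:{signature}",
             "Log-Type": log_type,
             "x-ms-date": rfc1123})
resp.raise_for_status()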
Chalki
by New Contributor III

Resolved! Delta Table Merge statement is not accepting broadcast hint

I have a statement like this with PySpark:
target_tbl.alias("target")\
    .merge(stage_df.hint("broadcast").alias("source"), merge_join_expr)\
    .whenMatchedUpdateAll()\
    .whenNotMatchedInsertAll()\
    .w...

  • 1781 Views
  • 2 replies
  • 4 kudos
Latest Reply
Anonymous
Not applicable

Hi @Nikolay Chalkanov, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ans...

  • 4 kudos
1 More Replies
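
The resolution isn't shown in the preview; one commonly suggested variant, sketched here under the assumption that the merge source rejects DataFrame.hint, is to wrap the source in functions.broadcast() instead (the names target_tbl, stage_df, and merge_join_expr follow the question):

from pyspark.sql.functions import broadcast

# broadcast() attaches the hint to the source DataFrame itself
(target_tbl.alias("target")
    .merge(broadcast(stage_df).alias("source"), merge_join_expr)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())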
DeviJaviya
by New Contributor II

Trying to build a subquery in a Databricks notebook, similar to SQL, in a data frame with TOP(1)

Hello everyone, I am new to Databricks, so I am at the learning stage. It would be very helpful if someone could help me resolve the issue or, I should say, fix my code. I have built a query that fetches the data based on a CASE; in the CASE I have a ...

  • 1099 Views
  • 2 replies
  • 0 kudos
Latest Reply
DeviJaviya
New Contributor II

Hello Rishabh, thank you for your suggestion. We tried LIMIT 1, but the output values come out the same for all the dates, which is not correct.

  • 0 kudos
1 More Replies
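
LIMIT 1 applies to the whole result set, which matches the symptom above; the usual TOP(1)-per-group rewrite is a row_number() window, sketched here with hypothetical column names date_col and sort_col:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Rank rows within each date and keep only the first per group
w = Window.partitionBy("date_col").orderBy(col("sort_col").desc())
top1_per_date = (df.withColumn("rn", row_number().over(w))
                   .filter("rn = 1")
                   .drop("rn"))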
Khalil
by Contributor

Resolved! Pivot a DataFrame in Delta Live Table DLT

I want to apply a pivot on a DataFrame in DLT, but I'm getting the following warning: "Notebook:XXXX used `GroupedData.pivot` function that will be deprecated soon. Please fix the notebook." I get the same warning if I use the function collect. Is it risk...

  • 3443 Views
  • 6 replies
  • 5 kudos
Latest Reply
Khalil
Contributor

Thanks @Kaniz Fatma for your support. The solution was to do the pivot outside of views or tables, and the warning disappeared.

  • 5 kudos
5 More Replies
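
A hedged sketch of that fix, with hypothetical table and column names: build the pivoted DataFrame in plain PySpark outside the decorated function body, and have the DLT table simply return it.

import dlt

# Pivot performed outside the @dlt.table function body
pivoted_df = (spark.read.table("source_table")
                   .groupBy("id")
                   .pivot("category")
                   .sum("amount"))

@dlt.table(name="pivoted")
def pivoted():
    return pivoted_df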
Dean_Lovelace
by New Contributor III

What is the Pyspark equivalent of FSCK REPAIR TABLE?

I am using the delta format and occasionally get the following error: "xx.parquet referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement." FS...

  • 1280 Views
  • 3 replies
  • 4 kudos
Latest Reply
shan_chandra
Honored Contributor III

## Delta check when a file was added
%scala
(oldest-version-available to newest-version-available).map { version =>
  var df = spark.read.json(f"<delta-table-location>/_delta_log/$version%020d.json")
    .where("add is not null")
    .select("add.path")
  var ...

  • 4 kudos
2 More Replies
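
There is no separate DataFrame API for this; from PySpark the command is simply issued through spark.sql() (the table name is hypothetical), with DRY RUN to preview the files that would be dropped from the transaction log:

# Preview the unreadable files, then actually remove their entries
spark.sql("FSCK REPAIR TABLE my_db.my_table DRY RUN").show(truncate=False)
spark.sql("FSCK REPAIR TABLE my_db.my_table")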
Data_Engineer3
by Contributor II

How can I use the same Spark session from one notebook in another notebook in Databricks?

I want to use the same Spark session that was created in one notebook in another notebook in the same environment. For example, if some object (variable) got initialized in the first notebook, I need to use the same object in t...

  • 3651 Views
  • 4 replies
  • 5 kudos
Latest Reply
Manoj12421
Valued Contributor II

You can use %run and then use the location of the notebook - %run "/folder/notebookname"

  • 5 kudos
3 More Replies
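
A minimal sketch of that pattern (paths and names are hypothetical): %run executes the child notebook inline in the caller's context, so objects it defines, and the Spark session, are visible afterwards.

# In notebook /setup/init_objects:
#   shared_df = spark.table("some_table")

# In the calling notebook, in its own cell:
# %run "/setup/init_objects"

# In a later cell, shared_df is already defined by the child notebook:
# display(shared_df)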
Merchiv
by New Contributor III

Difference between Databricks and local pyspark split.

I have noticed some inconsistent behavior between calling the 'split' function on Databricks and on my local installation. Running it in a Databricks notebook gives
spark.sql("SELECT split('abc', ''), size(split('abc',''))").show()
So the string is split...

  • 1764 Views
  • 4 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable

@Ivo Merchiers: The behavior you are seeing is likely due to differences in the underlying version of Apache Spark between your local installation and Databricks. split() is a function provided by Spark's SQL functions, and different versions of Spa...

  • 0 kudos
3 More Replies
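
An illustrative repro, under the assumption the reply makes (a Spark version difference): split() delegates to Java's regex split, whose treatment of trailing empty matches has changed across versions, so the element count can differ between environments.

spark.sql("SELECT split('abc', '') AS parts, size(split('abc', '')) AS n").show()
# One environment may return ["a","b","c"] (n=3), another ["a","b","c",""] (n=4),
# depending on the Spark version in use.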
Nandini
by New Contributor II

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelize the execution of file copies in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

  • 7229 Views
  • 10 replies
  • 7 kudos
Latest Reply
Etyr
Contributor

If you have a Spark session, you can use Spark's hidden file system:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...

  • 7 kudos
9 More Replies
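
A fuller sketch of that approach (paths are hypothetical): the JVM Hadoop FileSystem works inside plain driver code without dbutils, though FileUtil.copy below runs on the driver rather than as a distributed Spark job.

jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
Path = jvm.org.apache.hadoop.fs.Path

# Copy every file under src to dst using Hadoop's FileUtil
src, dst = Path("dbfs:/source/dir"), Path("dbfs:/target/dir")
for status in fs.listStatus(src):
    jvm.org.apache.hadoop.fs.FileUtil.copy(
        fs, status.getPath(),   # source FS and file
        fs, dst,                # destination FS and directory
        False,                  # deleteSource
        conf)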
RateVan
by New Contributor II

Spark's last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop incoming streaming data, the last window will NEVER...

  • 1296 Views
  • 3 replies
  • 0 kudos
Latest Reply
RateVan
New Contributor II

No, the problem remains the same. The meaning doesn't change because you increased the timeout a little bit: the window did not close, and it does not close until a new message arrives.

  • 0 kudos
2 More Replies
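
An illustrative repro of the behavior under discussion (names are hypothetical): in append mode a tumbling window is emitted only once the watermark passes its end, and the watermark only advances when new events arrive, so the final window stays open if the stream goes quiet.

from pyspark.sql.functions import window, count

# Stand-in event stream with an event-time column 'ts'
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "ts"))

agg = (events
       .withWatermark("ts", "10 seconds")
       .groupBy(window("ts", "1 minute"))
       .agg(count("*").alias("n")))

q = (agg.writeStream
        .outputMode("append")   # rows appear only when a window closes
        .format("memory")
        .queryName("windows")
        .start())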
danniely
by New Contributor II

Pyspark RDD fails with pytest

When I call RDD APIs during pytest, it seems like the module "serializer.py" cannot find any other modules under pyspark. I've already looked this up on the internet, and it seems like the pyspark modules are not properly importing other referenced modules. I see ot...

  • 2996 Views
  • 1 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable

@hyunho lee: It sounds like you are encountering an issue with PySpark's serializer not being able to find the necessary modules during testing with pytest. One solution you could try is to set the PYTHONPATH environment variable to include the pat...

  • 2 kudos
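
A hedged sketch of that suggestion as a pytest conftest.py (the SPARK_HOME path is hypothetical): put PySpark's python directory, and its bundled py4j, on sys.path before the tests import pyspark, and share one local session across the test run.

import glob, os, sys
import pytest

os.environ.setdefault("SPARK_HOME", "/opt/spark")
spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
sys.path.insert(0, spark_python)
sys.path.extend(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))

from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder.master("local[2]")
               .appName("tests").getOrCreate())
    yield session
    session.stop()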
quakenbush
by Contributor

Is there something like Oracle's VPD-Feature in Databricks?

Since I am porting some code from Oracle to Databricks, I have another specific question. In Oracle there's something called Virtual Private Database (VPD). It's a simple security feature used to generate a WHERE clause which the system will add to a u...

  • 2293 Views
  • 1 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable

@Roger Bieri: In Databricks, you can use the UserDefinedFunction (UDF) feature to create a custom function that will be applied to a DataFrame. You can use this feature to add a WHERE clause to a DataFrame based on the user context. Here's an exampl...

  • 0 kudos
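
The reply's UDF example is truncated; a different but common Databricks pattern for VPD-style row filtering, sketched here with hypothetical table names, is a view that injects the predicate via current_user():

spark.sql("""
    CREATE OR REPLACE VIEW sales_filtered AS
    SELECT * FROM sales
    WHERE region = (SELECT region FROM user_region_map
                    WHERE user_name = current_user())
""")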
elgeo
by Valued Contributor II

Transform a SQL cursor using PySpark in Databricks

We have a cursor in DB2 which, in each loop, reads data from 2 tables. At the end of each loop, after inserting the data into a target table, we update the records related to that loop in these 2 tables before moving on to the next loop. An indicative example i...

  • 2699 Views
  • 2 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable

Hi @ELENI GEORGOUSI, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

  • 0 kudos
1 More Replies
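
The usual rewrite, sketched here with hypothetical table and column names, replaces the row-by-row cursor with one set-based insert followed by a single Delta MERGE for the bookkeeping updates:

from delta.tables import DeltaTable

# One set-based pass instead of a per-row loop
batch = (spark.table("table_a").alias("a")
         .join(spark.table("table_b").alias("b"), "key"))
batch.write.format("delta").mode("append").saveAsTable("target_table")

# Mark the processed rows in table_a in a single MERGE
(DeltaTable.forName(spark, "table_a").alias("t")
    .merge(batch.select("key").distinct().alias("s"), "t.key = s.key")
    .whenMatchedUpdate(set={"processed": "true"})
    .execute())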