Data Engineering

Forum Posts

by Constantine • Contributor III

11-04-2021 1:09:59 PM

11779 Views
3 replies
7 kudos

Resolved! collect_list by preserving order based on another variable - Spark SQL

I am using a Databricks SQL notebook to run these queries. I have a Python UDF like: %python from pyspark.sql.functions import udf from pyspark.sql.types import StringType, DoubleType, DateType def get_sell_price(sale_prices): return sale_...

Latest Reply

villi77
New Contributor II

10-14-2024 11:22:34 AM

7 kudos

I had a similar situation where I was trying to order the days of the week from Monday to Sunday. I saw solutions that use Python but wanted to do it all in SQL. My original attempt was to use: CONCAT_WS(',', COLLECT_LIST(DISTINCT t.LOAD_ORIG_...

2 More Replies

by famous_jt33 • New Contributor

06-16-2023 4:33:23 PM

1555 Views
2 replies
2 kudos

SQL UDFs for DLT pipelines

I am trying to implement a UDF for a DLT pipeline. I have seen the documentation stating that it is possible but I am getting an error after adding an SQL UDF to a cell in the notebook attached to the pipeline. The aim is to have the UDF in a separat...

Latest Reply

6502
New Contributor III

12-19-2023 9:52:13 AM

2 kudos

You can't. SQL support on a DLT pipeline cluster is limited compared to a normal notebook. You can still define a UDF in Python using, of course, a Python notebook. In this case, you can use the spark.sql() function to execute your original SQL cod...

1 More Replies

by Christine • Contributor II

04-26-2023 3:25:27 AM

28403 Views
4 replies
1 kudos

Resolved! Is it possible to import functions from a module in Workspace/Shared instead of Repos?

Hi, I am considering creating libraries for my Databricks notebooks, and found that it is possible to import functions from modules saved in Repos. Is it possible to move the .py files with the functions to Workspace/Shared and still import functions ...

Latest Reply

Anonymous
Not applicable

04-30-2023 11:45:20 PM

1 kudos

Hi @Christine Pedersen, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell ...

3 More Replies

by KarthikeyanB • New Contributor II

06-14-2023 8:01:59 AM

2476 Views
1 reply
2 kudos

Window function + Multiple simultaneous aggregations

Hi team, why is there no support for performing multiple aggregations together with a single window spec? That is, I don't want to specify each aggregation separately, and I don't want each aggregation to be performed as a separate piece of work. Or if there is ind...

Latest Reply

KarthikeyanB
New Contributor II

06-15-2023 3:00:23 AM

2 kudos

Hi @Kaniz Fatma, firstly, thank you very much for responding. Thank you for confirming that performing multiple aggregations using a single window spec does NOT evaluate the window spec separately each time. My earlier understanding was wrong.

by Bbren • New Contributor

05-11-2023 4:51:33 AM

3040 Views
2 replies
1 kudos

Resolved! Handling of millions of xml in json files

Hi all, I have some questions related to the handling of many small files and possible improvements and augmentations. We have many small XML files. These files are previously processed by another system that puts them in our data lake, but as an add...

Latest Reply

Anonymous
Not applicable

05-21-2023 11:56:20 PM

1 kudos

Hi @Bauke Brenninkmeijer, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ...

1 More Replies

by Tsar • New Contributor III

03-16-2022 1:27:26 PM

12130 Views
10 replies
12 kudos

Limitations with UDFs wrapping modules imported via Repos files?

We have been importing custom module wheel files from our AzDevOps repository. We are pushing to use the Databricks Repos arbitrary files to simplify this, but it is breaking our Spark UDF that wraps one of the functions in the library with a ModuleNo...

Latest Reply

Scott_B
New Contributor III

08-30-2022 11:45:08 AM

12 kudos

If your notebook is in the same Repo as the module, this should work without any modifications to the sys path. If your notebook is not in the same Repo as the module, you may need to ensure that the sys path is correct on all nodes in your cluster th...
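
As a rough illustration of the sys path point in this reply (not taken from the thread itself), a notebook that lives outside the Repo could add the module's directory to `sys.path` before importing from it. The directory below is a made-up placeholder and `ensure_on_sys_path` is a hypothetical helper name. This only changes the Python process it runs in, so it does not by itself cover the "all nodes" part of the reply.

```python
import sys


def ensure_on_sys_path(module_dir: str) -> bool:
    """Put module_dir at the front of sys.path if it is not already listed.

    Returns True when the entry was added, False when it was already present.
    """
    if module_dir in sys.path:
        return False
    sys.path.insert(0, module_dir)
    return True


# Hypothetical Repo location; replace with the directory that holds the module.
REPO_ROOT = "/Workspace/Repos/some.user@example.com/some-repo"
ensure_on_sys_path(REPO_ROOT)
```

After this runs, a plain `import` of a module stored under that directory should resolve in the same process, assuming the directory exists and is readable.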

9 More Replies

by Hubert-Dudek • Esteemed Contributor III

04-24-2023 7:24:48 AM

2855 Views
2 replies
9 kudos

Databricks Photon is a next-generation engine on the Databricks Lakehouse Platform that provides speedy query performance at a low cost. Its function...

Databricks Photon is a next-generation engine on the Databricks Lakehouse Platform that provides speedy query performance at a low cost. Its function coverage is growing, and UDF under Photon is coming, which can bring significant improvements in us...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

04-25-2023 4:29:14 AM

9 kudos

1 More Replies

by RichardDriven • New Contributor III

04-19-2023 7:39:02 PM

8150 Views
2 replies
1 kudos

How to apply a UDF to a property in an array of structs

I have a column that contains an array of structs as follows: "column": [ { "struct_field1": "struct_value", "struct_field2": "struct_value" }, { "struct_field1": "struct_value", "struct_field2": "struct_value" } ] I want to apply a UDF to each f...

by Pawan1 • New Contributor II

08-25-2022 3:38:34 AM

1839 Views
1 reply
2 kudos

Your administrator has forbidden Scala UDFs from being run on this cluster. How to enable access to Scala UDFs on an Azure Databricks cluster?

Hi all, when I tried to run a Scala UDF on an Azure Databricks 10.1 (includes Apache Spark 3.2.0, Scala 2.12) cluster, I was able to run the UDF. However, when I tried to run the same notebook on a 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12) cluster I ha...

Latest Reply

Debayan
Databricks Employee

04-24-2023 8:07:40 AM

2 kudos

Hi, are you trying this with High Concurrency clusters? Also, please tag @Debayan Mukherjee with your next response so that I will get notified.

by tytytyc26 • New Contributor II

04-02-2023 8:59:59 PM

2588 Views
3 replies
0 kudos

Resolved! Problem with accessing element using Pandas UDF in Image Processing

Hi everyone, I have been stuck on this for a very long time. I am not very familiar with using Spark for image processing. I was trying to resize images that are loaded into a Spark DF. However, it keeps throwing an error that I am not able to access the element...

Latest Reply

Anonymous
Not applicable

04-17-2023 6:48:04 AM

0 kudos

@Yan Chong Tan: The error you are facing is due to the fact that you are trying to access the attribute "width" of a string object in the resize_image function. Specifically, input_dim is a string object, but you are trying to access its width attr...
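
The failure mode described in this reply can be reproduced in a few lines of plain Python. The sketch below is not the thread's code: the `Dim` type, the `to_dim` helper, and the "WIDTHxHEIGHT" string format are all assumptions made for illustration, since the original `resize_image` function is not shown here.

```python
from typing import NamedTuple, Union


class Dim(NamedTuple):
    """Target size with real width and height attributes (hypothetical type)."""
    width: int
    height: int


def to_dim(input_dim: Union[str, Dim]) -> Dim:
    """Return a Dim, parsing an assumed 'WIDTHxHEIGHT' string when needed."""
    if isinstance(input_dim, Dim):
        return input_dim
    width, height = input_dim.lower().split("x")
    return Dim(int(width), int(height))


# Reproduce the reported problem: a str has no .width attribute.
try:
    "640x480".width
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

# Converting first gives an object whose .width can be read safely.
dim = to_dim("640x480")
```

Converting the value once at the top of the resizing function, whatever its real input format turns out to be, keeps the rest of the function free of type checks.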

2 More Replies

by sanjay • Valued Contributor II

03-13-2023 10:45:11 AM

8874 Views
3 replies
5 kudos

Resolved! PySpark UDF is taking long to process

Hi, I have a UDF which runs for each Spark DataFrame row, does some complex processing, and returns a string output. But it takes very long if the data is 15000 rows. I have configured the cluster with autoscaling, but it's not spinning up more servers. Please suggest h...

Latest Reply

Lakshay
Databricks Employee

03-14-2023 5:00:05 AM

5 kudos

Hi @Sanjay Jain, Python UDFs are generally slower to process because they run mostly in the driver, which can also lead to OOM errors on the driver. To resolve this issue, please consider the below: Use Spark built-in functions to do the same functionalit...

2 More Replies

by MikeJohnsonZa • New Contributor

02-02-2023 1:05:49 AM

2629 Views
3 replies
0 kudos

Resolved! Importing irregularly formatted json files

Hi, I'm importing a large collection of JSON files. The problem is that they are not what I would expect a well-formatted JSON file to be (although probably still valid): each file consists of only a single record that looks something like this (this i...

Latest Reply

jose_gonzalez
Databricks Employee

03-01-2023 10:37:07 AM

0 kudos

Hi @Michael Johnson, I would like to share the following notebook, which contains examples of how to process complex data types like JSON. Please check the following link and let us know if you still need help: https://docs.databricks.com/optimization...

2 More Replies

by Johan_Van_Noten • New Contributor III

12-06-2021 4:38:56 AM

17135 Views
19 replies
10 kudos

Resolved! Correlated column exception in SQL UDF when using UDF parameters.

Environment: Azure Databricks 10.1, including Spark 3.2.0. Scenario: I want to retrieve the average of a series of values between two timestamps, using a SQL UDF. The average is obviously just an example. In a real scenario, I would like to hide some additi...

Latest Reply

creastysomp
New Contributor II

01-27-2023 2:52:23 AM

10 kudos

Thanks for your suggestion. I want to do this in Spark SQL because there is no underlying SQL Server.

18 More Replies

by Ancil • Contributor II

01-18-2023 10:46:57 AM

1713 Views
1 reply
1 kudos

PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf. It works for 4 rows, but when I tried it with more than 4 rows I got the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was...

Latest Reply

Ancil
Contributor II

01-22-2023 5:33:17 PM

1 kudos

@Kaniz Fatma Can you please help me with pandas_udf? In the above scenario I used regular expressions, for which there is a Spark method, but I have other pandas_udfs with the same issue.

by Gim • Contributor

01-18-2023 6:31:02 AM

4734 Views
2 replies
1 kudos

Resolved! How to use SQL UDFs for Delta Live Table pipelines?

I've been searching for a way to use a SQL UDF for our DLT pipeline. In this case it is to convert a time duration string into INT seconds. How exactly do we use/apply UDFs in this case?

Latest Reply

daniel_sahal
Esteemed Contributor

01-18-2023 11:02:15 PM

1 kudos

@Gim You can create a Python UDF and then use it in SQL. https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cookbook.html#use-python-udfs-in-sql
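
A minimal sketch of what this suggestion could look like for the duration-to-seconds case in the question. The thread does not say what the duration strings look like, so the "HH:MM:SS" format is an assumption, and the names `duration_to_seconds` and `register_duration_udf` are hypothetical. The conversion is plain Python; the registration step relies on `spark.udf.register` and expects an existing SparkSession to be passed in.

```python
from typing import Optional


def duration_to_seconds(duration: Optional[str]) -> Optional[int]:
    """Convert an assumed 'HH:MM:SS' duration string to whole seconds.

    None is passed through so that SQL NULLs stay NULL.
    """
    if duration is None:
        return None
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def register_duration_udf(spark) -> None:
    """Register the function for use from SQL.

    `spark` is an existing SparkSession; the SQL-visible name is hypothetical.
    """
    spark.udf.register("duration_to_seconds", duration_to_seconds, "int")
```

Once registered in a Python notebook, a SQL query could call it as `duration_to_seconds(some_column)`, where `some_column` stands in for the real column name. Whether this fits a given pipeline setup should be checked against the linked cookbook page.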

1 More Replies