I'm trying to reuse a Python package to do a complex series of parsing steps that turn binary files into workable data in Delta format. I have made the first part (binary file parsing) work with a UDF:
asffileparser = F.udf(File()._parseBytes, AsfFileDelta.getSch...
Privileges:
SELECT: gives read access to an object.
CREATE: gives ability to create an object (for example, a table in a schema).
MODIFY: gives ability to add, delete, and modify data to or from an object.
USAGE: does not give any abilities, but is an add...
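For concreteness, a minimal sketch of granting a couple of these privileges from a notebook (the schema, table, and user names here are hypothetical, and table access control must be enabled on the workspace):
%python
# Hypothetical object and user names, for illustration only.
spark.sql("GRANT USAGE ON SCHEMA my_schema TO `user@example.com`")
spark.sql("GRANT SELECT ON TABLE my_schema.my_table TO `user@example.com`")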
Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with data from the result column. What is the easiest and most time-effective way ...
Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your dataframe is small, as you said (only about 1000 rows), you may consider using Pandas. Thanks.
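As a sketch of the suggested Pandas route (the column names path and result are assumptions, not confirmed by the post): since the dataframe is small, collect it to the driver and write the files in a plain Python loop.
%python
pdf = df.select("path", "result").toPandas()  # ~1000 rows, so collecting is fine

for row in pdf.itertuples(index=False):
    # Assumes the paths are driver-accessible, e.g. /dbfs/... on Databricks.
    with open(row.path, "w") as f:
        f.write(row.result)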
I have a quick question about %run <notebook path>. I am using the %run command to import functions from a notebook. It works fine when I run %run once. But when I run two %run commands, I lose the reference from the first %run. I get NameError when ...
I have created a UDF using:
%sql
CREATE OR REPLACE FUNCTION f_timestamp_max()....
And I've confirmed it works with:
%sql
select f_timestamp_max()
But when I try to use it in a window function (lead over partition), I get:
AnalysisException: Using SQL functi...
Afternoon everyone! I logged in hoping to see some suggestions, but I think maybe I need to reword the question a little. How can I create a UDF that converts '30000101' to a timestamp and then use it in a query like the one below?
%sql
select
field1,
field2,
nvl(some...
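One way to get that conversion without writing a custom UDF at all is the built-in to_timestamp with a yyyyMMdd pattern; a minimal PySpark sketch (the column name is made up):
%python
from pyspark.sql import functions as F

df = spark.createDataFrame([("30000101",)], ["end_dt"])
df = df.withColumn("end_ts", F.to_timestamp("end_dt", "yyyyMMdd"))
df.show(truncate=False)  # end_ts is 3000-01-01 00:00:00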
Hi there, I am trying to build a Delta Live Tables pipeline that ingests gzip-compressed archives as they're uploaded to S3. The archives contain 2 files in a proprietary format, and one is needed to determine how to parse the other. Once the file co...
I have a job that runs with no issues on Databricks Runtime 7.3 LTS. When I upgraded to 8.3 it fails with the error: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError'... SparkContext should only be created and accessed on the driv...
Adding to @Sean Owen's comments: the only reason this was working before is that the optimizer was evaluating it locally on the driver rather than creating a context on the executors and evaluating it there.
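A sketch of the failure mode being described (the actual failing code isn't shown, so this shape is assumed):
%python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Broken: the UDF body reaches for the SparkContext, which exists only on the driver.
def bad(x):
    return spark.sparkContext.appName  # raises SerializationError on executors

# Fixed: resolve the driver-side value first, then close over the plain value.
app_name = spark.sparkContext.appName

def good(x):
    return app_name

good_udf = F.udf(good, StringType())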
Thanks, I modified my code as per your suggestion and it worked perfectly. Thanks again for all your inputs.
dflist = spark.createDataFrame(list(a.columns), "string").toDF("Name")
dfg = dflist.filter(col('name').isin('ref_date')).count()
if dfg == 1:
    a = a.wi...
I am using a Databricks SQL notebook to run these queries. I have a Python UDF like:
%python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType, DateType

def get_sell_price(sale_prices):
    return sale_...
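Since the snippet is cut off, here is a hedged sketch of how such a UDF is typically registered so that %sql cells in the same notebook can call it (the function body below is a placeholder, not the poster's logic):
%python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def get_sell_price(sale_prices):
    return float(min(sale_prices))  # placeholder body; the original is truncated

# Register under a SQL-callable name.
spark.udf.register("get_sell_price", get_sell_price, DoubleType())
After registration it can be called from a %sql cell, e.g. select get_sell_price(array(1.0D, 2.0D)).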
I am running into the following error when I run a model-fitting process over databricks-connect. It looks like worker nodes are unable to access modules from the project's parent directory. Note that the program runs successfully up to this point; n...
Hello, I am currently working on time series forecasting with FBProphet. Since I have data with many time series groups (~3000), I use a @pandas_udf to parallelize the training.
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def forecast_netprofit(pr...
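For readers landing here, a minimal sketch of the grouped-map pattern the post describes (the schema, column names, and Prophet settings are assumptions, not the poster's code):
%python
import pandas as pd
from fbprophet import Prophet
from pyspark.sql.functions import pandas_udf, PandasUDFType

schema = "group_id string, ds timestamp, yhat double"  # assumed result schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def forecast_netprofit(pdf):
    # Each call receives the full pandas DataFrame for one group.
    model = Prophet()
    model.fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=30)
    forecast = model.predict(future)[["ds", "yhat"]]
    forecast["group_id"] = pdf["group_id"].iloc[0]
    return forecast[["group_id", "ds", "yhat"]]

results = df.groupby("group_id").apply(forecast_netprofit)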
Thank you for the answers. Unfortunately this did not solve the performance issue. What I did now is save the results into a table:
results.write.mode("overwrite").saveAsTable("db.results")
This is probably not the best solution, but after I do that ...
I have a delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column. My code is something like this:
def my_udf(data):
    pass  # placeholder body

udf_func = udf(my_udf, StringType())
data...
A plain Python UDF like that pays row-by-row serialization overhead between the JVM and the Python workers, so better not to use it for such a big dataset. What you need is a vectorized pandas UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
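A minimal sketch of the vectorized form (the column name and transformation here are placeholders):
%python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def my_vectorized_udf(data: pd.Series) -> pd.Series:
    # Operates on whole pandas Series batches instead of one row at a time.
    return data.astype(str)  # placeholder transformation

df = df.withColumn("new_column", my_vectorized_udf("data"))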
I have created custom UDFs that generate logs. These logs can be flushed by calling another API exposed by an internal layer. However, I want to call this API just after the execution of the UDF comes to an end. Is there any way of d...
@Krishna Kashiv Maybe ExecutorPlugin.java can help. It has all the methods you might require. Let me know whether it works. You need to implement the interface org.apache.spark.api.plugin.SparkPlugin and expose it as spark.plugins = com.abc.Imp...
def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared)
You can optionally set the return type of your UDF. The default return type is StringType.
from pyspark.sql.types import LongType

def squared_typed(s):
    return s * s

spark...
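Once registered, the function is callable by name from SQL; a quick usage check:
%python
spark.sql("SELECT squaredWithPython(4) AS squared").show()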