Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ancil
by Contributor II
  • 2850 Views
  • 3 replies
  • 1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf that works for one row, but when I try it with more than one row I get the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

I was testing, and your function is correct, so the error must be in the inputData type (it is all strings) or with result_json. Please also check the runtime version; I was using 11 LTS.
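
For reference, a minimal sketch of a Scalar Iterator pandas UDF whose output length matches its input length; the column name and the upper-casing transformation are illustrative assumptions, not taken from the original post.

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def process(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        # yield exactly one output Series per input batch, with the same length
        yield batch.str.upper()

# `spark` is the SparkSession provided by the Databricks notebook
df = spark.createDataFrame([("a",), ("b",)], ["value"])
df.select(process("value")).show()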

2 More Replies
wim_schmitz_per
by New Contributor II
  • 3900 Views
  • 2 replies
  • 2 kudos

Transforming/Saving Python Class Instances to Delta Rows

I'm trying to reuse a Python package to do a very complex series of steps that parse binary files into workable data in Delta format. I have made the first part (binary file parsing) work with a UDF: asffileparser = F.udf(File()._parseBytes, AsfFileDelta.getSch...

Latest Reply
Debayan
Databricks Employee
  • 2 kudos

Hi, did you try to follow the suggestion "Fix it by registering a custom IObjectConstructor for this class."? Also, could you please provide us the full error?

1 More Replies
Rishabh-Pandey
by Esteemed Contributor
  • 1135 Views
  • 1 reply
  • 5 kudos

Privileges: SELECT: gives read access to an object. CREATE: gives ability to create an object (for example, a table in a schema). MODIFY: gives ability to...

Privileges: SELECT: gives read access to an object. CREATE: gives ability to create an object (for example, a table in a schema). MODIFY: gives ability to add, delete, and modify data to or from an object. USAGE: does not give any abilities, but is an add...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 5 kudos

Thanks, sir.

Ancil
by Contributor II
  • 16530 Views
  • 11 replies
  • 1 kudos

Can anyone please suggest how to effectively loop through a PySpark DataFrame?

Scenario: I have a DataFrame with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write files to the file path, with data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your DataFrame is small, as you said (only about 1000 rows), you may consider using pandas. Thanks.
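
A minimal sketch of that pandas route, assuming columns named file_path and result (hypothetical names; the original post does not give them):

# Collect the small (~1000-row) DataFrame to the driver and write each file.
pdf = df.select("file_path", "result").toPandas()

for row in pdf.itertuples(index=False):
    with open(row.file_path, "w") as f:
        f.write(row.result)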

10 More Replies
VinayEmmadi
by New Contributor
  • 661 Views
  • 0 replies
  • 0 kudos

%run not working as expected

I have a quick question about %run <notebook path>. I am using the %run command to import functions from a notebook. It works fine when I run %run once. But when I run two %run commands, I lose the reference from the first %run. I get NameError when ...

BradSheridan
by Valued Contributor
  • 2225 Views
  • 1 reply
  • 0 kudos

Using a UDF in a window function

I have created a UDF using: %sql CREATE OR REPLACE FUNCTION f_timestamp_max()... And I've confirmed it works with: %sql select f_timestamp_max(). But when I try to use it in a window function (lead over partition), I get: AnalysisException: Using SQL functi...

Latest Reply
Debayan
Databricks Employee
  • 0 kudos

Hi, as of now Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions. Please refer to: https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html#parameters
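
For illustration, a supported window function (lead over a partition) built from built-in functions only; the table and column names are placeholders:

# lead() is an analytic window function, so no UDF is needed here.
# `spark` is the SparkSession provided by the Databricks notebook.
df = spark.sql("""
    SELECT id, ts,
           lead(ts) OVER (PARTITION BY id ORDER BY ts) AS next_ts
    FROM events
""")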

BradSheridan
by Valued Contributor
  • 1481 Views
  • 1 reply
  • 0 kudos

Resolved! Using a UDF in %sql?

Afternoon everyone! I logged in hoping to see some suggestions, but I think maybe I need to reword the question a little. How can I create a UDF that converts '30000101' to a timestamp and then use it in a query like the one below? %sql select field1, field2, nvl(some...

Latest Reply
BradSheridan
Valued Contributor
  • 0 kudos

Got it working (but going to post a new question momentarily): I needed to use timestamp(date '3000-01-01') instead of to_timestamp.
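
A sketch of that working expression in context, with hypothetical table and field names:

# nvl() falls back to a fixed far-future timestamp when the field is NULL.
df = spark.sql("""
    SELECT field1, field2,
           nvl(some_ts_field, timestamp(date '3000-01-01')) AS ts
    FROM my_table
""")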

NicolasEscobar
by New Contributor II
  • 9532 Views
  • 7 replies
  • 5 kudos

Resolved! Job fails after runtime upgrade

I have a job running with no issues on Databricks Runtime 7.3 LTS. When I upgraded to 8.3, it fails with the error: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError'... SparkContext should only be created and accessed on the driv...

Latest Reply
User16873042682
New Contributor II
  • 5 kudos

Adding to @Sean Owen's comments: the only reason this works is that the optimizer is evaluating this locally rather than creating a context on the executors and evaluating it there.

6 More Replies
cuteabhi32
by New Contributor III
  • 42810 Views
  • 11 replies
  • 1 kudos

Resolved! Trying to check whether a column exists in a DataFrame; if it does not, supply NULL, otherwise use the column itself, via a UDF

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
df1 = spark.read.form...

Latest Reply
cuteabhi32
New Contributor III
  • 1 kudos

Thanks, I modified my code as per your suggestion and it worked perfectly. Thanks again for all your inputs.
dflist = spark.createDataFrame(list(a.columns), "string").toDF("Name")
dfg = dflist.filter(col('name').isin('ref_date')).count()
if dfg == 1:
    a = a.wi...
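
An equivalent, somewhat more direct check against df.columns (using the column name from the thread; "a" is the DataFrame as in the accepted code):

from pyspark.sql.functions import lit

# Add ref_date as a NULL column only when it is missing.
if "ref_date" not in a.columns:
    a = a.withColumn("ref_date", lit(None).cast("string"))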

10 More Replies
RRO
by Contributor
  • 31902 Views
  • 6 replies
  • 7 kudos

Resolved! Performance of a PySpark DataFrame is very slow after using a @pandas_udf

Hello, I am currently working on time series forecasting with FBProphet. Since I have data with many time series groups (~3000), I use a @pandas_udf to parallelize the training. @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def forecast_netprofit(pr...

Latest Reply
RRO
Contributor
  • 7 kudos

Thank you for the answers. Unfortunately this did not solve the performance issue. What I did now is save the results into a table: results.write.mode("overwrite").saveAsTable("db.results"). This is probably not the best solution, but after I do that ...
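
For later readers, a sketch of the grouped-map pattern using the newer applyInPandas API (which replaces PandasUDFType.GROUPED_MAP); the grouping column and the pass-through body are stand-ins for the FBProphet training code:

import pandas as pd

def forecast_netprofit(pdf: pd.DataFrame) -> pd.DataFrame:
    # placeholder for the per-group Prophet fit/predict logic
    return pdf

results = df.groupBy("group_id").applyInPandas(forecast_netprofit, schema=df.schema)
results.write.mode("overwrite").saveAsTable("db.results")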

5 More Replies
Constantine
by Contributor III
  • 2484 Views
  • 1 reply
  • 4 kudos

Resolved! How to process a large delta table with a UDF?

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column. My code is something like this:
def my_udf(data):
    return pass

udf_func = udf(my_udf, StringType())
data...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

That UDF code will run on the driver, so better not to use it for such a big dataset. What you need is a vectorized pandas UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
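
A minimal sketch of the vectorized pandas UDF the reply points to; the column names and the string cast are assumptions, not from the original thread:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def my_vectorized_udf(s: pd.Series) -> pd.Series:
    # operates on a whole batch (pandas Series) at once instead of row by row
    return s.astype(str)

data = data.withColumn("new_col", my_vectorized_udf("some_col"))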

sarosh
by New Contributor
  • 8234 Views
  • 2 replies
  • 1 kudos

ModuleNotFoundError / SerializationError when executing over databricks-connect

I am running into the following error when I run a model-fitting process over databricks-connect. It looks like worker nodes are unable to access modules from the project's parent directory. Note that the program runs successfully up to this point; n...

Latest Reply
Manjunath
Databricks Employee
  • 1 kudos

@Sarosh Ahmad, could you try adding the zip of the module via addPyFile, like below: spark.sparkContext.addPyFile("src.zip")
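
A sketch of how that zip might be produced and shipped, assuming the project sources live in a local src/ package directory (an assumption; adjust to the real layout):

import shutil

# Zip the src/ package so its top-level directory sits at the archive root,
# then distribute it; the zip root is added to sys.path on the executors.
shutil.make_archive("src", "zip", root_dir=".", base_dir="src")
spark.sparkContext.addPyFile("src.zip")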

1 More Replies
krishnakash
by New Contributor II
  • 3998 Views
  • 4 replies
  • 4 kudos

Resolved! Is there any way of determining the last stage of Spark SQL application execution?

I have created custom UDFs that generate logs. These logs can be flushed by calling another API exposed by an internal layer. However, I want to call this API just after the execution of the UDF comes to an end. Is there any way of d...

Latest Reply
User16763506586
Contributor
  • 4 kudos

@Krishna Kashiv, maybe ExecutorPlugin.java can help; it has all the methods you might require. You need to implement the interface org.apache.spark.api.plugin.SparkPlugin and expose it as spark.plugins = com.abc.Imp... Let me know whether it works.

3 More Replies
User16826994223
by Honored Contributor III
  • 1574 Views
  • 1 reply
  • 0 kudos

Resolved! How do I register a UDF in SQL?

Can I get an example of how to create a UDF in Python and use it in SQL?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared)

You can optionally set the return type of your UDF. The default return type is StringType.

from pyspark.sql.types import LongType

def squared_typed(s):
    return s * s

spark...
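
A quick usage example of the registered function from SQL (the literal 4 is just for illustration):

spark.sql("SELECT squaredWithPython(4) AS squared").show()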
