Data Engineering

Forum Posts

wim_schmitz_per
by New Contributor II
  • 1897 Views
  • 2 replies
  • 2 kudos

Transforming/Saving Python Class Instances to Delta Rows

I'm trying to reuse a Python package to do a very complex series of parsing binary files into workable data in Delta format. I have made the first part (binary file parsing) work with a UDF: asffileparser = F.udf(File()._parseBytes, AsfFileDelta.getSch...

Latest Reply
Debayan
Esteemed Contributor III
  • 2 kudos

Hi, did you try to follow "Fix it by registering a custom IObjectConstructor for this class."? Also, could you please provide us with the full error?

1 More Reply
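A minimal sketch of the pattern discussed in this thread: registering a plain Python parser as a UDF with an explicit struct return schema so the parsed rows can be written to Delta. The parse_bytes function, field names, and paths below are hypothetical stand-ins for the poster's File()._parseBytes and AsfFileDelta schema.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Hypothetical return schema standing in for AsfFileDelta's schema
    parsed_schema = StructType([
        StructField("file_name", StringType()),
        StructField("record_count", LongType()),
    ])

    def parse_bytes(raw):
        # Placeholder for the real binary-parsing logic
        return ("example.asf", len(raw) if raw else 0)

    parse_udf = F.udf(parse_bytes, parsed_schema)

    binary_df = spark.read.format("binaryFile").load("/tmp/raw_files")   # hypothetical source path
    parsed_df = binary_df.withColumn("parsed", parse_udf(F.col("content")))
    parsed_df.select("parsed.*").write.format("delta").mode("append").save("/tmp/parsed_delta")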
Rishabh264
by Honored Contributor II
  • 649 Views
  • 1 replies
  • 4 kudos

Privileges. SELECT: gives read access to an object. CREATE: gives ability to create an object (for example, a table in a schema). MODIFY: gives ability to...

Privileges. SELECT: gives read access to an object. CREATE: gives ability to create an object (for example, a table in a schema). MODIFY: gives ability to add, delete, and modify data to or from an object. USAGE: does not give any abilities, but is an add...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 4 kudos

thanks sir

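To make the privileges above concrete, a short sketch using hypothetical object and principal names (my_schema, my_table, data_team); exact syntax varies between legacy table ACLs and Unity Catalog.

    # Hypothetical schema/table/group names, shown via spark.sql for illustration only
    spark.sql("GRANT USAGE ON SCHEMA my_schema TO `data_team`")            # USAGE: no ability by itself, needed to reference the schema
    spark.sql("GRANT SELECT ON TABLE my_schema.my_table TO `data_team`")   # read access
    spark.sql("GRANT CREATE ON SCHEMA my_schema TO `data_team`")           # create objects in the schema
    spark.sql("GRANT MODIFY ON TABLE my_schema.my_table TO `data_team`")   # add, delete, and modify data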
Ancil
by Contributor II
  • 9487 Views
  • 11 replies
  • 1 kudos

Can anyone please suggest how we can effectively loop through a PySpark DataFrame?

Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write files to the file path, with data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners, try to avoid looping over a PySpark DataFrame. If your dataframe is small, as you said, only about 1000 rows, you may consider using Pandas. Thanks.

10 More Replies
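A minimal sketch of the approach suggested in the latest reply, assuming the dataframe really is small (around 1000 rows) and has hypothetical columns file_path and result: collect it to pandas on the driver and write each row's data out.

    # toPandas() collects everything to the driver, so this is only sensible for small dataframes
    rows = df.select("file_path", "result").toPandas()

    for _, row in rows.iterrows():
        # Hypothetical: paths are assumed to be writable from the driver
        with open(row["file_path"], "w") as f:
            f.write(str(row["result"]))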
VinayEmmadi
by New Contributor
  • 377 Views
  • 0 replies
  • 0 kudos

%run not working as expected

I have a quick question about %run <notebook path>. I am using the %run command to import functions from a notebook. It works fine when I run %run once. But when I run two %run commands, I lose the reference from the first %run. I get NameError when ...

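Since the post above has no replies yet, only a hedged note for reference: the documented pattern is one %run per cell, with nothing else in that cell; the notebook paths and function names below are hypothetical.

    # Cell 1
    %run ./utils_notebook_a

    # Cell 2
    %run ./utils_notebook_b

    # Cell 3 - names defined in both notebooks should now be visible
    print(helper_from_a(), helper_from_b())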
BradSheridan
by Valued Contributor
  • 1383 Views
  • 1 replies
  • 0 kudos

Using a UDF in a window function

I have created a UDF using: %sql CREATE OR REPLACE FUNCTION f_timestamp_max().... And I've confirmed it works with: %sql select f_timestamp_max(). But when I try to use it in a window function (lead over partition), I get: AnalysisException: Using SQL functi...

Latest Reply
Debayan
Esteemed Contributor III
  • 0 kudos

Hi, as of now, Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions. Please refer to: https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html#parameters

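One common workaround for this limitation, sketched with hypothetical column names: evaluate the SQL function as an ordinary column first, then apply the window function (lead over a partition) to that column.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Materialize the function result as a plain column first
    with_ts = df.withColumn("ts_max", F.expr("f_timestamp_max()"))

    # Hypothetical partition/order columns
    w = Window.partitionBy("account_id").orderBy("event_time")
    result = with_ts.withColumn("next_ts", F.lead("ts_max").over(w))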
BradSheridan
by Valued Contributor
  • 807 Views
  • 1 replies
  • 0 kudos

Resolved! Using a UDF in %sql?

Afternoon everyone! I logged in hoping to see some suggestions but think maybe I need to reword the question a little. How can I create a UDF that converts '30000101' to a timestamp and then use it in a query like below? %sql select field1, field2, nvl(some...

Latest Reply
BradSheridan
Valued Contributor
  • 0 kudos

Got it working (but going to post a new question momentarily): I needed to use timestamp(date '3000-01-01') instead of to_timestamp.

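A sketch of that fix in context, with hypothetical table and column names; the key piece is the timestamp(date '3000-01-01') literal used inside nvl.

    spark.sql("""
        SELECT
            field1,
            field2,
            nvl(some_timestamp_col, timestamp(date '3000-01-01')) AS some_timestamp_col
        FROM my_table
    """).show()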
NicolasEscobar
by New Contributor II
  • 6617 Views
  • 8 replies
  • 5 kudos

Resolved! Job fails after runtime upgrade

I have a job running with no issues on Databricks Runtime 7.3 LTS. When I upgraded to 8.3 it fails with the error: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError'... SparkContext should only be created and accessed on the driv...

Latest Reply
User16873042682
New Contributor II
  • 5 kudos

Adding to @Sean Owen's comments: the only reason this is working is that the optimizer is evaluating this locally rather than creating a context on executors and evaluating it.

7 More Replies
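The rule behind that error, sketched with hypothetical names: a UDF body must not touch spark or SparkContext on newer runtimes, so resolve anything that needs the session on the driver and close over the plain Python value instead.

    from pyspark.sql import functions as F

    # Anti-pattern: referencing the session inside the UDF body
    # bad_udf = F.udf(lambda x: spark.conf.get("spark.app.name") + x)

    # Instead, evaluate on the driver and capture the result
    app_name = spark.conf.get("spark.app.name")
    good_udf = F.udf(lambda x: f"{app_name}:{x}")

    df = spark.range(3).withColumn("tagged", good_udf(F.col("id").cast("string")))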
cuteabhi32
by New Contributor III
  • 26248 Views
  • 11 replies
  • 1 kudos

Resolved! Trying to check if a column exists in a dataframe or not; if not, I have to give NULL, and if yes, I need to give the column itself, using a UDF

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
df1 = spark.read.form...

Latest Reply
cuteabhi32
New Contributor III
  • 1 kudos

Thanks, I modified my code as per your suggestion and it worked perfectly. Thanks again for all your inputs.
dflist = spark.createDataFrame(list(a.columns), "string").toDF("Name")
dfg = dflist.filter(col('name').isin('ref_date')).count()
if dfg == 1:
    a = a.wi...

10 More Replies
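A simpler variant of the accepted approach, sketched with the thread's dataframe a and the hypothetical column name ref_date: check df.columns directly and add a NULL column only when it is missing, no UDF needed.

    from pyspark.sql import functions as F
    from pyspark.sql.types import DateType

    if "ref_date" not in a.columns:
        # Hypothetical type; cast the NULL so downstream code sees a consistent schema
        a = a.withColumn("ref_date", F.lit(None).cast(DateType()))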
Constantine
by Contributor III
  • 6195 Views
  • 3 replies
  • 7 kudos

Resolved! collect_list by preserving order based on another variable - Spark SQL

I am using a Databricks SQL notebook to run these queries. I have a Python UDF like:
%python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType, DateType

def get_sell_price(sale_prices):
    return sale_...

Latest Reply
Kaniz
Community Manager
  • 7 kudos

Hi @John Constantine, just a friendly follow-up. Do you still need help, or did the above responses help you find the solution? Please let us know.

2 More Replies
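One common way to make collect_list order-stable, sketched with hypothetical columns group_id, order_col and sale_price: collect structs, sort the array by the ordering field, then project the value field back out.

    from pyspark.sql import functions as F

    ordered = (
        df.groupBy("group_id")
          .agg(F.sort_array(F.collect_list(F.struct("order_col", "sale_price"))).alias("pairs"))
          # Field access on an array of structs yields an array of that field, still in sorted order
          .withColumn("prices_in_order", F.col("pairs.sale_price"))
          .drop("pairs")
    )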
sarosh
by New Contributor
  • 5433 Views
  • 3 replies
  • 1 kudos

ModuleNotFoundError / SerializationError when executing over databricks-connect

I am running into the following error when I run a model fitting process over databricks-connect. It looks like worker nodes are unable to access modules from the project's parent directory. Note that the program runs successfully up to this point; n...

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @Sarosh Ahmad, just a friendly follow-up. Do you still need help, or did the above responses help you find the solution? Please let us know.

2 More Replies
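One thing worth trying for errors like the one above (a sketch, not a confirmed fix): ship the project's package to the cluster explicitly so executor processes can import it; the archive path below is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical zip of the project's Python package, built locally
    spark.sparkContext.addPyFile("/path/to/my_project.zip")
    # After this, modules inside my_project.zip resolve on the executors as well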
RRO
by Contributor
  • 21669 Views
  • 7 replies
  • 7 kudos

Resolved! Performance for pyspark dataframe is very slow after using a @pandas_udf

Hello, I am currently working on time series forecasting with FBProphet. Since I have data with many time series groups (~3000), I use a @pandas_udf to parallelize the training.
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def forecast_netprofit(pr...

Latest Reply
RRO
Contributor
  • 7 kudos

Thank you for the answers. Unfortunately this did not solve the performance issue. What I did now is I saved the results into a table: results.write.mode("overwrite").saveAsTable("db.results"). This is probably not the best solution, but after I do that ...

6 More Replies
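A sketch of the same grouped-map pattern written with the newer applyInPandas API; the group/column names, the schema, and the stand-in forecast logic are all hypothetical.

    import pandas as pd

    def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Placeholder for the per-group FBProphet fit/predict logic
        out = pdf[["group_id", "ds"]].copy()
        out["yhat"] = pdf["y"].mean()
        return out

    schema = "group_id string, ds date, yhat double"
    results = df.groupBy("group_id").applyInPandas(forecast_group, schema=schema)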
Constantine
by Contributor III
  • 1235 Views
  • 1 replies
  • 4 kudos

Resolved! How to process a large delta table with a UDF?

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column. My code is something like this:
def my_udf(data):
    pass

udf_func = udf(my_udf, StringType())
data...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

That UDF code will run on the driver, so better not to use it for such a big dataset. What you need is a vectorized pandas UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html

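A minimal sketch of the vectorized pandas UDF suggested in the reply, with a hypothetical transformation and column names; the function receives whole pandas Series batches instead of one value at a time.

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def normalize(col: pd.Series) -> pd.Series:
        # Hypothetical per-batch logic standing in for my_udf
        return col.astype(str).str.strip().str.upper()

    data = data.withColumn("new_column", normalize("source_column"))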
krishnakash
by New Contributor II
  • 2263 Views
  • 6 replies
  • 4 kudos

Resolved! Is there any way of determining last stage of SparkSQL Application Execution?

I have created custom UDFs that generate logs. These logs can be flushed by calling another API which is exposed by an internal layer. However, I want to call this API just after the execution of the UDF comes to an end. Is there any way of d...

Latest Reply
User16763506586
Contributor
  • 4 kudos

@Krishna Kashiv Maybe ExecutorPlugin.java can help. It has all the methods you might require. Let me know if it works or not. You need to implement the interface org.apache.spark.api.plugin.SparkPlugin and expose it as spark.plugins = com.abc.Imp...

5 More Replies
User16826994223
by Honored Contributor III
  • 916 Views
  • 1 replies
  • 0 kudos

Resolved! How do I register a UDF in SQL?

Can I get an example of how to create a UDF in Python and use it in SQL?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared)

You can optionally set the return type of your UDF. The default return type is StringType.

from pyspark.sql.types import LongType

def squared_typed(s):
    return s * s

spark...

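To close the loop on the original question (calling the registered UDF from SQL), a small usage sketch:

    from pyspark.sql.types import LongType

    spark.udf.register("squaredWithPython", lambda s: s * s, LongType())
    spark.sql("SELECT id, squaredWithPython(id) AS id_squared FROM range(5)").show()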