Can someone explain why the code below is throwing an error? My intuition tells me it's my Spark version (3.2.1), but I'd like confirmation:

```python
d = {'key': ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h'],
     'data': [1, 2, 3, 4, 5, 6, 7, 8]}
x = ps.DataFrame(d)
x[x['...
```
@pjp94 - The error indicates that this pandas-on-Spark version does not implement the method below:

`pd.Series.duplicated()`

The next step is to use DataFrame methods such as `distinct`, `groupBy`, or `dropDuplicates` to work around this (see the sketch below).
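Since `duplicated()` isn't available, here is a minimal sketch of a workaround using the methods mentioned above, assuming the goal was to isolate duplicate keys:

```python
# Sketch of a workaround for the missing duplicated() method in pandas-on-Spark.
import pyspark.pandas as ps

d = {'key': ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h'],
     'data': [1, 2, 3, 4, 5, 6, 7, 8]}
x = ps.DataFrame(d)

# Option 1: drop duplicates directly in pandas-on-Spark
deduped = x.drop_duplicates(subset='key')

# Option 2: switch to the Spark DataFrame API, as suggested above
deduped_spark = x.to_spark().dropDuplicates(['key'])
deduped_spark.show()
```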
TimeoutException: Stream Execution thread for stream [id = xxx runId = xxxx] failed to stop within 15000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread. I have a data...
@User_1611 - could you please try the following?
• Reduce the number of streaming queries running on the same cluster.
• Make sure your code does not try to re-trigger/start an active streaming query (a sketch follows below).
• Make sure to collect the thread dumps if this error hap...
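For the second point, here is a minimal sketch of the guard, assuming a streaming DataFrame `df`; the query name and paths are hypothetical placeholders:

```python
# Guard against re-triggering an already-active streaming query.
active = [q.name for q in spark.streams.active]
if "my_query" not in active:  # "my_query" is a hypothetical query name
    (df.writeStream
        .queryName("my_query")
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/my_query")
        .start("/mnt/output/my_query"))
```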
I have 50k+ parquet files in Azure Data Lake, and I have a mount point as well. I need to read all the files and load them into a dataframe. I have around 2 billion records in total, and not all the files have all the columns; the column order may di...
@Shan1 - This could be due to the files having columns that differ by data type, e.g. integer vs. long, or boolean vs. integer. It can be resolved with mergeSchema=false (see the sketch below). Please refer to this code: https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31e...
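A hedged sketch of reading with an explicit schema, so per-file differences in column order and type don't trigger a merge failure; the column names and mount path are hypothetical:

```python
# Read heterogeneous parquet files with a fixed schema instead of merging.
from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("id", LongType(), True),      # hypothetical columns
    StructField("name", StringType(), True),
])

df = (spark.read
      .schema(schema)                  # skip per-file schema inference/merging
      .option("mergeSchema", "false")  # don't attempt to merge file footers
      .parquet("/mnt/datalake/parquet"))
```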
Hi everyone, I am using DBR version 13 and managed tables in a custom catalog; the location of the table is AWS S3. I am running the notebook on a single-user cluster. I am facing a MalformedInputException while saving data to tables or reading it. When I am running my noteboo...
@Kaniz The issue was resolved as soon as I deployed to a multinode dev cluster. The issue only occurs on single-user clusters. It looks like a limitation of running all updates on one node of a distributed system.
There is no resource to create an All-Purpose Cluster, but I need one. Does that mean I should create it via Terraform or DBX and reference it, which I'd prefer not to do?
Is there a way to get a child job run's status and show the result within the parent notebook execution? Here is the case: I have a master notebook and several child notebooks. As a result, I want to see which notebook failed. For example, notebook job s...
Hey there! Thanks a bunch for being part of our awesome community! We love having you around and appreciate all your questions. Take a moment to check out the responses – you'll find some great info. Your input is valuable, so pick the best solution...
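For the original question, one common pattern (a sketch, not the only way): `dbutils.notebook.run` raises an exception when the child run fails, so the parent can catch it and record which notebook failed. The notebook paths below are hypothetical.

```python
# Run each child notebook and record success/failure in the parent notebook.
# dbutils.notebook.run(path, timeout_seconds) raises if the child run fails.
results = {}
for child in ["./child_notebook_a", "./child_notebook_b"]:  # hypothetical paths
    try:
        results[child] = dbutils.notebook.run(child, 3600)
    except Exception as e:
        results[child] = f"FAILED: {e}"

for name, status in results.items():
    print(name, "->", status)
```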
I have a notebook where I read multiple tables from Delta Lake (let's say the schema is db), and after that I did some transformations (image enclosed) using all these tables, with transformations like join, filter, etc. After the transformation and writin...
I have a DLT pipeline with a table that I want to contain the running aggregation (for the sake of simplicity, let's assume it's a count) for each value of some key column, using a session window. The input table goes back several years, and to clean up aggreg...
Hi @vroste ,
• To configure the update output mode for a running aggregation in Delta Live Tables (DLT), use the outputMode option when writing the DLT table (a sketch follows below).
• By default, DLT writes data in complete mode, which outputs the complete result table after...
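A hedged sketch of a session-window count in DLT, assuming a streaming source table `events` with `event_time` and `key` columns (all names hypothetical); the watermark bounds how much aggregation state is kept:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="session_counts")
def session_counts():
    return (
        dlt.read_stream("events")               # hypothetical source table
        .withWatermark("event_time", "1 hour")  # expire old session state
        .groupBy(F.session_window("event_time", "30 minutes"), "key")
        .count()
    )
```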
At this moment, I'm working on removing legacy global and cluster-named init scripts, since they will be disabled for all workspaces on 01 Sept. I'm facing a strange problem regarding moving init scripts from DBFS to the Workspace location...
Using the new CLI v0.214, uploading a ".sh" file works fine:
`databricks workspace import --overwrite --format AUTO --file init_setup /init/user/job/init_setup`
Hi, I'm facing an issue while writing to a Salesforce sandbox from Databricks. I have installed the "spark-salesforce_2.12-1.1.4" library and my code is as follows:

```python
df_newLeads.write\
    .format("com.springml.spark.salesforce")\
    .option("username...
```
I made a function that used the code below and returned url, connectionProperties, sfwrite:

```python
url = "https://login.salesforce.com/"
dom = url.split('//')[1].split('.')[0]
session_id, instance = SalesforceLogin(username=connectionProperties['name'], password...
```
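A hedged completion of the write call above: the option names follow the springml spark-salesforce library as I recall it, so treat them as assumptions to verify; the sandbox URL and secret scope are placeholders.

```python
# Hedged sketch: write leads to a Salesforce *sandbox*.
# Option names are assumptions based on the springml library; verify them.
(df_newLeads.write
    .format("com.springml.spark.salesforce")
    .option("username", dbutils.secrets.get("sfdc", "username"))  # hypothetical secret scope
    .option("password", dbutils.secrets.get("sfdc", "password"))  # password + security token
    .option("login", "https://test.salesforce.com")               # sandbox endpoint, not login.salesforce.com
    .option("sfObject", "Lead")
    .save())
```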
Hi everyone, we have a Databricks workspace in an AWS account that we need to migrate to a new AWS account. The workspace has a lot of managed tables, workflows, saved queries, and notebooks which need to be migrated, so I'm looking for an efficient approach t...
For a streamlined migration of your Databricks workspace from one AWS account to another, start by exporting notebook, workflow, and saved query configurations using Databricks REST API or CLI. Employ Deep Clone or Delta Sharing for managed table dat...
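For the managed-table data, here is a minimal sketch of a Deep Clone per table; the catalog, schema, and table names are hypothetical, and it assumes the new workspace can read the old account's storage:

```python
# Copy one table's data and metadata into the new account via Delta Deep Clone.
spark.sql("""
    CREATE TABLE IF NOT EXISTS new_catalog.db.events
    DEEP CLONE old_catalog.db.events
""")
```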
Hi all, as part of an ongoing exercise to refactor existing T-SQL code into Databricks, we've stumbled into an issue that we can't seem to overcome through Spark SQL. Currently we use dynamic SQL to loop through a number of tables, where we use parame...
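One way to reproduce a dynamic-SQL loop in Databricks is to drive spark.sql from Python; a sketch with hypothetical table names and archive logic:

```python
# Replace a T-SQL dynamic-SQL loop with a Python loop over spark.sql calls.
tables = ["sales", "orders", "customers"]  # hypothetical table list

for t in tables:
    spark.sql(f"""
        INSERT INTO db.{t}_archive
        SELECT * FROM db.{t}
        WHERE load_date = current_date()
    """)
```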
Hi - we are migrating to Unity Catalog 14.3 LTS and have seen a change in behavior using withColumnRenamed. There is a COLUMN_ALREADY_EXISTS error on the join key, even though the column being renamed is a different column. The joined DataFrame do...
Hey @ksamborn, I can think of 2 solutions: rename the clashing column in df_2 before joining (a complete sketch follows below), or alias the two DataFrames:

```python
df_1_alias = df_1.alias("t1")
df_2_alias = df_2.alias("t2")
join_df = df_1_alias.join(df_2_alias, df_1_alias.key == df_2_alias.key)
rename_df = join_df.withColumnRenam...
```
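A complete sketch of the first option, assuming the clash comes from a non-key column (here called `value`, which is hypothetical): rename it in df_2 before joining, so the key column appears only once.

```python
# Rename the clashing non-key column in df_2 before the join; joining with
# on="key" keeps a single "key" column in the result.
df_2_renamed = df_2.withColumnRenamed("value", "value_2")  # "value" is hypothetical
join_df = df_1.join(df_2_renamed, on="key")
```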
Hello, I'm trying to set up a notebook for tests or data quality checks (the name is not important). I basically read a table (the output of the ETL process - the actual data). Then I read another table and do the calculation in the notebook (the expected data). I'm stuc...
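One simple pattern for the comparison step (a sketch, assuming the two tables were read into `actual` and `expected` DataFrames with identical schemas):

```python
# Symmetric difference: rows present in one DataFrame but not the other.
only_in_actual = actual.exceptAll(expected)
only_in_expected = expected.exceptAll(actual)

mismatches = only_in_actual.count() + only_in_expected.count()
assert mismatches == 0, f"{mismatches} mismatching rows between actual and expected"
```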
Hello, I want to create a SQL UDF as follows:

```sql
%sql
CREATE OR REPLACE FUNCTION get_type(s STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
def get_type(table_name):
    from pyspark.sql.functions import col
    from pyspark.sql import SparkSession
    ...
```
Hi @Avinash_Narala, the error message indicates that the execution of your user-defined function (UDF) get_type failed. This could be due to a variety of reasons. Here are a few things you could check (see also the sketch after this list):
• Data type mismatch: ensure that the data ty...
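One likely culprit, hedged: Unity Catalog Python UDFs run in a sandbox without a SparkSession, so a body that builds one (as the original get_type does) cannot work there. A minimal sketch of a UDF whose body stays pure Python, with a hypothetical function name:

```python
# Minimal sketch of a SQL UDF with a pure-Python body (no SparkSession inside).
spark.sql("""
CREATE OR REPLACE FUNCTION get_type_demo(s STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return type(s).__name__
$$
""")

spark.sql("SELECT get_type_demo('hello') AS t").show()
```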