Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

najmead
by Contributor
  • 1107 Views
  • 2 replies
  • 2 kudos

SQL Warehouse Configuration Tweaking

I'm new to setting up a DB environment, and have accumulated a couple of questions around configuring a SQL Warehouse. 1. When creating a SQL warehouse, the smallest size is 2X-Small, which is 4 DBU. The pricing calculator (for Azure) implies you can c...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Docs do show that it uses E8d as you wrote. SQL Warehouses are a different type of compute than All Purpose or Jobs clusters. SQL warehouses always use Photon. All Purpose and Jobs clusters are used for things such as notebooks or Delta Live Tab...

1 More Reply
AmithAdiraju16
by New Contributor II
  • 1424 Views
  • 4 replies
  • 1 kudos

How to read feature table without target_df / online inference based on filter_condition in databricks feature store

I'm using databricks-feature-store == 0.6.1. After I register my feature table with `create_feature_table` and write data with `write_table`, I want to read that feature table based on filter conditions (maybe on a timestamp column) without calling ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

create_training_set is just a simple SELECT from Delta tables. All feature tables are just registered Delta tables. Here is example code that I used to handle that: customer_features_df = spark.sql("SELECT * FROM recommender_system.customer_fea...
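A minimal sketch of that pattern, reading and filtering the feature table directly as a Delta table (the database, table, and timestamp column names are hypothetical):

# Feature tables are registered Delta tables, so they can be queried
# and filtered without create_training_set (all names hypothetical).
customer_features_df = spark.sql("""
    SELECT *
    FROM recommender_system.customer_features
    WHERE feature_timestamp >= '2022-01-01'
""")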

3 More Replies
Gim
by Contributor
  • 3282 Views
  • 2 replies
  • 1 kudos

Resolved! How to use SQL UDFs for Delta Live Table pipelines?

I've been searching for a way to use a SQL UDF for our DLT pipeline. In this case it is to convert a time duration string into INT seconds. How exactly do we use/apply UDFs in this case?

Latest Reply
daniel_sahal
Esteemed Contributor
  • 1 kudos

@Gim You can create a Python UDF and then use it in SQL: https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cookbook.html#use-python-udfs-in-sql
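Following that cookbook pattern, a minimal sketch for the duration-string case (the duration format, function name, and registration details are assumptions, not the poster's code):

import re
from pyspark.sql.types import IntegerType

# Hypothetical parser for strings like "1h 30m 15s" -> total seconds.
def duration_to_seconds(s):
    if s is None:
        return None
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", s.lower()))

# Register it in a Python notebook of the pipeline so the SQL side can call it:
spark.udf.register("duration_to_seconds", duration_to_seconds, IntegerType())

The pipeline's SQL can then use duration_to_seconds(duration_col) like any built-in function.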

1 More Reply
Bartek
by Contributor
  • 2587 Views
  • 0 replies
  • 1 kudos

How to pass all dag_run.conf parameters to python_wheel_task

I want to trigger a Databricks job from Airflow using DatabricksSubmitRunDeferrableOperator, and I need to pass configuration params. Here is an excerpt from my code (the definition is not complete, only the crucial properties): from airflow.providers.databricks.op...
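One possible way to forward the whole run configuration, sketched under the assumption that the wheel accepts named parameters (the package name, entry point, and cluster spec are hypothetical; the operator's json field is Jinja-templated, so dag_run.conf resolves at trigger time):

from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunDeferrableOperator,
)

run_wheel = DatabricksSubmitRunDeferrableOperator(
    task_id="run_wheel",
    databricks_conn_id="databricks_default",
    json={
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",  # hypothetical
            "num_workers": 1,
        },
        "python_wheel_task": {
            "package_name": "my_package",  # hypothetical
            "entry_point": "main",         # hypothetical
            # "json" is a templated field; with render_template_as_native_obj=True
            # set on the DAG, this renders as a real dict rather than a string.
            "named_parameters": "{{ dag_run.conf }}",
        },
    },
)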

mala
by New Contributor III
  • 2004 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to reproduce Kmeans Clustering results even after setting seed and tolerance

Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet: from pyspark.ml.clustering import KMeans; KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=.00001). Can anyone help?

Latest Reply
mala
New Contributor III
  • 2 kudos

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition. I was able to resolve it by making sure the same data is assigned to the same partitions: df.repartition(num_partitions, "ur_col_id")d...
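A sketch of that fix, assuming a DataFrame df with an assembled features column and the id column from the snippet above (the partition count and column names are hypothetical):

from pyspark.ml.clustering import KMeans

num_partitions = 8  # hypothetical
# Pin the same rows to the same partitions (and order within them)
# so a seeded run sees identical data placement each time.
df = df.repartition(num_partitions, "ur_col_id").sortWithinPartitions("ur_col_id")

kmeans = KMeans(featuresCol="features", k=5, maxIter=40, seed=1, tol=1e-5)
model = kmeans.fit(df)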

2 More Replies
antoooks
by New Contributor III
  • 5011 Views
  • 6 replies
  • 10 kudos

Resolved! Databricks clusters stuck on Pending and Terminating state indefinitely

Hi everyone, our company is using Databricks on GKE. It worked fine until suddenly today, when we tried to create and terminate clusters, they got stuck in the Pending and Terminating states for hours (now more than 6 hours). No conclusion can be drawn ...

(screenshot attached)
Latest Reply
Databricks_Buil
New Contributor III
  • 10 kudos

Hi @Kurnianto Trilaksono Sutjipto: Figured out after multiple discussions that this is typically a cloud provider issue. You can file a support ticket if the issue persists.

5 More Replies
Anonymous
by Not applicable
  • 9908 Views
  • 3 replies
  • 1 kudos

Cluster in Pending State for long time

Pending for a long time at this stage “Finding instances for new nodes, acquiring more instances if necessary”. How can this be fixed?

Latest Reply
Databricks_Buil
New Contributor III
  • 1 kudos

Figured out after multiple discussions that this is typically a cloud provider issue. You can file a support ticket if the issue persists.

2 More Replies
elgeo
by Valued Contributor II
  • 3760 Views
  • 3 replies
  • 3 kudos

Resolved! Trigger on a table

Hello! Is there an equivalent of CREATE TRIGGER on a table in Databricks SQL?

CREATE TRIGGER [schema_name.]trigger_name
ON table_name
AFTER {[INSERT],[UPDATE],[DELETE]}
[NOT FOR REPLICATION]
AS
{sql_statements}

Thank you in advance!

Latest Reply
AdrianLobacz
Contributor
  • 3 kudos

You can try Auto Loader. Auto Loader supports two modes for detecting new files: directory listing and file notification. Directory listing: Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly ...
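A minimal directory-listing sketch (the paths, file format, and target table are hypothetical):

# Incrementally pick up new files from a landing directory and append
# them to a Delta table (all paths and names hypothetical).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
      .load("/mnt/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)  # needs a recent runtime; trigger(once=True) otherwise
   .toTable("bronze_events"))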

2 More Replies
829023
by New Contributor
  • 483 Views
  • 1 reply
  • 0 kudos

Fail to load excel data(timeout) in databricks sample notebook

I'm working with the sample notebook named '1_Customer Lifetimes.py' in https://github.com/databricks-industry-solutions/customer-lifetime-value. In the notebook there is code like this: `%run "./config/Data Extract"`. This loads Excel data; however, it occu...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@Seungsu Lee It could be a destination host issue, a configuration issue, or a network issue. Hard to guess; first check whether your cluster has access to the public internet by running this command:

%sh ping -c 2 google.com

Phani1
by Valued Contributor
  • 2360 Views
  • 1 reply
  • 0 kudos

Parent Hierarchy Queries/ Path Function /Recursive CTE's

Problem statement: we have a scenario where we get data from the source in the format below (in reality 20 levels and more than 4 fields, but for ease of understanding let's consider the following). The actual code involved 20 levels of 4-5 fields ...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

I don't think we have anything similar as a built-in function. You'll need to write some custom code to achieve that.
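For illustration, a sketch of one custom approach that flattens a parent/child table by joining upward once per level, using the 20-level bound mentioned in the question (the table and column names are hypothetical):

from pyspark.sql import functions as F

# Hypothetical input: one row per node with its parent (NULL at the root).
nodes = spark.table("source_hierarchy").select("id", "parent_id")

# Start each node's path with itself, then walk up one level per join.
result = nodes.withColumn("path", F.array("id"))
for _ in range(20):  # the hierarchy is at most 20 levels deep
    result = (
        result.alias("r")
        .join(nodes.alias("n"), F.col("r.parent_id") == F.col("n.id"), "left")
        .select(
            F.col("r.id").alias("id"),
            F.col("n.parent_id").alias("parent_id"),
            F.when(F.col("n.id").isNotNull(),
                   F.concat(F.array(F.col("n.id")), F.col("r.path")))
             .otherwise(F.col("r.path"))
             .alias("path"),
        )
    )

result.select("id", "path").show(truncate=False)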

477061
by Contributor
  • 3393 Views
  • 12 replies
  • 13 kudos

Resolved! Is it possible to use other databases within Delta Live Tables (DLT)?

I have set up a DLT with "testing" set as the target database. I need to join data that exists in a "keys" table in my "beta" database, but I get an AccessDeniedException, despite having full access to both databases via a normal notebook. A snippet d...

Latest Reply
477061
Contributor
  • 13 kudos

As an update to this issue: I was running the DLT pipeline on a personal cluster that had an instance profile defined (as per Databricks best practices). As a result, the pipeline did not have permission to access other S3 resources (e.g. other databa...
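In case it helps others hitting the same AccessDeniedException, a sketch of granting a DLT pipeline's clusters an instance profile in the pipeline settings (this may not be the poster's exact fix, and the ARN is hypothetical):

# Fragment of DLT pipeline settings, expressed as a Python dict.
pipeline_settings = {
    "clusters": [{
        "label": "default",
        "aws_attributes": {
            # Hypothetical profile with access to the other databases' S3 paths.
            "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/dlt-access",
        },
    }],
}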

11 More Replies
explorer
by New Contributor III
  • 3549 Views
  • 6 replies
  • 3 kudos

Getting error while loading parquet data into Postgres (using spark-postgres library) ClassNotFoundException: Failed to find data source: postgres. Please find packages at http://spark.apache.org/third-party-projects.html Caused by: ClassNotFoundException

Hi fellas, I'm trying to load parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk-uploading data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...

Latest Reply
explorer
New Contributor III
  • 3 kudos

Hi @Kaniz Fatma, @Daniel Sahal, a few updates from my side. After many hits and trials, psycopg2 worked out in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB memory, 4 cores) and a driver with 32 GB memory, 4 cores, with Run...
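A rough sketch of the psycopg2 route, COPYing each Spark partition into Postgres (connection details, paths, and the table name are hypothetical, and psycopg2 must be installed on the cluster):

import io
import psycopg2

df = spark.read.parquet("gs://my-bucket/landing/")  # hypothetical GCS path

def copy_partition(rows):
    # One connection per partition; COPY is far faster than row-by-row INSERTs.
    conn = psycopg2.connect(host="pg-host", dbname="mydb",
                            user="loader", password="***")  # hypothetical
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join("" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_from(buf, "target_table", null="")  # hypothetical table
    conn.commit()
    conn.close()

df.foreachPartition(copy_partition)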

5 More Replies
138999
by New Contributor
  • 576 Views
  • 1 reply
  • 0 kudos

How are parallel and subsequent jobs handled by cluster?

Hello, apologies for the dumb question, but I'm new to Databricks and need clarification on the following. Are parallel and subsequent jobs able to reuse the same compute resources to keep time and cost overhead as low as possible, or do they spin up a new cl...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@tanja.savic You can use a shared job cluster: https://docs.databricks.com/workflows/jobs/jobs.html#use-shared-job-clusters But remember that a shared job cluster is scoped to a single job run and cannot be used by other jobs or runs of the...
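For illustration, a Jobs API 2.1-style payload in which two tasks share one cluster within a single run (the names, node type, and notebook paths are hypothetical):

# Both tasks reference the same job_cluster_key, so one cluster serves the
# whole run; a new run still provisions a fresh cluster.
job_spec = {
    "name": "shared-cluster-demo",
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",  # hypothetical
            "num_workers": 2,
        },
    }],
    "tasks": [
        {"task_key": "extract",
         "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/Jobs/extract"}},  # hypothetical
        {"task_key": "transform",
         "depends_on": [{"task_key": "extract"}],
         "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/Jobs/transform"}},  # hypothetical
    ],
}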

Phani1
by Valued Contributor
  • 662 Views
  • 1 reply
  • 1 kudos

Resolved! Databricks - Calling dashboard another dashboard..

Hi Team, can we call a dashboard from another dashboard? An example screenshot is attached. The main dashboard has 3 buttons that point to 3 different dashboards, and clicking any of the buttons should redirect to the respective dashboard.

Latest Reply
daniel_sahal
Esteemed Contributor
  • 1 kudos

@Janga Reddy I don't think this is possible at the moment. You can raise a feature request here: https://docs.databricks.com/resources/ideas.html

Ancil
by Contributor II
  • 1776 Views
  • 3 replies
  • 1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf that works for 1 row, but when I tried it with more than one row I got the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

I was testing, and your function is correct. So the error must be in the inputData type (is it all strings?) or with result_json. Please also check the runtime version; I was using 11 LTS.
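For reference, a minimal Scalar Iterator pandas UDF that preserves the one-output-row-per-input-row invariant the error message refers to (the transformation itself is hypothetical):

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def normalize(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        # Each yielded Series must match the length of its input batch;
        # yielding one aggregated value per batch raises the RuntimeError above.
        yield batch.str.strip().str.lower()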

2 More Replies