Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

najmead
by Contributor
  • 1107 Views
  • 2 replies
  • 2 kudos

SQL Warehouse Configuration Tweaking

I'm new to setting up a DB environment, and have accumulated a couple of questions around configuring a SQL Warehouse. 1. When creating a SQL warehouse, the smallest size is 2X-Small, which is 4 DBU. The pricing calculator (for Azure) implies you can c...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Docs do show that it uses E8d as you wrote. SQL Warehouses are a different type of compute than All Purpose or Jobs clusters. SQL warehouses always use Photon. All Purpose and Jobs clusters are used for things such as notebooks or Delta Live Tab...

1 More Reply
AmithAdiraju16
by New Contributor II
  • 1424 Views
  • 4 replies
  • 1 kudos

How to read feature table without target_df / online inference based on filter_condition in databricks feature store

I'm using databricks-feature-store == 0.6.1. After I register my feature table with `create_feature_table` and write data with `write_table`, I want to read that feature table based on filter conditions (maybe on a timestamp column) without calling ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

create_training_set is just a simple SELECT from Delta tables. All feature tables are just registered Delta tables. Here is example code that I used to handle that: customer_features_df = spark.sql("SELECT * FROM recommender_system.customer_fea...
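A minimal sketch of that pattern, reading and filtering the feature table directly as a Delta table (the database, table, and timestamp column names are hypothetical):

# Feature tables are registered Delta tables, so they can be queried
# and filtered without create_training_set (all names hypothetical).
customer_features_df = spark.sql("""
    SELECT *
    FROM recommender_system.customer_features
    WHERE feature_timestamp >= '2022-01-01'
""")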

3 More Replies
Gim
by Contributor
  • 3282 Views
  • 2 replies
  • 1 kudos

Resolved! How to use SQL UDFs for Delta Live Table pipelines?

I've been searching for a way to use a SQL UDF for our DLT pipeline. In this case it is to convert a time duration string into INT seconds. How exactly do we use/apply UDFs in this case?

Latest Reply
daniel_sahal
Esteemed Contributor
  • 1 kudos

@Gim You can create a Python UDF and then use it in SQL: https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cookbook.html#use-python-udfs-in-sql
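Following that cookbook pattern, a minimal sketch for the duration-string case (the duration format, function name, and registration details are assumptions, not the poster's code):

import re
from pyspark.sql.types import IntegerType

# Hypothetical parser for strings like "1h 30m 15s" -> total seconds.
def duration_to_seconds(s):
    if s is None:
        return None
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", s.lower()))

# Register it in a Python notebook of the pipeline so the SQL side can call it:
spark.udf.register("duration_to_seconds", duration_to_seconds, IntegerType())

The pipeline's SQL can then use duration_to_seconds(duration_col) like any built-in function.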

1 More Reply
Bartek
by Contributor
  • 2587 Views
  • 0 replies
  • 1 kudos

How to pass all dag_run.conf parameters to python_wheel_task

I want to trigger a Databricks job from Airflow using DatabricksSubmitRunDeferrableOperator, and I need to pass configuration params. Here is an excerpt from my code (the definition is not complete, only the crucial properties): from airflow.providers.databricks.op...
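One possible way to forward the whole run configuration, sketched under the assumption that the wheel accepts named parameters (the package name, entry point, and cluster spec are hypothetical; the operator's json field is Jinja-templated, so dag_run.conf resolves at trigger time):

from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunDeferrableOperator,
)

run_wheel = DatabricksSubmitRunDeferrableOperator(
    task_id="run_wheel",
    databricks_conn_id="databricks_default",
    json={
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",  # hypothetical
            "num_workers": 1,
        },
        "python_wheel_task": {
            "package_name": "my_package",  # hypothetical
            "entry_point": "main",         # hypothetical
            # "json" is a templated field; with render_template_as_native_obj=True
            # set on the DAG, this renders as a real dict rather than a string.
            "named_parameters": "{{ dag_run.conf }}",
        },
    },
)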

mala
by New Contributor III
  • 2004 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to reproduce Kmeans Clustering results even after setting seed and tolerance

Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet: from pyspark.ml.clustering import KMeans; KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=.00001). Can anyone help?

Latest Reply
mala
New Contributor III
  • 2 kudos

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition. I was able to resolve it by making sure the same data is assigned to the same partitions: df.repartition(num_partitions, "ur_col_id")d...
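A sketch of that fix, assuming a DataFrame df with an assembled features column and the id column from the snippet above (the partition count and column names are hypothetical):

from pyspark.ml.clustering import KMeans

num_partitions = 8  # hypothetical
# Pin the same rows to the same partitions (and order within them)
# so a seeded run sees identical data placement each time.
df = df.repartition(num_partitions, "ur_col_id").sortWithinPartitions("ur_col_id")

kmeans = KMeans(featuresCol="features", k=5, maxIter=40, seed=1, tol=1e-5)
model = kmeans.fit(df)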

2 More Replies
antoooks
by New Contributor III
  • 5011 Views
  • 6 replies
  • 10 kudos

Resolved! Databricks clusters stuck on Pending and Terminating state indefinitely

Hi everyone, our company is using Databricks on GKE. It worked fine until suddenly today, when we tried to create and terminate clusters, they got stuck in the Pending and Terminating states for hours (now more than 6 hours). No conclusion can be drawn ...

(screenshot attached)
Latest Reply
Databricks_Buil
New Contributor III
  • 10 kudos

Hi @Kurnianto Trilaksono Sutjipto: Figured out after multiple discussions that this is typically a cloud provider issue. You can file a support ticket if the issue persists.

5 More Replies
Anonymous
by Not applicable
  • 9908 Views
  • 3 replies
  • 1 kudos

Cluster in Pending State for long time

Pending for a long time at this stage “Finding instances for new nodes, acquiring more instances if necessary”. How can this be fixed?

Latest Reply
Databricks_Buil
New Contributor III
  • 1 kudos

Figured out after multiple discussions that this is typically a cloud provider issue. You can file a support ticket if the issue persists.

2 More Replies
elgeo
by Valued Contributor II
  • 3760 Views
  • 3 replies
  • 3 kudos

Resolved! Trigger on a table

Hello! Is there an equivalent of CREATE TRIGGER on a table in Databricks SQL?

CREATE TRIGGER [schema_name.]trigger_name
ON table_name
AFTER {[INSERT],[UPDATE],[DELETE]}
[NOT FOR REPLICATION]
AS
{sql_statements}

Thank you in advance!

Latest Reply
AdrianLobacz
Contributor
  • 3 kudos

You can try Auto Loader. Auto Loader supports two modes for detecting new files: directory listing and file notification. Directory listing: Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly ...
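A minimal directory-listing sketch (the paths, file format, and target table are hypothetical):

# Incrementally pick up new files from a landing directory and append
# them to a Delta table (all paths and names hypothetical).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
      .load("/mnt/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)  # needs a recent runtime; trigger(once=True) otherwise
   .toTable("bronze_events"))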

2 More Replies
829023
by New Contributor
  • 483 Views
  • 1 reply
  • 0 kudos

Fail to load excel data(timeout) in databricks sample notebook

I'm working with the sample notebook named '1_Customer Lifetimes.py' in https://github.com/databricks-industry-solutions/customer-lifetime-value. In the notebook there is code like this: `%run "./config/Data Extract"`. This loads Excel data; however, it occu...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@Seungsu Lee It could be a destination host issue, a configuration issue, or a network issue. Hard to guess; first check whether your cluster has access to the public internet by running this command:

%sh ping -c 2 google.com

Phani1
by Valued Contributor
  • 2360 Views
  • 1 reply
  • 0 kudos

Parent Hierarchy Queries/ Path Function /Recursive CTE's

Problem statement: we have a scenario where we get data from the source in the format below (in reality 20 levels and more than 4 fields, but for ease of understanding let's consider the following). The actual code involved 20 levels of 4-5 fields ...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

I don't think we have anything similar as a built-in function. You'll need to write some custom code to achieve that.
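For illustration, a sketch of one custom approach that flattens a parent/child table by joining upward once per level, using the 20-level bound mentioned in the question (the table and column names are hypothetical):

from pyspark.sql import functions as F

# Hypothetical input: one row per node with its parent (NULL at the root).
nodes = spark.table("source_hierarchy").select("id", "parent_id")

# Start each node's path with itself, then walk up one level per join.
result = nodes.withColumn("path", F.array("id"))
for _ in range(20):  # the hierarchy is at most 20 levels deep
    result = (
        result.alias("r")
        .join(nodes.alias("n"), F.col("r.parent_id") == F.col("n.id"), "left")
        .select(
            F.col("r.id").alias("id"),
            F.col("n.parent_id").alias("parent_id"),
            F.when(F.col("n.id").isNotNull(),
                   F.concat(F.array(F.col("n.id")), F.col("r.path")))
             .otherwise(F.col("r.path"))
             .alias("path"),
        )
    )

result.select("id", "path").show(truncate=False)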

477061
by Contributor
  • 3393 Views
  • 12 replies
  • 13 kudos

Resolved! Is it possible to use other databases within Delta Live Tables (DLT)?

I have set up a DLT with "testing" set as the target database. I need to join data that exists in a "keys" table in my "beta" database, but I get an AccessDeniedException, despite having full access to both databases via a normal notebook. A snippet d...

Latest Reply
477061
Contributor
  • 13 kudos

As an update to this issue: I was running the DLT pipeline on a personal cluster that had an instance profile defined (as per Databricks best practices). As a result, the pipeline did not have permission to access other S3 resources (e.g. other databa...
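In case it helps others hitting the same AccessDeniedException, a sketch of granting a DLT pipeline's clusters an instance profile in the pipeline settings (this may not be the poster's exact fix, and the ARN is hypothetical):

# Fragment of DLT pipeline settings, expressed as a Python dict.
pipeline_settings = {
    "clusters": [{
        "label": "default",
        "aws_attributes": {
            # Hypothetical profile with access to the other databases' S3 paths.
            "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/dlt-access",
        },
    }],
}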

11 More Replies
explorer
by New Contributor III
  • 3549 Views
  • 6 replies
  • 3 kudos

Getting error while loading parquet data into Postgres (using spark-postgres library) ClassNotFoundException: Failed to find data source: postgres. Please find packages at http://spark.apache.org/third-party-projects.html Caused by: ClassNotFoundException

Hi fellas, I'm trying to load parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk-uploading data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...

Latest Reply
explorer
New Contributor III
  • 3 kudos

Hi @Kaniz Fatma, @Daniel Sahal, a few updates from my side. After many hits and trials, psycopg2 worked out in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB memory, 4 cores) and a driver with 32 GB memory, 4 cores, with Run...
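A rough sketch of the psycopg2 route, COPYing each Spark partition into Postgres (connection details, paths, and the table name are hypothetical, and psycopg2 must be installed on the cluster):

import io
import psycopg2

df = spark.read.parquet("gs://my-bucket/landing/")  # hypothetical GCS path

def copy_partition(rows):
    # One connection per partition; COPY is far faster than row-by-row INSERTs.
    conn = psycopg2.connect(host="pg-host", dbname="mydb",
                            user="loader", password="***")  # hypothetical
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join("" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_from(buf, "target_table", null="")  # hypothetical table
    conn.commit()
    conn.close()

df.foreachPartition(copy_partition)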

5 More Replies
138999
by New Contributor
  • 576 Views
  • 1 reply
  • 0 kudos

How are parallel and subsequent jobs handled by cluster?

Hello, apologies for the dumb question, but I'm new to Databricks and need clarification on the following. Are parallel and subsequent jobs able to reuse the same compute resources to keep time and cost overhead as low as possible, or do they spin up a new cl...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@tanja.savic You can use a shared job cluster: https://docs.databricks.com/workflows/jobs/jobs.html#use-shared-job-clusters But remember that a shared job cluster is scoped to a single job run and cannot be used by other jobs or runs of the...
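For illustration, a Jobs API 2.1-style payload in which two tasks share one cluster within a single run (the names, node type, and notebook paths are hypothetical):

# Both tasks reference the same job_cluster_key, so one cluster serves the
# whole run; a new run still provisions a fresh cluster.
job_spec = {
    "name": "shared-cluster-demo",
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",  # hypothetical
            "num_workers": 2,
        },
    }],
    "tasks": [
        {"task_key": "extract",
         "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/Jobs/extract"}},  # hypothetical
        {"task_key": "transform",
         "depends_on": [{"task_key": "extract"}],
         "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/Jobs/transform"}},  # hypothetical
    ],
}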

Phani1
by Valued Contributor
  • 662 Views
  • 1 reply
  • 1 kudos

Resolved! Databricks - Calling dashboard another dashboard..

Hi Team, can we call a dashboard from another dashboard? An example screenshot is attached. The main dashboard has 3 buttons that point to 3 different dashboards, and clicking any of the buttons should redirect to the respective dashboard.

Latest Reply
daniel_sahal
Esteemed Contributor
  • 1 kudos

@Janga Reddy I don't think this is possible at the moment. You can raise a feature request here: https://docs.databricks.com/resources/ideas.html

Ancil
by Contributor II
  • 1776 Views
  • 3 replies
  • 1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf that works for 1 row, but when I tried it with more than one row I got the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

I was testing, and your function is correct. So the error must be in the inputData type (is it all strings?) or with result_json. Please also check the runtime version; I was using 11 LTS.
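For reference, a minimal Scalar Iterator pandas UDF that preserves the one-output-row-per-input-row invariant the error message refers to (the transformation itself is hypothetical):

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def normalize(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        # Each yielded Series must match the length of its input batch;
        # yielding one aggregated value per batch raises the RuntimeError above.
        yield batch.str.strip().str.lower()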

2 More Replies