Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

Anonymous
by Not applicable
  • 10346 Views
  • 3 replies
  • 1 kudos

Cluster in Pending State for long time

The cluster stays pending for a long time at the stage “Finding instances for new nodes, acquiring more instances if necessary”. How can this be fixed?

Latest Reply
Databricks_Buil
New Contributor III
  • 1 kudos

After multiple connects, I figured out that this is typically a cloud provider issue. You can file a support ticket if the issue persists.

  • 1 kudos
2 More Replies
elgeo
by Valued Contributor II
  • 4137 Views
  • 3 replies
  • 3 kudos

Resolved! Trigger on a table

Hello! Is there an equivalent of CREATE TRIGGER on a table in Databricks SQL?
CREATE TRIGGER [schema_name.]trigger_name
ON table_name
AFTER {[INSERT],[UPDATE],[DELETE]}
[NOT FOR REPLICATION]
AS
{sql_statements}
Thank you in advance!

Latest Reply
AdrianLobacz
Contributor
  • 3 kudos

You can try Auto Loader. Auto Loader supports two modes for detecting new files: directory listing and file notification. Directory listing: Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly ...

  • 3 kudos
2 More Replies
829023
by New Contributor
  • 526 Views
  • 1 replies
  • 0 kudos

Failure to load Excel data (timeout) in Databricks sample notebook

I'm working with the sample notebook named '1_Customer Lifetimes.py' in https://github.com/databricks-industry-solutions/customer-lifetime-value In the notebook, there is code like this: `%run "./config/Data Extract"` This loads Excel data, however it occu...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@Seungsu Lee It could be a destination host issue, a configuration issue, or a network issue. It's hard to guess. First, check whether your cluster has access to the public internet by running this command: %sh ping -c 2 google.com
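If ping is blocked on the network (ICMP often is), a TCP connection test gives a similar signal. The following is a minimal sketch, not part of the original reply; the function name `can_reach` and the example host and port in the comment are assumptions chosen for illustration.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call from a notebook cell (host and port are illustrative):
# can_reach("github.com", 443)
```

A result of False for a public HTTPS endpoint suggests the cluster has no outbound internet access, which would be consistent with a timeout while downloading the sample data.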

  • 0 kudos
Phani1
by Valued Contributor
  • 2509 Views
  • 1 replies
  • 0 kudos

Parent Hierarchy Queries / Path Function / Recursive CTEs

Problem Statement: We have a scenario where we get the data from the source in the format below (in reality there are 20 levels and more than 4 fields, but for ease of understanding let's consider the example below). The actual code involved 20 levels of 4-5 fields ...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

I don't think that we have anything similar as a built-in function. You'll need to write some custom code to achieve that.
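As one possible shape for such custom code, here is a minimal sketch that builds a root-to-node path for every node of a parent/child mapping. The input layout (a dict from node to parent, with None for roots), the function name `build_paths`, and the "/" separator are assumptions; the original post does not show its actual schema.

```python
from typing import Dict, Optional


def build_paths(parent_of: Dict[str, Optional[str]], sep: str = "/") -> Dict[str, str]:
    """Map each node to its full path from the root, e.g. 'A/B/C'."""
    paths: Dict[str, str] = {}
    for node in parent_of:
        chain = []
        seen = set()
        current: Optional[str] = node
        # Walk upward until a root (parent None or unknown) is reached.
        while current is not None:
            if current in seen:
                raise ValueError(f"cycle detected at {current!r}")
            seen.add(current)
            chain.append(current)
            current = parent_of.get(current)
        paths[node] = sep.join(reversed(chain))
    return paths


hierarchy = {"A": None, "B": "A", "C": "B", "D": "A"}
print(build_paths(hierarchy))
```

The walk is iterative, so depth (20 levels in the question) is not a concern, and the cycle check guards against malformed source data. For large tables the same idea would need to be expressed as repeated self-joins on a DataFrame rather than a driver-side dict.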

  • 0 kudos
477061
by Contributor
  • 3815 Views
  • 12 replies
  • 13 kudos

Resolved! Is it possible to use other databases within Delta Live Tables (DLT)?

I have set up a DLT with "testing" set as the target database. I need to join data that exists in a "keys" table in my "beta" database, but I get an AccessDeniedException, despite having full access to both databases via a normal notebook. A snippet d...

Latest Reply
477061
Contributor
  • 13 kudos

As an update to this issue: I was running the DLT pipeline on a personal cluster that had an instance profile defined (as per Databricks best practices). As a result, the pipeline did not have permission to access other S3 resources (e.g other databa...

  • 13 kudos
11 More Replies
explorer
by New Contributor III
  • 3869 Views
  • 6 replies
  • 3 kudos

Getting error while loading parquet data into Postgres (using spark-postgres library) ClassNotFoundException: Failed to find data source: postgres. Please find packages at http://spark.apache.org/third-party-projects.html Caused by: ClassNotFoundException

Hi Fellas - I'm trying to load Parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk upload of data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...

Latest Reply
explorer
New Contributor III
  • 3 kudos

Hi @Kaniz Fatma, @Daniel Sahal - A few updates from my side. After a lot of trial and error, psycopg2 worked out in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB Memory, 4 Cores) and a driver with 32 GB Memory, 4 Cores with Run...

  • 3 kudos
5 More Replies
138999
by New Contributor
  • 635 Views
  • 1 replies
  • 0 kudos

How are parallel and subsequent jobs handled by cluster?

Hello, apologies for the dumb question, but I'm new to Databricks and need clarification on the following. Are parallel and subsequent jobs able to reuse the same compute resources to keep time and cost overhead as low as possible vs. are they spinning a new cl...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@tanja.savic You can use a shared job cluster: https://docs.databricks.com/workflows/jobs/jobs.html#use-shared-job-clusters But remember that a shared job cluster is scoped to a single job run, and cannot be used by other jobs or runs of the...
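To illustrate what a shared job cluster looks like in a job definition, here is a rough sketch of Jobs API 2.1 style settings written as a Python dict. The job name, cluster key, task keys, and notebook paths are hypothetical, and the runtime version and node type are left as placeholders to fill in for your workspace.

```python
import json

# One cluster definition, declared once under "job_clusters" ...
job_settings = {
    "name": "example-shared-cluster-job",  # hypothetical job name
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",  # hypothetical key
            "new_cluster": {
                "spark_version": "<runtime-version>",  # placeholder
                "node_type_id": "<node-type>",  # placeholder
                "num_workers": 2,
            },
        }
    ],
    # ... and referenced by several tasks of the same job via the key.
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Jobs/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Jobs/transform"},
        },
    ],
}

print(json.dumps(job_settings, indent=2))
```

Both tasks point at the same `job_cluster_key`, so they reuse one cluster within a run instead of each starting its own. As the reply notes, that reuse stops at the boundary of the job run.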

  • 0 kudos
Phani1
by Valued Contributor
  • 764 Views
  • 1 replies
  • 1 kudos

Resolved! Databricks - Calling a dashboard from another dashboard

Hi Team, can we call a dashboard from another dashboard? An example screenshot is attached. The main dashboard has 3 buttons that point to 3 different dashboards, and clicking any of the buttons should redirect to the respective dashboard.

Latest Reply
daniel_sahal
Esteemed Contributor
  • 1 kudos

@Janga Reddy I don't think that this is possible at this moment. You can raise a feature request here: https://docs.databricks.com/resources/ideas.html

  • 1 kudos
Ancil
by Contributor II
  • 1991 Views
  • 3 replies
  • 1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf. It works for 1 row, but when I tried it with more than one row I got the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

I tested it, and your function is correct. So the error must be in the inputData type (is it all strings?) or in result_json. Please also check the runtime version. I was using 11 LTS.
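The error message itself states the contract: a scalar iterator UDF must return as many rows as it receives. The sketch below mimics that check with plain Python lists standing in for pandas Series batches. All function names are hypothetical, and this is not Spark's actual implementation or the poster's function.

```python
from typing import Callable, Iterator, List

Batch = List[str]


def upper_batches(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Correct shape: one output value per input value, batch by batch."""
    for batch in batches:
        yield [value.upper() for value in batch]


def collapse_batches(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Incorrect shape: collapses each batch into a single output value."""
    for batch in batches:
        yield ["|".join(batch)]


def check_lengths(udf: Callable[[Iterator[Batch]], Iterator[Batch]],
                  batches: List[Batch]) -> None:
    """Raise if the total output length differs from the total input length."""
    n_in = sum(len(batch) for batch in batches)
    n_out = sum(len(batch) for batch in udf(iter(batches)))
    if n_in != n_out:
        raise RuntimeError(
            f"length of output was {n_out} and length of input was {n_in}"
        )


check_lengths(upper_batches, [["a", "b"], ["c"]])  # passes silently
```

A function that works for one row but fails for two, as described in the question, often behaves like `collapse_batches`: it produces a single result per batch rather than one per row.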

  • 1 kudos
2 More Replies
Brave
by New Contributor II
  • 2982 Views
  • 6 replies
  • 4 kudos

Resolved! Exporting R data frame variable

Hi all. I am trying to export an R data frame variable as a CSV file. I am using this formula: df<- data.frame(VALIDADOR_FIM) df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv") But it isn't working. ...

Latest Reply
Kaniz_Fatma
Community Manager
  • 4 kudos

Hi @FELIPE VALENTE, we haven’t heard from you since the last response from @sherbin w, and I was checking back to see if his suggestions helped you. Or else, if you have any solution, please share it with the community, as...

  • 4 kudos
5 More Replies
Prem1
by New Contributor III
  • 11070 Views
  • 21 replies
  • 11 kudos

java.lang.IllegalArgumentException: java.net.URISyntaxException

I am using Databricks Auto Loader to load JSON files from ADLS Gen2 incrementally in directory listing mode. All source filenames contain a timestamp. Auto Loader works perfectly for a couple of days with the configuration below and breaks the next day ...

Latest Reply
jshields
New Contributor II
  • 11 kudos

Hi Everyone, I'm seeing this issue as well, with the same configuration as the previous posts: Auto Loader with incremental file listing turned on. The strange part is that it mostly works despite almost all of the files we're loading having colons incl...

  • 11 kudos
20 More Replies
Sandesh87
by New Contributor III
  • 3133 Views
  • 4 replies
  • 2 kudos

spark-streaming read from specific event hub partition

The Azure event hub "my_event_hub" has a total of 5 partitions ("0", "1", "2", "3", "4").
The readstream should only read events from partitions "0" and "4".
Event hub configuration as streaming source:
val name = "my_event_hub"
val connectionString = "m...

Latest Reply
keshav
New Contributor II
  • 2 kudos

I tried using the snippet below to receive messages only from partition id=0:
ehName = "<<EVENT-HUB-NAME>>"
# Create event position for partition 0
positionKey1 = { "ehName": ehName, "partitionId": 0 }
eventPosition1 = { "offset": "@latest", ...

  • 2 kudos
3 More Replies
databicky
by Contributor II
  • 3106 Views
  • 3 replies
  • 0 kudos

Resolved! How to add a background color to an Excel sheet with Python

I want to add color to specific cells of an Excel sheet with Python, and I have done that, but I need to exclude the header column. When I tried the same method on another sheet, it didn't work. The background color is reflected in one sheet but...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

Convert your DataFrame to pandas on Spark.
Color cells using the style property: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.style.html
Export to Excel using pandas to_excel: https://spark.apache.org/d...

  • 0 kudos
2 More Replies
alvaro_databric
by New Contributor III
  • 702 Views
  • 1 replies
  • 0 kudos

Relation between Driver and Executor size

Hi, I would like to ask for recommendations regarding the size of the driver and the number of executors managed by that driver. I am aware of the best practices regarding executor size/number but I have doubts about the number of executors a single dr...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

It depends on your use case. The best approach is to connect Datadog and look at driver and worker utilization: https://docs.datadoghq.com/integrations/databricks/?tab=driveronly Just from my experience: usually, for big datasets, when you need to autoscale workers between ...

  • 0 kudos
alvaro_databric
by New Contributor III
  • 1468 Views
  • 1 replies
  • 1 kudos

Resolved! Task time Spark UI

Hello all, I would like to know why task times (among other times in the Spark UI) display values like 1h or 2h when the task really only takes some seconds or minutes. What is the meaning of these high time values I see all around the Spark UI? Thanks in adv...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

That is accumulated time. https://stackoverflow.com/questions/73302982/task-time-and-gc-time-calculation-in-spark-ui-in-executor-section
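A small arithmetic sketch of why an accumulated figure can read "1h" while the stage finishes in minutes, assuming (as the reply says) that the displayed value is the sum of individual task durations. The task count, task length, and number of parallel slots are made-up numbers, and the scheduler here is a simplified stand-in for Spark's.

```python
import heapq
from typing import List


def wall_clock_seconds(durations: List[float], slots: int) -> float:
    """Elapsed time when tasks run greedily on a fixed number of parallel slots."""
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for duration in durations:
        # The next task starts on whichever slot frees up first.
        start = heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + duration)
    return max(finish_times)


# Hypothetical workload: 120 tasks of 30 seconds each on 8 cores.
durations = [30.0] * 120
accumulated = sum(durations)                # 3600 s, shown as "1 h" when summed
elapsed = wall_clock_seconds(durations, 8)  # 450 s, i.e. 7.5 minutes of real time

print(f"accumulated task time: {accumulated / 3600:.1f} h")
print(f"wall-clock time:       {elapsed / 60:.1f} min")
```

The more cores an executor has, the larger the gap between the summed task time and the time you actually wait.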

  • 1 kudos