Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

alvaro_databric
by New Contributor III
  • 2126 Views
  • 1 reply
  • 2 kudos

Resolved! Fastest Azure VM for Databricks Big Data workload

Hi All, It is well known that Azure provides a wide variety of VMs for Databricks, some of which provide powerful features such as Photon and Delta Caching. I would like to ask the community: which do you think is the fastest cluster for performing Big...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Alvaro Moure: The performance of a Databricks cluster for big data operations depends on many factors, such as the amount and structure of the data, the nature of the operations being performed, the configuration of the cluster, and the specific re...

Sujitha
by Databricks Employee
  • 7364 Views
  • 0 replies
  • 2 kudos

Weekly Release Notes Recap

Weekly Release Notes Recap: Here's a quick recap of the latest release notes updates from the past week. Databricks platform release notes (March 13 - 17, 2023): Execute SQL cells in the notebook in parallel. You can now run SQL cells in Databricks noteboo...

Arunsundar
by New Contributor III
  • 2891 Views
  • 4 replies
  • 4 kudos

The possibility of determining the workload dynamically and spinning up the cluster based on the workload

Hi Team, Good morning. I would like to understand if there is a possibility to determine the workload automatically through code (data load from a file to a table, determine the file size, kind of a benchmark that we can check), based on which we can ...

Latest Reply
pvignesh92
Honored Contributor
  • 4 kudos

Hi @Arunsundar Muthumanickam, When you say workload, I believe you might be handling various volumes of data between the Dev and Prod environments. If you are using a Databricks cluster and do not have much idea of how the volumes might turn out in differ...
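To make the sizing idea concrete, here is a minimal sketch: measure the input size up front with dbutils.fs.ls and map it to a cluster profile. The path, thresholds, and profile names are placeholder assumptions, not anything from the thread.

# Hedged sketch: derive a cluster profile from the input size.
# dbutils.fs.ls is non-recursive, so nested folders are not counted here.
def total_size_bytes(path):
    return sum(f.size for f in dbutils.fs.ls(path) if not f.isDir())

size_gb = total_size_bytes("/mnt/raw/input") / (1024 ** 3)  # hypothetical path
profile = "small" if size_gb < 10 else "large"               # made-up threshold
print(f"{size_gb:.1f} GB of input -> '{profile}' cluster profile")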

3 More Replies
DipakBachhav
by New Contributor III
  • 13050 Views
  • 3 replies
  • 3 kudos

Resolved! Getting error Caused by: com.databricks.NotebookExecutionException: FAILED

I am trying to run the below notebook through Databricks but am getting the below error. I have tried updating the notebook timeout and the retry mechanism, but still no luck yet. NotebookData("/Users/mynotebook", 9900, retry=3) ] res = parallelNot...

Latest Reply
sujai_sparks
New Contributor III
  • 3 kudos

Hi @Dipak Bachhav, not sure if you have fixed the issue, but here are a few things you can check: Is the path "/Users/mynotebook" correct? Maybe you are missing the dot in the beginning. Run the notebook using dbutils.notebook.run("/Users/mynotebook") ...
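For reference, a minimal retry wrapper around dbutils.notebook.run in the spirit of the snippet above; the path and timeout mirror the post, while the helper name and retry loop are illustrative assumptions.

# Hedged sketch: retry a notebook run a fixed number of times.
def run_with_retry(path, timeout_seconds=9900, retries=3, args=None):
    last_err = None
    for attempt in range(retries):
        try:
            return dbutils.notebook.run(path, timeout_seconds, args or {})
        except Exception as err:  # e.g. NotebookExecutionException: FAILED
            last_err = err
    raise last_err

result = run_with_retry("/Users/mynotebook")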

2 More Replies
ossinova
by Contributor II
  • 1643 Views
  • 1 reply
  • 2 kudos

PIVOT on month and quarter

I want to simplify this query: SELECT year(EntryDate) Year, AccountNumber, sum(CreditBase - DebitBase) FILTER (WHERE month(EntryDate) = 1) AS jan_total, sum(CreditBase - DebitBase) FILTER (WHERE month(EntryDate) = 2) AS feb_total, sum(CreditBase - Debi...

Latest Reply
Lakshay
Databricks Employee
  • 2 kudos

Hi @Oscar Dyremyhr, PIVOT doesn't support two FOR clauses. You can PIVOT either on month or on quarter: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-select-pivot.html
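As a minimal sketch of the single-FOR-clause form, pivoting on month only (the column names come from the question; the table name is an assumption):

# Hedged sketch: PIVOT on month; quarter totals can then be summed from
# the resulting month columns. 'ledger_entries' is a placeholder name.
monthly = spark.sql("""
    SELECT * FROM (
        SELECT year(EntryDate) AS Year,
               AccountNumber,
               month(EntryDate) AS Month,
               CreditBase - DebitBase AS Net
        FROM ledger_entries
    )
    PIVOT (
        SUM(Net) FOR Month IN (1 AS jan_total, 2 AS feb_total, 3 AS mar_total)
    )
""")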

Dale_Ware
by New Contributor III
  • 2497 Views
  • 2 replies
  • 3 kudos

Resolved! How to query a table with backslashes in the name.

I am trying to query a Snowflake table from a Databricks data frame, similar to the following example. sql_query = "select * from Database.Schema.Table_/Name_/V" sqlContext.sql(f"{sql_query}") And I get an error like this: ParseException: [PARSE_SYNTAX_...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 3 kudos

You can use double quotes around the name. When using quotes, it is important to write the table names in capital letters: SELECT * FROM "/TABLE/NAME"
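On the Databricks side, special characters in identifiers are escaped with backticks rather than double quotes; a minimal sketch using the name from the post:

# Hedged sketch: backtick-quote the identifier so Spark SQL can parse it.
# (Snowflake itself uses double quotes, and quoted names there are case-sensitive.)
sql_query = "SELECT * FROM Database.Schema.`Table_/Name_/V`"
df = spark.sql(sql_query)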

1 More Replies
ramankr48
by Contributor II
  • 16161 Views
  • 5 replies
  • 8 kudos

Resolved! How to get all the table names with a specific column or columns in a database?

Let's say there is a database db containing 700 tables, and we need to find all the table names in which the column "project_id" is present. Just an example for understanding the question.

Latest Reply
Anonymous
Not applicable
  • 8 kudos

databaseName = "db"
desiredColumn = "project_id"
database = spark.sql(f"show tables in {databaseName}").collect()
tablenames = []
for row in database:
    cols = spark.table(row.tableName).columns
    if desiredColumn in cols:
        tablenames.append(row....
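A complete, runnable version of that approach (database and column names are the ones from the question):

# Hedged sketch: collect every table in the database whose schema contains
# the desired column. The table name is qualified so spark.table resolves
# it in the right database regardless of the current one.
databaseName = "db"
desiredColumn = "project_id"

tablenames = []
for row in spark.sql(f"SHOW TABLES IN {databaseName}").collect():
    cols = spark.table(f"{databaseName}.{row.tableName}").columns
    if desiredColumn in cols:
        tablenames.append(row.tableName)

print(tablenames)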

4 More Replies
William_Scardua
by Valued Contributor
  • 2087 Views
  • 3 replies
  • 1 kudos

Resolved! Upsert when the origin does NOT exist, but you need to change status in the target

Hi guys, I have a question about upsert/merge ... What do you do when the origin does NOT exist, but you need to change status in the target? For example: 01/03: source dataset [id = 1 and status = Active]; target table [*not exists*] >> at this time the ...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

Hello @William Scardua, Just adding to what @Vigneshraja Palaniraj replied. Reference: https://docs.databricks.com/sql/language-manual/delta-merge-into.html Thanks & Regards, Nandini
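A minimal sketch of the pattern the thread is after, assuming Databricks Runtime 12.1+ (for WHEN NOT MATCHED BY SOURCE) and placeholder table/column names:

# Hedged sketch: rows present in the source are upserted; target rows that
# no longer appear in the source get their status flipped to 'Inactive'.
spark.sql("""
    MERGE INTO target t
    USING source s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (id, status) VALUES (s.id, s.status)
    WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.status = 'Inactive'
""")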

2 More Replies
Ovi
by New Contributor III
  • 4525 Views
  • 5 replies
  • 3 kudos

Resolved! Filter only Delta tables from an S3 folders list

Hello everyone, From a list of folders on S3, how can I filter which ones are Delta tables, without trying to read each one at a time? Thanks, Ovi

Latest Reply
NandiniN
Databricks Employee
  • 3 kudos

Hello @Ovidiu Eremia, To filter which folders on S3 contain Delta tables, you can look for the specific files that are associated with Delta tables. Delta Lake stores its metadata in a hidden folder named _delta_log, which is located at the root of ...
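A minimal sketch of that check, assuming 'folders' is an existing list of s3:// paths:

# Hedged sketch: a folder is treated as a Delta table if a _delta_log
# directory sits at its root; unreadable or missing paths are skipped.
def is_delta_table(path):
    try:
        return any(f.name.rstrip("/") == "_delta_log" for f in dbutils.fs.ls(path))
    except Exception:
        return False

delta_folders = [p for p in folders if is_delta_table(p)]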

4 More Replies
Dataengineer_mm
by New Contributor
  • 2355 Views
  • 1 reply
  • 1 kudos

Surrogate key using identity column.

I want to create a surrogate key in the Delta table, and I used the identity column (id GENERATED AS DEFAULT). Can I insert rows into the Delta table using only spark.sql, like an INSERT query? Or can I also use write with Delta format options? If I use the df.write ...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

Hello @Menaka Murugesan, If you are using the identity column, I believe you would have created the table as below (starting with value 1 and step 1): CREATE TABLE my_table (id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), value STRING) You can insert values i...
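A minimal sketch of inserting through spark.sql, assuming the table above: list only the non-identity columns and the id is generated automatically.

# Hedged sketch: the identity column is omitted from the INSERT column list.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_table (
        id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1),
        value STRING
    ) USING DELTA
""")
spark.sql("INSERT INTO my_table (value) VALUES ('first'), ('second')")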

sanjay
by Valued Contributor II
  • 8067 Views
  • 3 replies
  • 5 kudos

Resolved! PySpark UDF is taking long to process

Hi, I have a UDF which runs for each Spark dataframe row, does some complex processing, and returns a string output. But it takes very long if the data is 15,000 rows. I have configured the cluster with autoscaling, but it's not spinning up more servers. Please suggest h...

Latest Reply
Lakshay
Databricks Employee
  • 5 kudos

Hi @Sanjay Jain, Python UDFs are generally slower to process because each row must be serialized between the JVM and the Python worker processes, which can also lead to OOM errors. To resolve this issue, please consider the below: Use Spark built-in functions to do the same functionalit...
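Where the custom logic cannot be expressed with built-in functions, a vectorized pandas UDF is the usual middle ground; a minimal sketch (the dataframe, column name, and logic are placeholders):

# Hedged sketch: a pandas UDF processes whole batches instead of one row
# at a time, cutting serialization overhead versus a plain Python UDF.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def process(values: pd.Series) -> pd.Series:
    return values.astype(str).str.upper()  # stand-in for the complex processing

df = df.withColumn("result", process(df["some_col"]))  # df/some_col assumed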

2 More Replies
MShee
by New Contributor II
  • 1530 Views
  • 1 reply
  • 1 kudos
Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

Hello @M Shee, In a dropdown you can select a value from a list of provided values, not type the values in. What you might be interested in is a combobox - it is a combination of a text box and a dropdown. It allows you to select a value from a provided list or ...
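A minimal sketch of the combobox widget (the widget name, default, and choices are arbitrary):

# Hedged sketch: unlike a dropdown, a combobox also accepts a typed-in value.
dbutils.widgets.combobox("env", "dev", ["dev", "staging", "prod"], "Environment")
print(dbutils.widgets.get("env"))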

Lu_Wang_SA_DBX
by Databricks Employee
  • 4127 Views
  • 1 reply
  • 3 kudos


We will host the first Databricks Bay Area User Group meeting in the Databricks Mountain View office on March 14, 2:45-5:00 pm PT. We'll have Dave Mariani - CTO & Founder at AtScale, and Riley Phillips - Enterprise Solution Engineer at Matillion to sha...

Dave Mariani - CTO & Founder, AtScale; Riley Phillips - Enterprise Solution Engineer, Matillion
Latest Reply
amitabharora
Databricks Employee
  • 3 kudos

Looking forward.

Everton_Costa
by New Contributor II
  • 1771 Views
  • 2 replies
  • 1 kudos
Latest Reply
Cami
Contributor III
  • 1 kudos

I hope it helps:
SELECT DATEADD(DAY, rnk - 1, '{{StartDate}}')
FROM (
    WITH lv0(c) AS (SELECT 1 AS c UNION ALL SELECT 1),
    lv1 AS (SELECT t1.c FROM lv0 t1 CROSS JOIN lv0 t2),
    lv2 AS (SELECT t1....
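For comparison, a shorter route to a date series in Databricks SQL uses sequence() with explode(); the start and end dates below are placeholders:

# Hedged sketch: generate one row per day without a tally-table CTE chain.
dates = spark.sql("""
    SELECT explode(sequence(DATE'2023-01-01', DATE'2023-01-31', INTERVAL 1 DAY)) AS d
""")
dates.show()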

1 More Replies
JacintoArias
by New Contributor III
  • 6686 Views
  • 5 replies
  • 1 kudos

Spark predicate pushdown on parquet files when using limit

Hi, While developing an ETL for a large dataset I want to get a sample of the top rows to check that the pipeline "just runs", so I add a limit clause when reading the dataset. I'm surprised to see that instead of creating a single task, as in a sho...

Latest Reply
JacekLaskowski
New Contributor III
  • 1 kudos

It's been a while since the question was asked, and in the meantime Delta Lake 2.2.0 hit the shelves with the exact feature the OP asked about, i.e. LIMIT pushdown: LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT cl...
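A minimal sketch of reading with a limit so the Delta scan (2.2.0+) can prune files; the table path is a placeholder:

# Hedged sketch: with LIMIT pushdown, only enough files to satisfy the
# limit are scanned instead of the whole table.
sample = spark.read.format("delta").load("/path/to/table").limit(100)
sample.show()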

4 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.
