Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
What's the best way to organize our data lake and Delta setup? We’re trying to use the bronze, silver, and gold classification strategy. The main question is: how do we know what classification the data has inside Databricks if there’s no actual physica...
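One common answer (a sketch, not an official convention) is to encode the layer in the schema name, so the classification is visible even without physical separation. The schema and path names below are hypothetical:

```python
# Hypothetical layout: one schema (database) per medallion layer.
for layer in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {layer}")

# The layer a table belongs to is then explicit in its fully qualified name.
df = spark.read.json("/mnt/landing/orders")  # hypothetical landing path
df.write.format("delta").mode("overwrite").saveAsTable("bronze.raw_orders")
```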
Any leads/posts for Databricks CI/CD integration with Bitbucket Pipelines? I am facing the below error while creating my CI/CD pipeline:

pipelines:
  branches:
    master:
      - step:
          name: Deploy Databricks Changes
          image: docker:19.03.12
          services:
            - docker
          script:
            # U...
@Will Heyer: The best method for Power BI connectivity with Partner Connect depends on your specific use case and requirements. Here are some factors to consider for each method:

Access Token with Service Principal: This method uses a client ID and s...
Getting started with Databricks is being made very easy now. Presenting dbdemos. If you're looking to get started with Databricks, there's good news: dbdemos makes it easier than ever, offering a range of demos that you can install directl...
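For reference, installing a demo is only a few lines in a notebook. This is a sketch based on the dbdemos README rather than the post itself, and the demo name below is just one published example:

```python
# In a separate notebook cell first: %pip install dbdemos
import dbdemos

dbdemos.list_demos()                      # browse the available demos
dbdemos.install('lakehouse-retail-c360')  # installs the demo notebooks and assets
```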
That's a great share, Suteja. Is that supposed to work with a Databricks Community Edition account? I had a strange error while trying. Any help is appreciated! Thanks, F
@Janga Reddy: Certainly! Here are the steps for Hive metastore backup and restore on Databricks:

Backup:
1. Stop all running Hive services and jobs on the Databricks cluster.
2. Create a backup directory in DBFS (Databricks File System) where the metadata fi...
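The backup-directory step might look like this in a notebook (a minimal sketch; the path is a hypothetical example and dbutils is the standard Databricks notebook utility):

```python
backup_dir = "/mnt/backups/hive_metastore"  # hypothetical location
dbutils.fs.mkdirs(backup_dir)               # create the DBFS backup directory
display(dbutils.fs.ls("/mnt/backups"))      # verify the directory exists
```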
Trying to connect the dots on the method below, from a new event on Azure Event Hub, through storage, partitions, and Avro records (those I can monitor), to my Delta table. How do I trace/observe the writeStream and the trigger? ...
elif TABLE_TYPE == "live":
print("D...
Hi @David Martin, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...
Hi, I would like to ask for recommendations regarding the size of the driver and the number of executors managed by that driver. I am aware of the best practices regarding executor size/number, but I have doubts about the number of executors a single dr...
Depends on your use case. The best option is to connect Datadog and watch driver and worker utilization: https://docs.datadoghq.com/integrations/databricks/?tab=driveronly
Just from my experience: usually, for big datasets, when you need to autoscale workers between ...
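To make that concrete, the kind of cluster spec being discussed might be shaped like a Clusters API payload such as the one below. All names, sizes, and counts are hypothetical placeholders to tune against the utilization data:

```python
cluster_spec = {
    "cluster_name": "etl-autoscale",      # hypothetical
    "spark_version": "13.3.x-scala2.12",  # hypothetical runtime version
    "node_type_id": "i3.xlarge",          # hypothetical worker instance type
    "driver_node_type_id": "i3.2xlarge",  # larger driver when it manages many executors
    "autoscale": {"min_workers": 2, "max_workers": 10},
}
```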
I have a pandas on spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head also took a long time: 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...
The reason this is slow is that pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
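Building on that explanation, here are two ways to avoid the enumeration (a sketch; the option name comes from the pyspark.pandas docs, while the path and `var1` column are hypothetical):

```python
import pyspark.pandas as ps

# 'distributed' builds a non-sequential default index without enumerating rows.
ps.set_option("compute.default_index_type", "distributed")

# Or supply an existing column as the index so no default index is generated.
df = ps.read_parquet("/mnt/data/table", index_col="var1")
print(df.shape)  # no longer pays the full-enumeration cost
```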
I have [very] recently started using DLT for the first time. One of the challenges I have run into is how to include other "modules" within my pipelines. I missed the documentation where magic commands (with the exception of %pip) are ignored and was...
I like the approach @Arvind Ravish shared, since you can't currently use %run in DLT pipelines. However, it took a little testing to be clear on exactly how to make it work. First, ensure in the Admin Console that the Repos feature is configured as f...
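As I understand the workaround, it boils down to keeping shared code in a plain .py file in a repo and importing it via sys.path, roughly like this (the repo path, module, and table names are hypothetical):

```python
import sys

# Hypothetical repo folder containing shared_transforms.py (a plain file, not a notebook).
sys.path.append("/Workspace/Repos/me@example.com/my-repo/utils")

import shared_transforms
import dlt

@dlt.table
def cleaned_events():
    return shared_transforms.clean(spark.read.table("bronze.events"))
```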
We moved to Databricks a few months ago; before that we were on SQL Server. So all our tables and databases follow the "camel case" rule. Apparently, in Databricks the rule is "lower case with underscores". Where can we find an official doc...
Hi @Salah KHALFALLAH, looking at the documentation, it appears that Databricks' preferred naming convention is lowercase with underscores, as you mentioned. The reason for this is most likely that Databricks uses the Hive Metastore, which is case insens...
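You can see the case-insensitivity directly: identifiers are stored lowercased in the metastore, so mixed-case names don't round-trip. A small illustration (the table name is made up):

```python
spark.sql("CREATE TABLE IF NOT EXISTS MyCamelCaseTable (Id INT) USING delta")
spark.sql("SHOW TABLES").show()  # the table is listed as 'mycamelcasetable'
```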
AWS quickstart - CloudFormation failure
When deploying your workspace with the recommended AWS quickstart method, a CloudFormation template will be launched in your AWS account. If you experience a failure with an error message along the lines of ROL...
In the past, before Databricks, I would try to pull commonly used functions and features out of notebooks and save them in a Python library that the whole team would work on and develop. This allowed for good code reuse and maintaining best practic...
The way we do this is to package as much reusable code as possible into a common library and then test it to within an inch of its life with unit tests (I tend to use unittest for the lower barrier to entry, but whichever framework works best for y...
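To illustrate the pattern, here is a minimal unittest sketch: a pure function of the sort that would live in the common library, tested in isolation (add_load_date is a made-up example, not from the thread):

```python
import unittest
from datetime import date

def add_load_date(rows):
    """Return a copy of each row dict with today's load_date stamped on."""
    return [dict(row, load_date=date.today()) for row in rows]

class AddLoadDateTest(unittest.TestCase):
    def test_adds_load_date_column(self):
        result = add_load_date([{"id": 1}])
        self.assertEqual(result[0]["id"], 1)
        self.assertEqual(result[0]["load_date"], date.today())

if __name__ == "__main__":
    unittest.main()
```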
I'm currently trying to follow the Software engineering best practices for notebooks - Azure Databricks guide, but I keep running into the following during step 4.5 (Run the test):

============================= test session starts =======================...
Closing the loop on this in case anyone gets stuck in the same situation. You can see in the images that transforms_test.py shows a different icon than testdata.csv. This is because it was saved as a Jupyter notebook, not a .py file. When the ...
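For contrast, a plain .py pytest file is ordinary Python source with nothing notebook-specific in it (the function below is a hypothetical stand-in for the guide's code); if the file is saved as a Jupyter notebook instead, pytest cannot collect it:

```python
def add_one(x):
    return x + 1

def test_add_one():
    assert add_one(1) == 2
```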