Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Anonymous
by Not applicable
  • 6920 Views
  • 15 replies
  • 8 kudos

Resolved! What are some best practices for CI/CD?

A number of people have questions about using Databricks in a production environment. What are the best practices for enabling CI/CD automation?

Latest Reply
BaivabMohanty
New Contributor II
  • 8 kudos

Are there any leads or posts on Databricks CI/CD integration with Bitbucket Pipelines? I am facing the below error while creating my CI/CD pipeline: pipelines: branches: master: - step: name: Deploy Databricks Changes image: docker:19.03.12 services: - docker script: # U...

14 More Replies
Smitha1
by Valued Contributor II
  • 2769 Views
  • 9 replies
  • 3 kudos

Databricks Certified Associate Developer for Apache Spark 3.0

Latest Reply
Shivam_Patil
New Contributor II
  • 3 kudos

Hey, I am looking for sample papers for the above exam other than the one provided by Databricks. Does anyone have any idea about it?

8 More Replies
User16776430979
by New Contributor III
  • 28410 Views
  • 5 replies
  • 6 kudos

Best practices around bronze/silver/gold (medallion model) data lake classification?

What's the best way to organize our data lake and Delta setup? We’re trying to use the bronze, silver, and gold classification strategy. The main question is how do we know which classification the data has inside Databricks if there’s no actual physica...

Latest Reply
-werners-
Esteemed Contributor III
  • 6 kudos

With Unity Catalog taken into account, it is certainly a good idea to think about your physical data storage. As you cannot have overlap between volumes and tables, this can become cumbersome. For example, we used to store Delta tables of a data object in the same dir...

4 More Replies
WillHeyer
by New Contributor II
  • 2941 Views
  • 1 reply
  • 2 kudos

Resolved! Best Practices for PowerBI Connectivity w/ Partner Connect. Access Token w/ Service Principal, Databricks Username w/ Service account, or OAuth?

I'm aware all are possible methods, but are they all equal? Or is the matter trivial? Thank you so much!

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Will Heyer: The best method for Power BI connectivity with Partner Connect depends on your specific use case and requirements. Here are some factors to consider for each method: Access Token with Service Principal: This method uses a client ID and s...

Anonymous
by Not applicable
  • 1599 Views
  • 3 replies
  • 2 kudos

www.dbdemos.ai

Getting started with Databricks is now very easy. Presenting dbdemos. If you're looking to get started with Databricks, there's good news: dbdemos makes it easier than ever. This platform offers a range of demos that you can install directl...

Latest Reply
FJ
Contributor III
  • 2 kudos

That's a great share, Suteja. Is that supposed to work with a Databricks Community Edition account? I had a strange error while trying. Any help is appreciated! Thanks, F

2 More Replies
Phani1
by Valued Contributor
  • 2681 Views
  • 1 reply
  • 0 kudos

Best practices/steps for Hive metastore backup and restore

Hi Team, could you share with us the best practices/steps for Hive metastore backup and restore? Regards, Phanindra

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Janga Reddy: Certainly! Here are the steps for Hive metastore backup and restore on Databricks. Backup: Stop all running Hive services and jobs on the Databricks cluster. Create a backup directory in DBFS (Databricks File System) where the metadata fi...

sbux
by New Contributor
  • 1665 Views
  • 2 replies
  • 0 kudos

What is the best practice for tracing Databricks observe and writeStream data record flow?

I'm trying to connect the dots on the method below, from a new event on Azure Event Hubs through storage, partition, and Avro records (those I can monitor) to my Delta table. How do I trace observe, writeStream, and the trigger? ... elif TABLE_TYPE == "live": print("D...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @David Martin, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

1 More Replies
KVNARK
by Honored Contributor II
  • 8525 Views
  • 2 replies
  • 5 kudos

Resolved! PySpark optimizations and best practices

What can we implement to get the most optimization, and what are the best practices for using PySpark end to end?

Latest Reply
daniel_sahal
Esteemed Contributor
  • 5 kudos

@KVNARK This video is cool: https://www.youtube.com/watch?v=daXEp4HmS-E

1 More Replies
alvaro_databric
by New Contributor III
  • 684 Views
  • 1 reply
  • 0 kudos

Relation between Driver and Executor size

Hi, I would like to ask for recommendations regarding the size of the driver and the number of executors managed by that driver. I am aware of the best practices regarding executor size/number, but I have doubts about the number of executors a single dr...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

It depends on your use case. The best approach is to connect Datadog and look at driver and worker utilization: https://docs.datadoghq.com/integrations/databricks/?tab=driveronly Just from my experience, usually, for big datasets, when you need to autoscale workers between ...

Vik1
by New Contributor II
  • 7220 Views
  • 4 replies
  • 5 kudos

Some very simple functions in Pandas on Spark are very slow

I have a pandas-on-Spark DataFrame with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head also takes a long time: it took 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...

Latest Reply
PeterDowdy
New Contributor II
  • 5 kudos

This is slow because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...

3 More Replies
jeremy1
by New Contributor II
  • 7078 Views
  • 10 replies
  • 7 kudos

DLT and Modularity (best practices?)

I have [very] recently started using DLT for the first time. One of the challenges I have run into is how to include other "modules" within my pipelines. I missed the documentation where magic commands (with the exception of %pip) are ignored and was...

Latest Reply
Greg_Galloway
New Contributor III
  • 7 kudos

I like the approach @Arvind Ravish shared, since you can't currently use %run in DLT pipelines. However, it took a little testing to be clear on how exactly to make it work. First, ensure in the Admin Console that the repos feature is configured as f...

9 More Replies
Spauk
by New Contributor II
  • 9788 Views
  • 5 replies
  • 7 kudos

Resolved! Best Practices for naming Tables and Databases in Databricks

We moved to Databricks a few months ago; before that we were on SQL Server. So all our tables and databases follow the "camel case" rule. Apparently, in Databricks the rule is "lower case with underscore". Where can we find an official doc...

Latest Reply
LandanG
Honored Contributor
  • 7 kudos

Hi @Salah KHALFALLAH, looking at the documentation, it appears that Databricks' preferred naming convention is lowercase with underscores, as you mentioned. This is most likely because Databricks uses Hive Metastore, which is case insens...

4 More Replies
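
For anyone facing the same migration from camel-case names to lowercase with underscores, here is a minimal sketch of the renaming step. It assumes the existing names are plain ASCII identifiers; `to_snake_case` is a hypothetical helper written for illustration, not part of any Databricks API, and the sample names in the usage note are made up.

```python
import re

# Hypothetical helper (not a Databricks API): converts a camel-case or
# Pascal-case identifier to lowercase with underscores.
_ACRONYM_BOUNDARY = re.compile(r"([A-Z]+)([A-Z][a-z])")
_WORD_BOUNDARY = re.compile(r"([a-z0-9])([A-Z])")


def to_snake_case(name: str) -> str:
    # Separate an acronym from the capitalized word that follows it,
    # e.g. "SQLServer" becomes "SQL_Server".
    name = _ACRONYM_BOUNDARY.sub(r"\1_\2", name)
    # Separate a lowercase letter or digit from the capital that follows it,
    # e.g. "salesOrder" becomes "sales_Order".
    name = _WORD_BOUNDARY.sub(r"\1_\2", name)
    return name.lower()
```

For example, `to_snake_case("SalesOrderHeader")` gives `sales_order_header`, and a name that is already lowercase with underscores comes back unchanged, so the function can be run safely over a mixed list of table names.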
User16844487905
by New Contributor III
  • 3048 Views
  • 4 replies
  • 5 kudos

AWS quickstart - CloudFormation failure

When deploying your workspace with the recommended AWS quickstart method, a CloudFormation template will be launched in your AWS account. If you experience a failure with the error message along the lines of ROL...

Latest Reply
yalun
New Contributor III
  • 5 kudos

How do I launch the "Quickstart" again? Where is it in the console?

3 More Replies
spott_submittab
by New Contributor II
  • 4056 Views
  • 9 replies
  • 10 kudos

How are people developing Python libraries for use within a team on Databricks?

In the past, before Databricks, I would try to pull commonly used functions and features out of notebooks and save them in a Python library that the whole team would work on and develop. This allowed for good code reuse and maintaining best practic...

Latest Reply
Kaniz_Fatma
Community Manager
  • 10 kudos

Hi @Andrew Spott, just a friendly follow-up. Do you still need help, or did the above responses help you find the solution? Please let us know.

8 More Replies
Chris_Shehu
by Valued Contributor III
  • 2877 Views
  • 1 reply
  • 5 kudos

Resolved! Getting errors while following Microsoft Databricks Best-Practices for DevOps Integration

I'm currently trying to follow the Software engineering best practices for notebooks - Azure Databricks guide, but I keep running into the following during step 4.5: Run the test ============================= test session starts =======================...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 5 kudos

Closing the loop on this in case anyone gets stuck in the same situation. You can see in the images that the transforms_test.py shows a different icon than the testdata.csv. This is because it was saved as a Jupyter notebook, not a .py file. When the ...
