Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
What's the best way to organize our data lake and Delta setup? We’re trying to use the bronze, silver, and gold classification strategy. The main question is: how do we know what classification the data has inside Databricks if there’s no actual physica...
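One common answer (a sketch, not an official convention) is to encode the layer in the schema name, so the classification is visible even without physical separation. The schema and path names below are hypothetical:

```python
# Hypothetical layout: one schema (database) per medallion layer.
for layer in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {layer}")

# The layer a table belongs to is then explicit in its fully qualified name.
df = spark.read.json("/mnt/landing/orders")  # hypothetical landing path
df.write.format("delta").mode("overwrite").saveAsTable("bronze.raw_orders")
```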
Any leads/posts for Databricks CI/CD integration with Bitbucket Pipelines? I am facing the below error while creating my CI/CD pipeline:

pipelines:
  branches:
    master:
      - step:
          name: Deploy Databricks Changes
          image: docker:19.03.12
          services:
            - docker
          script:
            # U...
@Will Heyer: The best method for Power BI connectivity with Partner Connect depends on your specific use case and requirements. Here are some factors to consider for each method:

Access Token with Service Principal: This method uses a client ID and s...
Getting started with Databricks is being made very easy now. Presenting dbdemos. If you're looking to get started with Databricks, there's good news: dbdemos makes it easier than ever, offering a range of demos that you can install directl...
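For reference, installing a demo is only a few lines in a notebook. This is a sketch based on the dbdemos README rather than the post itself, and the demo name below is just one published example:

```python
# In a separate notebook cell first: %pip install dbdemos
import dbdemos

dbdemos.list_demos()                      # browse the available demos
dbdemos.install('lakehouse-retail-c360')  # installs the demo notebooks and assets
```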
That's a great share, Suteja. Is that supposed to work with a Databricks Community Edition account? I had a strange error while trying. Any help is appreciated! Thanks, F
@Janga Reddy: Certainly! Here are the steps for Hive metastore backup and restore on Databricks:

Backup:
1. Stop all running Hive services and jobs on the Databricks cluster.
2. Create a backup directory in DBFS (Databricks File System) where the metadata fi...
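The backup-directory step might look like this in a notebook (a minimal sketch; the path is a hypothetical example and dbutils is the standard Databricks notebook utility):

```python
backup_dir = "/mnt/backups/hive_metastore"  # hypothetical location
dbutils.fs.mkdirs(backup_dir)               # create the DBFS backup directory
display(dbutils.fs.ls("/mnt/backups"))      # verify the directory exists
```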
Trying to connect the dots on the method below, from a new event on Azure Event Hub, through storage, partitions, and Avro records (those I can monitor), to my Delta table. How do I trace/observe the writeStream and the trigger? ...
elif TABLE_TYPE == "live":
print("D...
Hi @David Martin, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...
Hi, I would like to ask for recommendations regarding the size of the driver and the number of executors managed by that driver. I am aware of the best practices regarding executor size/number, but I have doubts about the number of executors a single dr...
Depends on your use case. The best option is to connect Datadog and watch driver and worker utilization: https://docs.datadoghq.com/integrations/databricks/?tab=driveronly
Just from my experience: usually, for big datasets, when you need to autoscale workers between ...
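To make that concrete, the kind of cluster spec being discussed might be shaped like a Clusters API payload such as the one below. All names, sizes, and counts are hypothetical placeholders to tune against the utilization data:

```python
cluster_spec = {
    "cluster_name": "etl-autoscale",      # hypothetical
    "spark_version": "13.3.x-scala2.12",  # hypothetical runtime version
    "node_type_id": "i3.xlarge",          # hypothetical worker instance type
    "driver_node_type_id": "i3.2xlarge",  # larger driver when it manages many executors
    "autoscale": {"min_workers": 2, "max_workers": 10},
}
```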
I have a pandas on spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head also took a long time: 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...
The reason this is slow is that pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
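Building on that explanation, here are two ways to avoid the enumeration (a sketch; the option name comes from the pyspark.pandas docs, while the path and `var1` column are hypothetical):

```python
import pyspark.pandas as ps

# 'distributed' builds a non-sequential default index without enumerating rows.
ps.set_option("compute.default_index_type", "distributed")

# Or supply an existing column as the index so no default index is generated.
df = ps.read_parquet("/mnt/data/table", index_col="var1")
print(df.shape)  # no longer pays the full-enumeration cost
```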
I have [very] recently started using DLT for the first time. One of the challenges I have run into is how to include other "modules" within my pipelines. I missed the documentation where magic commands (with the exception of %pip) are ignored and was...
I like the approach @Arvind Ravish shared, since you can't currently use %run in DLT pipelines. However, it took a little testing to be clear on exactly how to make it work. First, ensure in the Admin Console that the Repos feature is configured as f...
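As I understand the workaround, it boils down to keeping shared code in a plain .py file in a repo and importing it via sys.path, roughly like this (the repo path, module, and table names are hypothetical):

```python
import sys

# Hypothetical repo folder containing shared_transforms.py (a plain file, not a notebook).
sys.path.append("/Workspace/Repos/me@example.com/my-repo/utils")

import shared_transforms
import dlt

@dlt.table
def cleaned_events():
    return shared_transforms.clean(spark.read.table("bronze.events"))
```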
We moved to Databricks a few months ago; before that we were on SQL Server. So all our tables and databases follow the "camel case" rule. Apparently, in Databricks the rule is "lower case with underscores". Where can we find an official doc...
Hi @Salah KHALFALLAH, looking at the documentation, it appears that Databricks' preferred naming convention is lowercase with underscores, as you mentioned. The reason for this is most likely that Databricks uses the Hive Metastore, which is case insens...
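You can see the case-insensitivity directly: identifiers are stored lowercased in the metastore, so mixed-case names don't round-trip. A small illustration (the table name is made up):

```python
spark.sql("CREATE TABLE IF NOT EXISTS MyCamelCaseTable (Id INT) USING delta")
spark.sql("SHOW TABLES").show()  # the table is listed as 'mycamelcasetable'
```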
AWS quickstart - CloudFormation failure
When deploying your workspace with the recommended AWS quickstart method, a CloudFormation template will be launched in your AWS account. If you experience a failure with an error message along the lines of ROL...
In the past, before Databricks, I would try to pull commonly used functions and features out of notebooks and save them in a Python library that the whole team would work on and develop. This allowed for good code reuse and maintaining best practic...
The way we do this is to package as much reusable code as possible into a common library and then test it to within an inch of its life with unit tests (I tend to use unittest for the lower barrier to entry, but whichever framework works best for y...
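To illustrate the pattern, here is a minimal unittest sketch: a pure function of the sort that would live in the common library, tested in isolation (add_load_date is a made-up example, not from the thread):

```python
import unittest
from datetime import date

def add_load_date(rows):
    """Return a copy of each row dict with today's load_date stamped on."""
    return [dict(row, load_date=date.today()) for row in rows]

class AddLoadDateTest(unittest.TestCase):
    def test_adds_load_date_column(self):
        result = add_load_date([{"id": 1}])
        self.assertEqual(result[0]["id"], 1)
        self.assertEqual(result[0]["load_date"], date.today())

if __name__ == "__main__":
    unittest.main()
```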
I'm currently trying to follow the Software engineering best practices for notebooks - Azure Databricks guide, but I keep running into the following during step 4.5 (Run the test):

============================= test session starts =======================...
Closing the loop on this in case anyone gets stuck in the same situation. You can see in the images that transforms_test.py shows a different icon than testdata.csv. This is because it was saved as a Jupyter notebook, not a .py file. When the ...
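For contrast, a plain .py pytest file is ordinary Python source with nothing notebook-specific in it (the function below is a hypothetical stand-in for the guide's code); if the file is saved as a Jupyter notebook instead, pytest cannot collect it:

```python
def add_one(x):
    return x + 1

def test_add_one():
    assert add_one(1) == 2
```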