Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

yvuignie
by Contributor
  • 7163 Views
  • 12 replies
  • 3 kudos

Resolved! Unity Catalog - How do you modify groups properly?

Hello, What is the best practice for modifying, deleting, or recreating groups properly? To rename a group, the only way was to delete and recreate it. But after deletion in the account console, the permissions granted to the deleted groups on the tables were i...

Latest Reply
RobinK
Contributor
  • 3 kudos

Hello, I have exactly the same issue, and I am also using Terraform. I deleted a group and the catalog permissions are now in a bad state. I am not able to revoke this group's access using either the Databricks UI or the REST API. I also tried to recreate the group wit...

11 More Replies
Mr__D
by New Contributor II
  • 16083 Views
  • 7 replies
  • 1 kudos

Resolved! Writing modular code in Databricks

Hi all, could you please suggest the best way to write PySpark code in Databricks? I don't want to write my code in a Databricks notebook, but rather create Python files (a modular project) in VS Code and call only the primary function in the notebook (the res...

Latest Reply
Gamlet
New Contributor II
  • 1 kudos

Certainly! To write PySpark code in Databricks while maintaining a modular project in VSCode, you can organize your PySpark code into Python files in VSCode, with a primary function encapsulating the main logic. Then, upload these files to Databricks...

6 More Replies
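
For the "Writing modular code in Databricks" thread above, the layout the post describes (helpers in Python files, one primary function called from the notebook) could be sketched roughly as follows. This is only an illustration: the function names `normalize_name`, `keep_valid`, and `main`, and the module path `my_project/pipeline.py`, are hypothetical, and plain Python lists of dicts stand in for the PySpark DataFrames the real project would use.

```python
# Hypothetical contents of my_project/pipeline.py (names are assumptions).
# Plain lists of dicts stand in for PySpark DataFrames in this sketch.

def normalize_name(name: str) -> str:
    """Collapse repeated whitespace and title-case a name."""
    return " ".join(name.split()).title()


def keep_valid(rows: list) -> list:
    """Drop rows that have no id."""
    return [row for row in rows if row.get("id") is not None]


def main(rows: list) -> list:
    """Primary function: the only entry point the notebook needs to call."""
    return [{**row, "name": normalize_name(row["name"])} for row in keep_valid(rows)]


# In the notebook this would be a single import plus a single call, e.g.
#   from my_project.pipeline import main
result = main([
    {"id": 1, "name": "  ada   lovelace "},
    {"id": None, "name": "ghost"},
])
print(result)
```

Keeping the helpers free of notebook-only objects makes them easy to unit test outside Databricks; how the files reach the workspace is left open here, as it is in the post.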
andrew0117
by Contributor
  • 1083 Views
  • 1 reply
  • 0 kudos

What is the best practice for handling concurrency issues in batch processing?

Normally, our ELT framework takes in batches one by one and loads the data into target tables. But if more than one batch comes in at the same time, the framework breaks because of a concurrency issue: multiple sources are trying to write the ...

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

You can partition your table to reduce the chances of getting this exception.

Anonymous
by Not applicable
  • 9355 Views
  • 3 replies
  • 0 kudos

Resolved! What is the best practice for managing different environments (Staging vs Production) on Databricks?

Should we create separate workspaces for Dev/Test/Prod? Or should we have one workspace and create separate folders for Dev/Test/Prod?

Latest Reply
Srikanth_Gupta_
Valued Contributor
  • 0 kudos

In my experience, it is always good to have different workspaces for different environments. They are easy to maintain and work better with a CI/CD pipeline as well, because a lot of organizations give developers deployment access in the Dev environment but not ...

2 More Replies
Trung
by Contributor
  • 1663 Views
  • 4 replies
  • 0 kudos

Best practice for managing resources when their creators leave the project

Dear DB community! As far as I know, when a resource creator leaves the project, all resources they created are deleted as well. So the question is: can I assign the owner role to a group, and will that help protect the resources from deletion or not...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @trung nguyen, Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedback...

3 More Replies
chanansh
by Contributor
  • 1664 Views
  • 2 replies
  • 0 kudos

Delta table acceleration for group by on key columns using ZORDER does not work

What is the best practice for accelerating queries that look like the following?
win = Window.partitionBy('key1', 'key2').orderBy('timestamp')
df.select('timestamp', (F.col('col1') - F.lag('col1').over(win)).alias('col1_diff'))
I have tried to use OP...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Hanan Shteingart, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

1 More Replies
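
To make the window query from the ZORDER thread above concrete, here is a plain-Python sketch of what `F.col('col1') - F.lag('col1').over(win)` computes: within each (`key1`, `key2`) partition, rows are ordered by `timestamp` and each row's `col1` is compared with the previous row's. The helper name `lagged_diff` and the sample rows are invented for illustration, and this is not how Spark executes the query. It only shows that every row of a key pair must be grouped and sorted before the difference can be taken.

```python
from collections import defaultdict


def lagged_diff(rows):
    """Return (timestamp, col1_diff) pairs, partitioned by (key1, key2).

    col1_diff is col1 minus the previous row's col1 within the same
    partition, ordered by timestamp. The first row of each partition has
    no predecessor, so its difference is None (like a NULL from lag).
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["key1"], row["key2"])].append(row)

    result = []
    for key in sorted(partitions):  # fixed partition order, for readability only
        ordered = sorted(partitions[key], key=lambda r: r["timestamp"])
        previous = None
        for row in ordered:
            diff = None if previous is None else row["col1"] - previous["col1"]
            result.append((row["timestamp"], diff))
            previous = row
    return result


# Invented sample data, deliberately out of timestamp order.
sample = [
    {"key1": "a", "key2": "x", "timestamp": 2, "col1": 15},
    {"key1": "a", "key2": "x", "timestamp": 1, "col1": 10},
    {"key1": "b", "key2": "y", "timestamp": 1, "col1": 7},
    {"key1": "a", "key2": "x", "timestamp": 3, "col1": 12},
]
diffs = lagged_diff(sample)
print(diffs)
```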
flobib123
by New Contributor III
  • 6247 Views
  • 5 replies
  • 3 kudos

Resolved! Syntax highlight support for Python multiline SQL strings

I would like to know whether a tool is available in Databricks to handle SQL syntax highlighting inside strings, like this VS Code plugin: https://marketplace.visualstudio.com/items?itemName=ptweir.python-string-sql Thank you.

Latest Reply
flobib123
New Contributor III
  • 3 kudos

Hello @Vidula Khanna, No, no one answered my question correctly, but I'm using PySpark now, so I'm not on this topic anymore. But thank you for taking the time to answer me.

4 More Replies
AJDJ
by New Contributor III
  • 2767 Views
  • 2 replies
  • 6 kudos

Cost as per the Databricks demo

Hi there, I came across this Databricks demo at the link below: https://youtu.be/BqB7YQ1-KKc Kindly fast-forward to 16:30 or 16:45 in the video and watch the few minutes related to cost. My understanding is the data is in the lake and datab...

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hi @AJ DJ​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you.Thanks!

1 More Replies
Gim
by Contributor
  • 63463 Views
  • 3 replies
  • 9 kudos

Best practice for logging in Databricks notebooks?

What is the best practice for logging in Databricks notebooks? I have a bunch of notebooks that run in parallel through a workflow. I would like to keep track of everything that happens such as errors coming from a stream. I would like these logs to ...

Latest Reply
karthik_p
Esteemed Contributor
  • 9 kudos

@Gimwell Young As @Debayan Mukherjee mentioned, if you configure verbose logging at the workspace level, logs will be delivered to the storage bucket that you provided during configuration. From there you can pull logs into any of your licensed log mo...

2 More Replies
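
For the notebook logging question above, one possible starting point is a named logger per notebook built on Python's standard `logging` module, so that messages from notebooks running in parallel can be told apart. This is a minimal sketch, not a Databricks-specific recommendation: the helper name `get_notebook_logger`, the `notebooks.` logger prefix, and the notebook name `ingest_orders` are all assumptions, and where the handler should write (driver log, file, external service) is left open.

```python
import io
import logging
import sys


def get_notebook_logger(notebook_name, stream=None, level=logging.INFO):
    """Return a logger tagged with the notebook name.

    The handler is attached only once, so re-running the cell does not
    produce duplicate log lines.
    """
    logger = logging.getLogger(f"notebooks.{notebook_name}")
    logger.setLevel(level)
    logger.propagate = False  # avoid double output through the root logger
    if not logger.handlers:
        target = stream if stream is not None else sys.stderr
        handler = logging.StreamHandler(target)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# An in-memory buffer stands in for the real log destination in this sketch.
buffer = io.StringIO()
log = get_notebook_logger("ingest_orders", stream=buffer)
log.info("batch started")
try:
    raise ValueError("bad record")
except ValueError:
    # logger.exception records the message at ERROR level plus the traceback.
    log.exception("stream batch failed")

print(buffer.getvalue())
```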
alejandrofm
by Valued Contributor
  • 2427 Views
  • 4 replies
  • 2 kudos

Resolved! Orphan (?) files on Databricks S3 bucket

Hi, I'm seeing a lot of empty (and non-empty) directories on routes like:
xxxxxx.jobs/FileStore/job-actionstats/
xxxxxx.jobs/FileStore/job-result/
xxxxxx.jobs/command-results/
Can I create a lifecycle to delete old objects (files/directories)? how many days? w...

Latest Reply
alejandrofm
Valued Contributor
  • 2 kudos

Hi! I didn't know that. I'm purging right now. Is there a way to schedule that so logs are retained for less time? Maybe I want to keep only the last 7 days for everything? Thanks!

3 More Replies
j02424
by New Contributor
  • 3072 Views
  • 1 reply
  • 4 kudos

Best practice to delete /dbfs/tmp ?

What is the best practice regarding the tmp folder? We have a very large amount of data in that folder and are not sure whether to delete it, back it up, etc.

Latest Reply
Debayan
Databricks Employee
  • 4 kudos

/dbfs/tmp can contain a lot of files, including temporary system files used for intermediate calculations, as well as subdirectories that may contain packages from user-defined installations. It is always better to back up the files.

palzor
by New Contributor III
  • 867 Views
  • 0 replies
  • 2 kudos

What is the best practice when loading a Delta table: do I infer the schema or provide the schema?

I am loading Avro files into Delta tables. I am doing this for multiple tables; some files are big (2-3 GB) and most of them are small, just a few MB. I am using Auto Loader to load the data into the Delta tables. My question is: What is the ...

shawncao
by New Contributor II
  • 1961 Views
  • 3 replies
  • 2 kudos

Best practice for using the Databricks API

Hello, I'm building a Databricks connector to allow users to issue commands/SQL from a web app. In general, I think the REST API is okay to work with, though it's pretty tedious to write wrapper code for each API call. [Q1] Is there an official (or semi-off...

Latest Reply
shawncao
New Contributor II
  • 2 kudos

I don't know if I fully understand DBX. It sounds like a job client for managing jobs and deployment, and I don't see Node.js support for this project yet. My question was about how to "stream" query results back from Databricks in a Node.js application, curr...

2 More Replies