Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

jose_gonzalez
by Databricks Employee
  • 2662 Views
  • 1 replies
  • 0 kudos

Resolved! How to troubleshoot Python version mismatch errors in DBConnect?

I'm getting some weird messages when trying to run Databricks Connect. I would like to know if there is a troubleshooting guide for solving Python version mismatch errors.

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

We have a troubleshooting section in our docs that should help you solve this issue. Please check the docs here: https://docs.databricks.com/dev-tools/databricks-connect.html#python-version-mismatch
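For illustration, here is a minimal sketch of how you might compare the client and cluster Python versions (the zero-argument UDF approach is just one way to read the executor-side version; Databricks Connect requires the client's minor version to match the cluster's):

```python
# Sketch, assuming databricks-connect is installed and configured.
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

# Local (client) Python version:
print(f"Client Python: {sys.version_info.major}.{sys.version_info.minor}")

spark = SparkSession.builder.getOrCreate()

@udf("string")
def cluster_python():
    # This body runs on the cluster's executors, not the local machine.
    import sys
    return f"{sys.version_info.major}.{sys.version_info.minor}"

# Evaluate the UDF once to read back the cluster-side version:
spark.range(1).select(cluster_python().alias("cluster_python")).show()
```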

jose_gonzalez
by Databricks Employee
  • 2204 Views
  • 1 replies
  • 0 kudos

Resolved! Can I use DBConnect for my Structured Streaming jobs?

I would like to know if I can use DBConnect to run all my Structured Streaming jobs.

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Unfortunately, no. You cannot use DBConnect for your streaming jobs; this is one of DBConnect's limitations. For more details, please check the docs: https://docs.databricks.com/dev-tools/databricks-connect.html#limitations

User16826992666
by Valued Contributor
  • 2586 Views
  • 1 replies
  • 0 kudos

Resolved! How often should I run OPTIMIZE on my Delta Tables?

I know it's important to periodically run Optimize on my Delta tables, but how often should I be doing this? Am I supposed to do this after every time I load data?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

It would depend on how frequently you update the table and how often you read it. If you have a daily ETL job updating a Delta table, it might make sense to run OPTIMIZE at the end of it so that subsequent reads would benefit from the performance improvements.
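As a sketch, scheduling OPTIMIZE at the tail of a daily ETL job might look like the following (the table name `events` and Z-order column `event_date` are placeholders, not from this thread):

```python
# Sketch: compacting a Delta table at the end of a daily ETL run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ... daily ETL appends to the `events` Delta table here ...

# Compact small files so subsequent reads are faster; the optional
# ZORDER BY clause co-locates rows on a commonly filtered column:
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```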

User16826992666
by Valued Contributor
  • 3531 Views
  • 1 replies
  • 0 kudos

Resolved! How do I know which worker type to choose when creating my cluster?

I am new to using Databricks and want to create a cluster, but there are many different worker types to choose from. How do I know which worker type is the right type for my use case?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

For Delta workloads, where you could benefit from caching, it is recommended to use storage optimized instances that come with NVMe SSDs. For other workloads, it would be a good idea to check Ganglia metrics to see whether your workload is CPU- or memory-bound and choose an instance family accordingly.
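As an illustrative aside (the instance family named is a cloud-specific assumption), the Delta cache pairs with storage optimized workers such as AWS's i3 series, and its session-level switch is a documented Spark conf:

```python
# Sketch: the Delta cache is enabled by default on storage optimized
# instance types with NVMe SSDs (e.g. AWS i3 series); on other types
# it can be toggled explicitly. `spark` is the SparkSession that
# Databricks notebooks predefine.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```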

User16826992666
by Valued Contributor
  • 2217 Views
  • 2 replies
  • 1 kudos

Can you run non-Spark jobs on Databricks?

Is Spark the only type of code that can run on a Databricks cluster?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

Databricks has a Runtime for Machine Learning that comes with a lot of libraries/frameworks pre-installed. This allows you to run, for example, PyTorch or TensorFlow code without worrying about infrastructure setup, configuration, and dependency management.
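To make that concrete, here is a minimal sketch of non-Spark code running on an ML Runtime cluster (assuming a runtime where torch ships pre-installed):

```python
# Sketch: plain single-node PyTorch on the driver, no Spark APIs involved.
import torch

x = torch.randn(4, 3)
w = torch.randn(3, 2)
print(x @ w)  # ordinary tensor math executed locally on the driver
```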

1 More Reply
User16826992666
by Valued Contributor
  • 1464 Views
  • 1 replies
  • 0 kudos

Resolved! What options do I have for controlling end user access to data?

For security and privacy reasons I need to limit what datasets are available for access by end users. How can I accomplish this in a Databricks workspace?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Unity Catalog is the recommended approach, as it lets you manage fine-grained data permissions using standard ANSI SQL or the UI. More details can be found in the Unity Catalog docs.
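A short sketch of the ANSI SQL permission model (the catalog, schema, table, and group names here are placeholders; `spark` is the SparkSession predefined in Databricks notebooks):

```python
# Sketch: granting and revoking fine-grained access with standard SQL.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `contractors`")
```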

User16826992666
by Valued Contributor
  • 2559 Views
  • 2 replies
  • 0 kudos

Resolved! When should I set the cluster mode to High Concurrency vs Standard?

How do I know which mode I should be using when creating a cluster?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs, for example data scientists sharing a cluster. They come with Query Watchdog, a process which keeps disruptive queries in check by automatically terminating queries that would otherwise monopolize cluster resources.
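For reference, a sketch of the documented Query Watchdog settings (the threshold value shown is illustrative; `spark` is the notebook's predefined SparkSession):

```python
# Sketch: Query Watchdog kills queries whose output-to-input row ratio
# explodes, e.g. from an accidental cross join.
spark.conf.set("spark.databricks.queryWatchdog.enabled", True)
spark.conf.set("spark.databricks.queryWatchdog.outputRatioThreshold", 1000)
```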

1 More Reply
User16826992666
by Valued Contributor
  • 2281 Views
  • 1 replies
  • 1 kudos

Resolved! How long does the automatic notebook Revision History store the changes?

I am wondering how far back I can restore old versions of my notebook.

Latest Reply
User16137833804
Databricks Employee
  • 1 kudos

I believe it stores revisions going back to the creation of the notebook, assuming the revision history doesn't get cleared.

User16826989884
by New Contributor
  • 1654 Views
  • 1 replies
  • 0 kudos

Chargeback in Azure Databricks

What is the best way to monitor consumption and cost in Azure Databricks? The ultimate goal is to allocate consumption by team/workspace.

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

If your goal is to charge back other teams or business units based on consumption, then you should enforce tags on all clusters/compute. These tags will show up on your Azure bill, so you can identify which groups used which resources.
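As a sketch of what such tagging looks like, the `custom_tags` field of a cluster spec (as accepted by the Clusters API and enforceable via cluster policies) carries the key/value pairs; the tag names and values below are placeholders:

```python
# Sketch: a cluster spec fragment with chargeback tags.
cluster_spec = {
    "cluster_name": "etl-data-eng",
    "custom_tags": {
        "team": "data-engineering",
        "cost_center": "1234",
    },
}
```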

sajith_appukutt
by Honored Contributor II
  • 2481 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

If you are using pools, then you should consider keeping a min idle count of machines greater than 2. This will allow you to have machines available and ready to use. If you have 0 machines on idle, then the first job executed against the pool will have to wait for new instances to be provisioned.
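For illustration, a sketch of an instance pool spec keeping warm machines available (field names follow the Instance Pools API; the pool name and node type are placeholders):

```python
# Sketch: a pool that always keeps two idle instances ready.
pool_spec = {
    "instance_pool_name": "warm-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,
}
```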

User16826992783
by New Contributor II
  • 1687 Views
  • 1 replies
  • 1 kudos

Receiving a "Databricks Delta is not enabled on your account" error

The team is using Databricks Light for some pipeline development and would like to leverage Delta, but we are running into this error: "Databricks Delta is not enabled on your account". How can we enable Delta for our account?

Latest Reply
craig_ng
New Contributor III
  • 1 kudos

Databricks Light is the open source Apache Spark runtime and does not come with any type of client for Delta Lake pre-installed. You'll need to manually install open source Delta Lake in order to do any reads or writes. See our docs and release notes for details.
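A minimal sketch of wiring open source Delta Lake into a plain Apache Spark session (the delta-core version must match your Spark/Scala version; 1.0.0 below is only an example):

```python
# Sketch: open source Delta Lake on a plain Spark runtime.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```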

Anonymous
by Not applicable
  • 1890 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Under this mechanism, writes operate in three stages:
  • Read: reads (if needed) the latest available version of the table to identify which files need to be modified (that is, rewritten).
  • Write: stages all the changes by writing new data files.
  • Validate and commit: checks whether the proposed changes conflict with any changes committed concurrently since the snapshot that was read; if not, commits them as a new table version.
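To show the practical consequence, here is a hedged sketch of retrying a write when the validate-and-commit stage detects a conflicting concurrent write (the exception name follows the open source delta-spark package; the table name is a placeholder):

```python
# Sketch: retry an append that loses the optimistic-concurrency race.
from delta.exceptions import DeltaConcurrentModificationException
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

for attempt in range(3):
    try:
        df.write.format("delta").mode("append").saveAsTable("events")
        break
    except DeltaConcurrentModificationException:
        continue  # another writer committed first; retry the write
```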

Anonymous
by Not applicable
  • 1287 Views
  • 1 replies
  • 0 kudos

Resolved! Is it possible to have time travel capability but also be able to selectively vacuum ?

I would like to have time travel functionality for several months, but that adds to storage costs. Is there some way to have a mix of VACUUM and time travel?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

There is not a way to time travel past the vacuum retention period. If you would like to time travel back, let's say, 3 months, then you are not able to vacuum with a shorter retention window.
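As a sketch, keeping roughly 90 days of time travel means setting the retention before vacuuming (the table name is a placeholder; 2160 hours equals 90 days; `spark` is the notebook's predefined SparkSession):

```python
# Sketch: align the deleted-file retention with the desired time travel
# window, then vacuum only files older than that window.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 90 days'
    )
""")
spark.sql("VACUUM events RETAIN 2160 HOURS")
```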

User16752241457
by New Contributor II
  • 1114 Views
  • 1 replies
  • 0 kudos

Overwriting Delta Table Using SQL

I have a Delta table that is updated nightly, which I drop and recreate at the start of each day. However, this isn't ideal because every time I drop the table I lose all the info in the transaction log. Is there a way that I can do the equivalent of:...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

I think you are looking for the INSERT OVERWRITE command in Spark SQL. Check out the documentation here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-dml-insert-overwrite-table.html
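For illustration, a minimal sketch (table names are placeholders; `spark` is the SparkSession predefined in Databricks notebooks):

```python
# Sketch: replace the table's contents in place, so the transaction log
# (and time travel history) is preserved instead of being dropped.
spark.sql("""
    INSERT OVERWRITE TABLE nightly_snapshot
    SELECT * FROM staging_source
""")
```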

