Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826992666
by Valued Contributor
  • 1492 Views
  • 1 reply
  • 0 kudos

Resolved! When should I turn on multi-cluster load balancing on SQL Endpoints?

I see the option to enable multi-cluster load balancing when creating a SQL Endpoint, but I don't know if I should be using it or not. How do I know when I should enable it?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

It is best to enable multi-cluster load balancing on SQL endpoints when many users will be running queries concurrently. Load balancing helps isolate the queries and ensure the best performance for all users. If you only have a few users running...
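As a rough sketch, enabling load balancing amounts to setting max_num_clusters above min_num_clusters when creating the endpoint. The snippet below assumes the SQL endpoints REST API path (/api/2.0/sql/endpoints); the host, token, and sizing values are placeholders:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder

    # max_num_clusters > min_num_clusters turns on multi-cluster load
    # balancing: the endpoint adds clusters as concurrent query load grows.
    payload = {
        "name": "analytics-endpoint",
        "cluster_size": "Medium",
        "min_num_clusters": 1,
        "max_num_clusters": 4,
        "auto_stop_mins": 30,
    }

    resp = requests.post(
        f"{HOST}/api/2.0/sql/endpoints",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()
    print(resp.json())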

User16856693631
by New Contributor II
  • 4849 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16856693631
New Contributor II
  • 0 kudos

Yes, you can. Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, Databricks recommends that you export results before they expire. For more information, see https://docs.databricks.com/jobs.html#export...
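A minimal sketch of such an export using the Jobs API runs/export endpoint; the workspace URL, token, and run id are placeholders:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder

    # Export the rendered views of a finished run before the
    # 60-day retention window expires.
    resp = requests.get(
        f"{HOST}/api/2.0/jobs/runs/export",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"run_id": 12345, "views_to_export": "ALL"},  # hypothetical run id
    )
    resp.raise_for_status()

    # Each exported view is returned as HTML content.
    for view in resp.json().get("views", []):
        with open(f"{view['name']}.html", "w") as f:
            f.write(view["content"])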

User16826992666
by Valued Contributor
  • 1197 Views
  • 1 reply
  • 0 kudos

Resolved! How much space does the metadata for a Delta table take up?

If you have a lot of transactions in a table it seems like the Delta log keeping track of all those transactions would get pretty large. Does the size of the metadata become a problem over time?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Yes, the size of the metadata can become a problem over time, though because of storage costs rather than performance. Delta performance will not degrade due to the size of the metadata, but your cloud storage bill can increase. By default Delta h...
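If log storage becomes a concern, the retention window can be tightened per table. A minimal sketch, assuming a hypothetical table name (the delta.logRetentionDuration property defaults to 30 days):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Shorten how long transaction-log entries are retained; log files
    # older than this interval are removed at checkpoint cleanup time.
    spark.sql("""
        ALTER TABLE my_delta_table
        SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
    """)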

Anonymous
by Not applicable
  • 925 Views
  • 1 reply
  • 0 kudos

Resolved! Delta Sharing internally?

If we don't have any datasets to be shared with external companies, does that mean Delta Sharing is not relevant for our org? Is there any use case for using it internally?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Delta Sharing can be done both externally and internally. One use case for sharing internally would be two separate business units that want to share data with each other without exposing their entire Lakehouse to the other unit.
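On the consuming side, reading an internal share looks the same as reading an external one. A minimal sketch with the delta-sharing Python client; the profile path and share/schema/table names are hypothetical:

    import delta_sharing

    # Profile file distributed by the providing business unit.
    profile = "/dbfs/FileStore/shares/finance.share"   # hypothetical path
    table_url = profile + "#sales_share.curated.orders"

    # Load the shared table without any access to the provider's Lakehouse.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())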

User16830818524
by New Contributor II
  • 935 Views
  • 1 reply
  • 0 kudos

Is it possible to read a Delta table directly using Koalas?

Can I read a Delta table directly using Koalas or do I need to read using Spark and then convert the Spark dataframe to a Koalas dataframe?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Yes, you can use the "read_delta" function; see the Koalas documentation for details.
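A minimal sketch, with a hypothetical table path:

    import databricks.koalas as ks

    # Read the Delta table straight into a Koalas DataFrame -- no need to
    # go through a Spark DataFrame and convert afterwards.
    kdf = ks.read_delta("/mnt/data/events")   # hypothetical path
    print(kdf.head())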

sajith_appukutt
by Honored Contributor II
  • 1229 Views
  • 1 reply
  • 2 kudos

Resolved! Unable to get mlflow central model registry to work with dbconnect.

I'm working on setting up tooling to allow team members to easily register and load models from a central MLflow model registry via dbconnect. However, after following the instructions in the public docs, I'm hitting this error: raise _NoDbutilsError mlfl...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 2 kudos

You could monkey-patch MLflow's _get_dbutils() with something similar to the snippet below to get this working while connecting from dbconnect.
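A reconstruction of the truncated snippet; the module hosting _get_dbutils (mlflow.utils.databricks_utils here) is an assumption based on MLflow's internals at the time:

    from pyspark.sql import SparkSession
    from pyspark.dbutils import DBUtils
    import mlflow.utils.databricks_utils

    spark = SparkSession.builder.getOrCreate()

    # monkey-patch MLflow's _get_dbutils() so it resolves DBUtils through
    # the dbconnect SparkSession instead of raising _NoDbutilsError
    def _get_dbutils():
        return DBUtils(spark)

    mlflow.utils.databricks_utils._get_dbutils = _get_dbutils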

aladda
by Honored Contributor II
  • 937 Views
  • 1 reply
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Generally, interactive clusters and jobs are better suited for data engineering and transformations as they support more than just SQL. However, if you are using pure SQL, then endpoints can be used for data transformations. All of the Spark SQL fun...
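As an illustration of the pure-SQL case, a transformation like the one below uses only Spark SQL constructs and so could run on an endpoint as well as on a cluster; the table names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A pure-SQL transformation: aggregate raw orders into a daily
    # revenue table. Nothing here requires Python or Scala.
    spark.sql("""
        CREATE OR REPLACE TABLE daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY order_date
    """)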

aladda
by Honored Contributor II
  • 888 Views
  • 1 reply
  • 0 kudos

Resolved! Does the Jobs API allow executing an older version of a Notebook using version history?

I see the revision_timestamp parameter on NotebookTask: https://docs.databricks.com/dev-tools/api/latest/jobs.html#jobsnotebooktask. An example of how to invoke it would be helpful.

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

You can use the Databricks built-in version control feature, coupled with the NotebookTask Jobs API, to specify a specific version of the notebook based on the timestamp of the save, defined in Unix timestamp format.
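A reconstruction of the truncated curl command; the instance URL, cluster id, notebook path, and timestamp are placeholders:

    curl -n -X POST -H 'Content-Type: application/json' \
      https://<databricks-instance>/api/2.0/jobs/runs/submit \
      -d '{
            "run_name": "notebook-at-revision",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {
              "notebook_path": "/Users/me@example.com/my-notebook",
              "revision_timestamp": 1625060460
            }
          }'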

User16826992666
by Valued Contributor
  • 1197 Views
  • 1 reply
  • 0 kudos

How do I know if the number of files are causing performance issues?

I have read and heard that having too many small files can cause performance problems when reading large data sets. But how do I know if that is an issue I am facing?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

The Databricks SQL endpoint has a query history section which provides additional information to debug and tune queries. One such metric under execution details is the number of files read. For ETL/data science workloads, you could use the Spark UI of the ...
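Another quick check is DESCRIBE DETAIL on the Delta table itself; a high numFiles relative to sizeInBytes points to a small-file problem. A minimal sketch with a hypothetical table name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # DESCRIBE DETAIL reports file-level statistics for a Delta table.
    detail = spark.sql("DESCRIBE DETAIL my_delta_table")
    detail.select("numFiles", "sizeInBytes").show()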

User16765131552
by Contributor III
  • 1914 Views
  • 1 reply
  • 1 kudos

Displaying Spark job progress in a dashboard

In Databricks, is there a way to display the Spark job progress in a dashboard? I have a simple dashboard that displays a table, but the main Spark job behind it takes 15 minutes to run. Is there a way to show the Spark job progress bar in a dashboard?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 1 kudos

The best way to do so would be to collect data about the job run using the REST API (the runs get endpoint), which returns the run's current state and related metadata. You may need to use other endpoints to get the job or run ids in order to get the correct in...
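A minimal polling sketch against the runs get endpoint; the workspace URL, token, and run id are placeholders:

    import time
    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder

    # Poll the run state and surface it wherever the dashboard can read it.
    while True:
        resp = requests.get(
            f"{HOST}/api/2.0/jobs/runs/get",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"run_id": 12345},   # hypothetical run id
        )
        resp.raise_for_status()
        state = resp.json()["state"]
        print(state["life_cycle_state"], state.get("state_message", ""))
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            break
        time.sleep(30)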

User16826992666
by Valued Contributor
  • 1644 Views
  • 1 reply
  • 0 kudos

Resolved! When running a Merge, if records from the table are deleted are the underlying files that contain the records deleted as well?

I know I have the option to delete rows from a Delta table when running a merge. But I'm confused about how that would actually affect the files that contain the deleted records. Are those files deleted, or are they rewritten, or what?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta implements MERGE by physically rewriting existing files. It is implemented in two steps: first, perform an inner join between the target table and the source table to select all files that have matches; second, perform an outer join between the selected files in t...
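For instance, a merge that deletes matched rows rewrites the affected files without those rows and leaves the originals unreferenced; a minimal sketch with hypothetical table names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Matched rows are dropped by rewriting the files that contain them;
    # the old files remain on storage until vacuumed.
    spark.sql("""
        MERGE INTO target t
        USING deletions s
        ON t.id = s.id
        WHEN MATCHED THEN DELETE
    """)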

User16826992666
by Valued Contributor
  • 1176 Views
  • 1 reply
  • 0 kudos

Resolved! Are Delta tables able to support GDPR compliance?

I know that when deletes are made from a Delta table the underlying files are not actually removed. For compliance reasons I need to be able to truly delete the records. How can I know which files need to be removed, and is there a way to remove them ot...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Here is a document explaining best practices for GDPR and CCPA compliance using Delta Lake. Specifically, on cleaning up stale data, you can use the VACUUM command to remove files that are no longer referenced by a Delta table and are older than a s...
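A minimal sketch of that cleanup step, with a hypothetical table name and a 7-day retention window:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Physically remove files that the Delta log no longer references and
    # that are older than the retention window, completing the hard delete.
    spark.sql("VACUUM my_delta_table RETAIN 168 HOURS")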

