Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

tourist_on_road
by New Contributor
  • 6971 Views
  • 1 replies
  • 0 kudos

How to read binary data in pyspark

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark. from io import StringIO import array img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106) def mapper(featur...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @tourist_on_road, please go through the below Spark docs: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
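For reference, a minimal PySpark sketch of that approach, assuming (as in the question) fixed-length records of 4106 bytes, i.e. a 10-character product id followed by 4096 one-byte feature values:

import array

records = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def parse_record(raw):
    asin = raw[:10].decode("utf-8")        # product id
    features = array.array("b", raw[10:])  # remaining 4096 bytes as signed integers
    return asin, list(features)

parsed = records.map(parse_record)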

MikeK_
by New Contributor II
  • 15166 Views
  • 1 replies
  • 0 kudos

Resolved! SQL variables in a notebook

Hi, In an SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out how to set values and how to get the value. SET my_val=10; //saves the value 10 for key my_val SET my_val; //dis...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mike K., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP
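As a hedged sketch of that widget-based approach (the widget name is just illustrative): define the widget from a Python cell, then reference it from SQL via getArgument.

dbutils.widgets.text("my_val", "10")   # create a notebook widget with a default value
value = dbutils.widgets.get("my_val")  # read it back in Python
# In a %sql cell the same widget can be referenced, e.g.:
#   SELECT * FROM my_table WHERE id = getArgument("my_val")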

kruhly
by New Contributor II
  • 38441 Views
  • 12 replies
  • 0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns or the column names may not be known. A simple example below: llist = [(...

Latest Reply
TejuNC
New Contributor II
  • 0 kudos

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this: SELECT * FROM a JOIN b ON joinExprs. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate you c...
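As a hedged illustration (dataframe and key names are placeholders), joining on the column name keeps a single copy of the key, while the explicit-expression form keeps both and lets you drop one side:

joined = df_a.join(df_b, on="id", how="inner")   # "id" appears only once in the result
joined_alt = df_a.join(df_b, df_a["id"] == df_b["id"]).drop(df_b["id"])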

11 More Replies
Pierrek20
by New Contributor
  • 16497 Views
  • 2 replies
  • 0 kudos

How to loop over a Spark dataframe with Scala?

Hello! I'm a rookie to Spark Scala; here is my problem (thanks in advance for your help). My input dataframe looks like this: index bucket time ap station rssi 0 1 00:00 1 1 -84.0 1 1 00:00 1 3 -67.0 2 1 00:00 1 4 -82.0 3 1 00:00 1 2 -68.0 4 1 00:00...

Latest Reply
Eve
New Contributor III
  • 0 kudos

Looping is not always necessary; I always use this foreach method, something like the following: aps.collect().foreach(row => <do something>)
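For completeness, the PySpark equivalent of that Scala snippet (collect() pulls every row to the driver, so this only suits small dataframes; column names taken from the question):

for row in aps.collect():
    print(row["station"], row["rssi"])   # replace with whatever per-row work is needed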

1 More Replies
1stcommander
by New Contributor II
  • 9994 Views
  • 2 replies
  • 0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this: root |----------------- day1 |----------------- day2 |----------------- day3 Is it possible to create a structure like to foll...

Latest Reply
Saphira
New Contributor II
  • 0 kudos

Hey @1stcommander, you'll have to create those columns yourself. If it's something you will have to do often, you could always write a function. In any case, imho it's not that much work. I'm not sure what your problem is with the partition pruning. It...
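A hedged sketch of deriving those columns before writing (the date column name is a placeholder):

from pyspark.sql.functions import year, month, dayofmonth

(df
 .withColumn("year", year("event_date"))
 .withColumn("month", month("event_date"))
 .withColumn("day", dayofmonth("event_date"))
 .write
 .partitionBy("year", "month", "day")     # produces nested year=/month=/day= folders
 .parquet("/mnt/output/events"))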

1 More Replies
paourissi
by New Contributor
  • 10882 Views
  • 2 replies
  • 1 kudos

When to persist and when to unpersist RDD in Spark

Let's say I have the following: val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.map(.....) 1) If you do a transformation on the dataset2 then you have to persist it and pass it to dataset3 and unpersist ...

Latest Reply
Arun_KumarPT
New Contributor II
  • 1 kudos

It is well documented here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
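In short, the pattern described there looks roughly like the following hedged sketch (the transformations are placeholders): persist the shared parent once, reuse it from several children, then unpersist when it is no longer needed.

from pyspark import StorageLevel

dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)  # cache the shared parent
dataset3 = dataset2.map(lambda x: transform_a(x))          # placeholder transformation
dataset4 = dataset2.filter(lambda x: keep(x))              # placeholder filter
dataset3.count()                                           # both actions reuse the cached data
dataset4.count()
dataset2.unpersist()                                       # release the cache afterwards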

1 More Replies
AnandJ_Kadhi
by New Contributor II
  • 7098 Views
  • 2 replies
  • 1 kudos

Handle comma inside cell of CSV

We are using spark-csv_2.10 (version 1.5.0) and reading a csv file with a column that contains a comma "," as one of its characters. The problem we are facing is that it treats the rest of the line after the comma as a new column and the data is not interpre...

Latest Reply
User16857282152
Contributor
  • 1 kudos

Take a look here for options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv If a csv file has commas, then the convention is to quote the string that contains the comma. In ...
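A hedged PySpark sketch (file path and header option are assumptions); with quoted fields the embedded comma stays inside a single column:

df = (spark.read
      .option("header", "true")
      .option("quote", '"')    # fields containing commas are wrapped in quotes
      .option("escape", '"')   # doubled quotes inside a field are kept as literal quotes
      .csv("/mnt/data/input.csv"))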

1 More Replies
SwapanSwapandee
by New Contributor II
  • 9137 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using script for CDC Merge in spark streaming. I wish to pass column values in selectExpr through a parameter as column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below? keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
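The same idea in PySpark, hedged (the dataframe is a placeholder): keep the expressions in lists and unpack them into selectExpr, so the column list can be passed in as a parameter per table.

key_cols = ["col1", "col2"]
struct_cols = ["struct(offset, KAFKA_TS) as otherCols"]
result = df.selectExpr(*(key_cols + struct_cols))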

1 More Replies
CaioIshizaka_Co
by New Contributor
  • 5990 Views
  • 1 replies
  • 0 kudos

Making HTTP post requests on Spark using foreachPartition

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I reparti...

Latest Reply
melo08
New Contributor II
  • 0 kudos

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I repa...
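A hedged sketch of the pattern the question describes (the endpoint URL and partition count are placeholders, and the requests library is assumed to be available on the workers): repartition so each partition holds at most the desired batch size, then issue one POST per partition.

import requests

def post_partition(rows):
    payload = [row.asDict() for row in rows]   # materialise this partition's rows
    if payload:
        requests.post("https://example.com/ingest", json=payload, timeout=30)

df.repartition(200).foreachPartition(post_partition)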

rba76
by New Contributor
  • 21159 Views
  • 2 replies
  • 0 kudos

Python spark.read.text Path does not exist

Dear all, I want to read files with python from a storage account. I followed this instruction https://docs.microsoft.com/en-us/azure/azure-databricks/store-secrets-azure-key-vault. This is my python code: dbutils.fs.mount(source = "wasbs://contain...

Latest Reply
PRADEEPCHEEKATL
New Contributor II
  • 0 kudos

@rba76 Make sure the helloworld.txt file exists in the container1 folder. I'm able to view the text file using the same commands, as follows. Mount Blob Storage: dbutils.fs.mount( source = "wasbs://sampledata@azure.blob.core.windows.net/Azure", mount_po...
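A hedged sketch of that mount-and-read flow (storage account, container, secret scope, and key names are all placeholders):

dbutils.fs.mount(
    source="wasbs://container1@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/container1",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")})

df = spark.read.text("/mnt/container1/helloworld.txt")
df.show()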

1 More Replies
desai_n_3
by New Contributor II
  • 17023 Views
  • 6 replies
  • 0 kudos

Cannot Convert Column to Bool Error - when converting a dataframe column from string to date type in Python

Hi All, I am trying to convert a dataframe column which is in the format of string to date type format yyyy-MM-DD? I have written a sql query and stored it in dataframe. df3 = sqlContext.sql(sqlString2) df3.withColumn(df3['CalDay'],pd.to_datetime(df...

Latest Reply
JoshuaJames
New Contributor II
  • 0 kudos

Registered to post this, so forgive the formatting nightmare. This is a Python Databricks script function that allows you to convert from string to datetime or date, utilising coalesce: from pyspark.sql.functions import coalesce, to_date def to_dat...
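A hedged reconstruction of that helper (the format list is an assumption): try each format with to_date and keep the first one that parses via coalesce.

from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("yyyy-MM-dd", "MM/dd/yyyy")):
    # returns the first non-null parse across the candidate formats
    return coalesce(*[to_date(col, f) for f in formats])

df3 = df3.withColumn("CalDay", to_date_(df3["CalDay"]))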

5 More Replies
dbansal
by New Contributor
  • 15808 Views
  • 1 replies
  • 0 kudos

How can I add jars ("spark.jars") to pyspark notebook?

I want to add a few custom jars to the spark conf. Typically they would be submitted along with the spark-submit command but in Databricks notebook, the spark session is already initialized. So, I want to set the jars in "spark.jars" property in the...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @dbansal, install the libraries/jars while initialising the cluster. Please go through the documentation on the same below: https://docs.databricks.com/libraries.html#upload-a-jar-python-egg-or-python-wheel

asher
by New Contributor II
  • 9789 Views
  • 1 replies
  • 0 kudos

List all files in a Blob Container

I am trying to find a way to list all files, and related file sizes, in all folders and all sub folders. I guess these are called blobs, in the Databricks world. Anyway, I can easily list all files, and related file sizes, in one single folder, but ...

Latest Reply
asher
New Contributor II
  • 0 kudos

from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='your_acct_name', account_key='your_acct_key')
mylist = []
generator = block_blob_service.list_blobs('rawdata')
for blob in generator:
    mylist.append(...
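A hedged completion of that snippet using the same legacy azure-storage SDK (container name and credentials are placeholders): list_blobs returns every blob in the container, including those under "subfolder" prefixes, so collecting name and size covers all folders and subfolders.

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='your_acct_name', account_key='your_acct_key')
files = []
for blob in block_blob_service.list_blobs('rawdata'):
    files.append((blob.name, blob.properties.content_length))  # full path and size in bytes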

ammobear
by New Contributor III
  • 81433 Views
  • 11 replies
  • 6 kudos

Resolved! How do I get the current cluster id?

I am adding Application Insights telemetry to my Databricks jobs and would like to include the cluster ID of the job run. How can I access the cluster id at run time? The requirement is that my job can programmatically retrieve the cluster id to in...

Latest Reply
EricBellet
New Contributor III
  • 6 kudos

I fixed it; it should be "'$DB_CLUSTER_ID'"
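For retrieving the cluster id from notebook code (rather than from an init-script environment variable), one commonly used option is the cluster usage tag exposed through the Spark conf; a hedged sketch:

cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
print(cluster_id)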

10 More Replies
LaurentThiebaud
by New Contributor
  • 6852 Views
  • 1 replies
  • 0 kudos

Sort within a groupBy with dataframe

Using a Spark DataFrame, e.g. myDf .filter(col("timestamp").gt(15000)) .groupBy("groupingKey") .agg(collect_list("aDoubleValue")), I want the collect_list to return the result, but ordered according to "timestamp", i.e. I want the GroupBy results...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Laurent Thiebaud, please use the below format to sort within a groupBy: import org.apache.spark.sql.functions._ df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
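The PySpark equivalent, hedged, with the column names from the question: wrapping each value in a struct keyed by timestamp makes sort_array order the collected list by timestamp.

from pyspark.sql.functions import col, collect_list, sort_array, struct

result = (myDf
          .filter(col("timestamp") > 15000)
          .groupBy("groupingKey")
          .agg(sort_array(collect_list(struct("timestamp", "aDoubleValue"))).alias("sorted_values")))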

