Data Engineering

Forum Posts

AlexRomano
by New Contributor
  • 5411 Views
  • 1 reply
  • 0 kudos

PicklingError: Could not pickle the task to send it to the workers.

I am using sklearn in a Databricks notebook to fit an estimator in parallel. Sklearn uses joblib with the loky backend to do this. Now, I have a file in Databricks which I can import my custom Classifier from, and everything works fine. However, if I lite...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi aromano, I know this issue was opened almost a year ago, but I faced the same problem and I was able to solve it. So, I'm sharing the solution in order to help others. Probably, you're using SparkTrials to optimize the model's hyperparameters ...
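
A minimal sketch of the kind of change this reply seems to describe, assuming the pickling error comes from hyperopt's SparkTrials trying to ship a notebook-defined estimator to Spark workers; swapping in the driver-local Trials avoids that serialization. The dataset, estimator, and search space below are illustrative only:

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # toy data, just for illustration

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]))
    score = cross_val_score(clf, X, y, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}

best = fmin(
    fn=objective,
    space={"n_estimators": hp.quniform("n_estimators", 10, 100, 10)},
    algo=tpe.suggest,
    max_evals=10,
    trials=Trials(),  # driver-local; SparkTrials() would pickle the objective and estimator to workers
)
```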

Mir_SakhawatHos
by New Contributor II
  • 28082 Views
  • 2 replies
  • 3 kudos

How can I delete folders from my DBFS?

I want to delete a folder I created in DBFS, but how? And how can I download files from there?

Latest Reply
IA
New Contributor II
  • 3 kudos

Hello, Max's answer focuses on the CLI. Instead, using the Community Edition platform, proceed as follows: # You must first delete all files in your folder. 1. import org.apache.hadoop.fs.{Path, FileSystem}  2. dbutils.fs.rm("/FileStore/tables/file.cs...
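
A minimal sketch of the dbutils approach this reply starts to describe; the paths below are placeholders, and the second argument to rm enables recursive deletion:

```python
# Hypothetical folder path; replace with your own DBFS location.
dbutils.fs.rm("/FileStore/tables/my_folder", True)  # True = recurse: delete the folder and its contents

# To get a file out of DBFS, one option is to copy it under /FileStore;
# files there can typically be downloaded via the workspace's /files/ URL.
dbutils.fs.cp("/FileStore/tables/my_folder/file.csv", "/FileStore/downloads/file.csv")
```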

1 More Replies
bhaumikg
by New Contributor II
  • 11988 Views
  • 7 replies
  • 2 kudos

Databricks throwing error "SQL DW failed to execute the JDBC query produced by the connector." while pushing the column with string length more than 255

I am using Databricks to transform the data and then pushing it into the data lake. The data gets pushed in if the length of the string field is 255 or less, but it throws the following error if it is longer than that: "SQL DW failed to execute the JDB...

Latest Reply
bhaumikg
New Contributor II
  • 2 kudos

As suggested by ZAIvR, please use append and provide maxlength while pushing the data. Overwrite may not work with this unless the Databricks team has fixed the issue.
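
A hedged sketch of that approach using the SQL DW connector's maxStrLength option; the URL, temp directory, and table name below are placeholders, not values from the thread:

```python
# All connection details and table names here are placeholders.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
   .option("tempDir", "wasbs://<container>@<account>.blob.core.windows.net/tempdir")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.my_table")
   .option("maxStrLength", "4000")   # widen string columns beyond the connector's default NVARCHAR length
   .mode("append")                   # append, as suggested; overwrite reportedly still failed
   .save())
```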

6 More Replies
Nik
by New Contributor III
  • 7308 Views
  • 19 replies
  • 0 kudos

write from a Dataframe to a CSV file, CSV file is blank

Hi, I am reading a text file from a blob: val sparkDF = spark.read.format(file_type) .option("header", "true") .option("inferSchema", "true") .option("delimiter", file_delimiter) .load(wasbs_string + "/" + PR_FileName) Then I test my Datafra...

Latest Reply
nl09
New Contributor II
  • 0 kudos

Create a temp folder inside the output folder, copy the part-00000* file into the output folder under the desired file name, then delete the temp folder. Python code snippet to do the same: fpath=output+'/'+'temp' def file_exists(path): try: dbutils.fs.ls(path) return...
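
A minimal, hedged sketch of that temp-folder pattern; the destination path, file name, and the sparkDF variable are placeholders following the question above:

```python
output = "/mnt/out/report"      # hypothetical destination folder
tmp = output + "/temp"

# Write to a temp folder as a single partition so only one part-00000 file is produced.
(sparkDF.coalesce(1)
        .write.option("header", "true")
        .mode("overwrite")
        .csv(tmp))

# Copy the part file to the output folder under a friendly name, then drop the temp folder.
part = [f.path for f in dbutils.fs.ls(tmp) if f.name.startswith("part-")][0]
dbutils.fs.cp(part, output + "/result.csv")
dbutils.fs.rm(tmp, True)        # recursive delete of the temp folder
```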

18 More Replies
pmezentsev
by New Contributor
  • 5672 Views
  • 7 replies
  • 0 kudos

Pyspark. How to get best params in grid search

Hello! I am using Spark 2.1.1 in Python (Python 2.7, executed in a Jupyter notebook) and trying to run a grid search for linear regression parameters. My code looks like this: from pyspark.ml.tuning import CrossValidator, ParamGridBuilder from pyspark.ml impo...
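
For reference, a hedged sketch of how best parameters are usually read back from a fitted cross-validation run; `paramGrid` (the list built with ParamGridBuilder) and `cvModel` (the result of CrossValidator(...).fit(...)) are assumed names, and argmin assumes a lower-is-better metric such as RMSE:

```python
import numpy as np

best_idx = int(np.argmin(cvModel.avgMetrics))   # avgMetrics has one entry per parameter combination
best_params = paramGrid[best_idx]               # the param map that produced the best metric

for param, value in best_params.items():
    print(param.name, "=", value)

best_model = cvModel.bestModel                  # the model refit with those parameters
```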

Latest Reply
phamyen
New Contributor II
  • 0 kudos

This is a great article. It gave me a lot of useful information. thank you very much download app

6 More Replies
BingQian
by New Contributor II
  • 9825 Views
  • 2 replies
  • 0 kudos

Resolved! Error of "name 'IntegerType' is not defined" in attempting to convert a DF column to IntegerType

initialDF .withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType)) Or initialDF .withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType())) However, it always failed with this error: NameError: name 'IntegerType' is not defined ...
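
For context, a minimal sketch of the usual fix: IntegerType lives in pyspark.sql.types and needs to be imported (and instantiated) before the cast:

```python
from pyspark.sql.types import IntegerType

fixedDF = initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType()))
# Equivalent shorthand that avoids the import entirely:
# fixedDF = initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast("int"))
```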

Latest Reply
BingQian
New Contributor II
  • 0 kudos

Thank you @Kristo Raun​  !

1 More Replies
prakharjain
by New Contributor
  • 12141 Views
  • 2 replies
  • 0 kudos

Resolved! I need to edit my parquet files and change field names, replacing spaces with underscores

Hello, I am facing the trouble described in the following Stack Overflow topics: https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv https://stackoverflow.com/questions/38191157/spark-...

Latest Reply
DimitriBlyumin
New Contributor III
  • 0 kudos

One option is to use something other than Spark to read the problematic file, e.g. Pandas, if your file is small enough to fit on the driver node (Pandas will only run on the driver). If you have multiple files - you can loop through them and fix on...
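
A hedged sketch of that Pandas workaround; the file paths are placeholders, and pandas.read_parquet requires pyarrow or fastparquet to be installed:

```python
import pandas as pd

pdf = pd.read_parquet("/dbfs/mnt/raw/problem_file.parquet")          # hypothetical input path
pdf.columns = [c.replace(" ", "_") for c in pdf.columns]             # space -> underscore in field names
pdf.to_parquet("/dbfs/mnt/clean/problem_file.parquet", index=False)  # hypothetical output path
```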

1 More Replies
ChristianHofste
by New Contributor II
  • 10449 Views
  • 1 reply
  • 0 kudos

Drop duplicates in Table

Hi, there is a function to delete data from a Delta Table: deltaTable = DeltaTable.forPath(spark, "/data/events/") deltaTable.delete(col("date") < "2017-01-01") But is there also a way to drop duplicates somehow? Like deltaTable.dropDuplicates()......

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Christian Hofstetter, you can check here for info on the same: https://docs.delta.io/0.4.0/delta-update.html#data-deduplication-when-writing-into-delta-tables
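
Following the linked page, a hedged sketch of the insert-only MERGE pattern that deduplicates while writing into a Delta table; the table path, the newEvents source DataFrame, and the eventId key column are placeholders:

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/events/")
(deltaTable.alias("events")
    .merge(newEvents.alias("updates"), "events.eventId = updates.eventId")  # assumed key column
    .whenNotMatchedInsertAll()   # insert only rows whose key is not already in the table
    .execute())
```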

JigaoLuo
by New Contributor
  • 4053 Views
  • 3 replies
  • 0 kudos

OPTIMIZE error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'OPTIMIZE'

Hi everyone. I am trying to learn the keyword OPTIMIZE from this blog using Scala: https://docs.databricks.com/delta/optimizations/optimization-examples.html#delta-lake-on-databricks-optimizations-scala-notebook. But my local Spark seems not able t...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi Jigao, OPTIMIZE isn't in the open source Delta API, so it won't run on your local Spark instance: https://docs.delta.io/latest/api/scala/io/delta/tables/index.html?search=optimize

2 More Replies
EricThomas
by New Contributor
  • 9470 Views
  • 2 replies
  • 0 kudos

!pip install vs. dbutils.library.installPyPI()

Hello, Scenario: Trying to install some python modules into a notebook (scoped to just the notebook) using...``` dbutils.library.installPyPI("azure-identity") dbutils.library.installPyPI("azure-storage-blob") dbutils.library.restartPython()``` ...ge...

Latest Reply
eishbis
New Contributor II
  • 0 kudos

Hi @ericOnline, I also faced the same issue and I eventually found that upgrading the Databricks runtime version from my current "5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)" to "6.5 (Scala 2.11, Spark 2.4.5)" resolved this issue. Though the offic...

1 More Replies
RaghuMundru
by New Contributor III
  • 26289 Views
  • 15 replies
  • 0 kudos

Resolved! I am running simple count and I am getting an error

Here is the error that I am getting when I run the following query: statement=sqlContext.sql("SELECT count(*) FROM ARDATA_2015_09_01").show() --------------------------------------------------------------------------- Py4JJavaError Traceback (most rec...

Latest Reply
muchave
New Contributor II
  • 0 kudos

192.168.o.1 is a private IP address used to login the admin panel of a router. 192.168.l.l is the host address to change default router settings.

14 More Replies
Anbazhagananbut
by New Contributor II
  • 5851 Views
  • 1 reply
  • 0 kudos

Get Size of a column in Bytes for a Pyspark Data frame

Hello All, I have a column in a dataframe which is of struct type. I want to find the size of the column in bytes. It is failing while loading into Snowflake. I could see size functions available to get the length. How to calculate the size in bytes fo...

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

There isn't one size for a column; it takes some amount of bytes in memory, but a different amount potentially when serialized on disk or stored in Parquet. You can work out the size in memory from its data type; an array of 100 bytes takes 100 byte...
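
As a rough, hedged illustration of the "work it out" advice above, one way to gauge a per-row serialized size is to measure the byte length of the struct rendered as JSON. The column name structCol is a placeholder, this assumes a Spark runtime with the octet_length SQL function, and the result approximates a serialized footprint rather than the in-memory or Parquet size:

```python
from pyspark.sql import functions as F

# Approximate per-row byte size of the struct column when serialized as JSON.
sized = df.withColumn("approx_bytes", F.expr("octet_length(to_json(structCol))"))
sized.agg(F.sum("approx_bytes").alias("total_approx_bytes")).show()
```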

ubsingh
by New Contributor II
  • 8841 Views
  • 3 replies
  • 1 kudos
Latest Reply
ubsingh
New Contributor II
  • 1 kudos

Thanks for your help @leedabee. I will go through the second option; the first one is not applicable in my case.

2 More Replies
Anbazhagananbut
by New Contributor II
  • 7691 Views
  • 1 reply
  • 1 kudos

How to handle Blank values in Array of struct elements in pyspark

Hello All, we have data in a column of a PySpark dataframe with array of struct type having multiple nested fields present. If the value is not blank it will save the data in the same array of struct type in a Spark Delta table. Please advise on the bel...

Latest Reply
shyam_9
Valued Contributor
  • 1 kudos

Hi @Anbazhagan anbutech17, can you please try as in the below answer: https://stackoverflow.com/questions/56942683/how-to-add-null-columns-to-complex-array-struct-in-spark-with-a-udf

Juan_MiguelTrin
by New Contributor
  • 6172 Views
  • 1 reply
  • 0 kudos

How to resolve out of memory error?

I have a Databricks notebook hosted on Azure. I am having this problem when doing an INNER JOIN. I tried creating a much higher cluster configuration but it is still throwing an OutOfMemoryError. org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquir...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Juan Miguel Trinidad, can you please check the below suggestions: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-td16773.html
