Data Engineering

Forum Posts

by MoJaMa • Databricks Employee

06-17-2021 6:27:11 PM

1147 Views
1 reply
0 kudos

Does Databricks support a Centralized Feature Store?

Latest Reply

MoJaMa
Databricks Employee

06-17-2021 6:29:49 PM

0 kudos

Most likely in the future, as we progress down our roadmap. Currently it is per-workspace and only accessible in Databricks notebooks/jobs. Please refer to our docs: https://docs.databricks.com/applications/machine-learning/feature-store.html#known-limita...

by MoJaMa • Databricks Employee

06-17-2021 6:10:39 PM

1522 Views
1 reply
0 kudos

What is this Photon Engine I keep hearing about?

Latest Reply

MoJaMa
Databricks Employee

06-17-2021 6:13:51 PM

0 kudos

It's our new high-performance runtime, using a native vectorized engine developed in C++. Please see our blog for a great overview: https://databricks.com/blog/2021/06/17/announcing-photon-public-preview-the-next-generation-query-engine-on-the-databri...

by MoJaMa • Databricks Employee

06-17-2021 5:57:24 PM

1127 Views
1 reply
0 kudos

Does Databricks still require CreateKeyPair and DeleteKeyPair permissions in the cross-account IAM roles on AWS?

Latest Reply

MoJaMa
Databricks Employee

06-17-2021 5:58:56 PM

0 kudos

We used to require this, but starting June 9, 2021 we no longer do, and have improved our E2 security posture. See https://docs.databricks.com/administration-guide/account-api/iam-role.html for the current permissions required.

by MoJaMa • Databricks Employee

06-17-2021 5:55:01 PM

1489 Views
1 reply
1 kudos

How can I understand Photon pricing for Data Engineering and estimate costs for a workload?

Latest Reply

MoJaMa
Databricks Employee

06-17-2021 5:55:44 PM

1 kudos

Use the calculator here: https://databricks.com/product/aws-pricing/instance-types
Open two windows side by side, pick Photon and non-Photon instances of the same type, and compare.

by sajith_appukutt • Databricks Employee

06-08-2021 10:22:38 PM

2749 Views
1 reply
1 kudos

Resolved! Are there any ways to automatically clean up temporary files created in S3 by the Amazon Redshift connector?

The Amazon Redshift data source in Databricks seems to use S3 for storing intermediate results. Are there any ways to automatically clean up the temporary files created in S3?

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 5:29:49 PM

1 kudos

You could use a storage lifecycle policy on the S3 bucket that holds the intermediate results and configure expiration actions. This way, temporary/intermediate results are cleaned up automatically.
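
As a rough illustration of the suggestion above, here is a minimal sketch that builds a lifecycle configuration with a single expiration rule. The prefix `redshift-temp/`, the 7-day retention, the bucket name, and the helper `build_expiration_rule` are all assumptions made for the example. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, and the call itself is left commented out because it needs boto3 and AWS credentials.

```python
import json


def build_expiration_rule(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix` after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                # Rule ID derived from the prefix; any unique string would do.
                "ID": "expire-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical prefix and retention period for the connector's temp directory.
config = build_expiration_rule("redshift-temp/", 7)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the rule could be applied like this:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-temp-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```

Pick a retention period comfortably longer than your longest-running Redshift read or write, so files are not expired while a job still needs them.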

by User16752246553 • Databricks Employee

06-10-2021 10:57:58 AM

1787 Views
1 reply
1 kudos

How does Vectorized Pandas UDF work?

Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel? And is there a way to set the batch size?

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 5:23:35 PM

1 kudos

> How does Vectorized Pandas UDF work?

Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123. They use Apache Arrow to exchange data directly between JVM and Python driver/executors wit...

by User16826992666 • Databricks Employee

06-16-2021 8:34:52 PM

3205 Views
1 reply
0 kudos

Resolved! What is the difference between a trigger once stream and a normal one time write?

It seems to me like both of these would accomplish the same thing in the end. Do they use different mechanisms to accomplish it though? Are there any hidden costs to streaming to consider?

Latest Reply

Ryan_Chynoweth
Databricks Employee

06-17-2021 5:03:45 PM

0 kudos

The biggest reason to use the streaming API over the non-streaming API is the checkpoint, which maintains a log of what has already been processed. It is most common for people to use the trigger once when they want to only process the changes between executions...
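
To make the checkpoint idea concrete, here is a plain-Python simulation, not the Structured Streaming API. The function `run_once`, the JSON checkpoint file, and the file names are all hypothetical. Each run reads the checkpoint, handles only the files it has not seen, and records them, which is roughly the bookkeeping a trigger-once stream does for you and a plain batch write does not.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def run_once(source_dir: Path, checkpoint: Path) -> List[str]:
    """Process only files not yet recorded in the checkpoint, then update the checkpoint."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    new_files = sorted(
        p.name for p in source_dir.iterdir() if p.is_file() and p.name not in done
    )
    # A real job would transform and write the new files here.
    checkpoint.write_text(json.dumps(sorted(done | set(new_files))))
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    checkpoint = Path(tmp) / "checkpoint.json"

    (source / "a.csv").write_text("1\n")
    first = run_once(source, checkpoint)   # picks up a.csv

    (source / "b.csv").write_text("2\n")
    second = run_once(source, checkpoint)  # picks up only b.csv

    third = run_once(source, checkpoint)   # nothing new to process

print(first, second, third)
```

In PySpark the equivalent pieces are the `checkpointLocation` option on `writeStream` and `trigger(once=True)`. A normal one-time write has no such log, so you would have to track processed input yourself.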

by User16752240150 • Databricks Employee

06-04-2021 12:34:03 PM

2157 Views
1 reply
0 kudos

What's the best way to use hyperopt to train a spark.ml model and track automatically with mlflow?

I've read this article, which covers:
Using CrossValidator or TrainValidationSplit to track hyperparameter tuning (no hyperopt). Only random/grid search
parallel "single-machine" model training with hyperopt using hyperopt.SparkTrials (not spark.ml)
"Di...

Latest Reply

sean_owen
Databricks Employee

06-17-2021 5:00:45 PM

0 kudos

It's actually pretty simple: use hyperopt, but use "Trials" not "SparkTrials". You get parallelism from Spark, not from the tuning process.

by User16826992666 • Databricks Employee

06-16-2021 8:57:38 PM

1860 Views
1 reply
0 kudos

Resolved! When should I create a Bloom Filter Index on my Delta table?

Latest Reply

Ryan_Chynoweth
Databricks Employee

06-17-2021 5:00:40 PM

0 kudos

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. The Bloom filter operates by either stating that data is definitively not in the file, or that i...
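
For intuition about the "definitively not in the file" behavior, here is a toy Bloom filter in plain Python. It is only a sketch of the general data structure, not how Delta implements its index. The class name, the bit-array size, the number of hash functions, and the sample values are assumptions chosen for the example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


# Hypothetical column values stored in one data file.
values_in_file = ["user-123", "user-456", "user-789"]
index = BloomFilter()
for v in values_in_file:
    index.add(v)

print(index.might_contain("user-123"))  # True: every added value is reported as possibly present
print(index.might_contain("user-000"))  # most likely False, but a false positive is possible
```

A query engine can skip a file whenever the filter answers "definitely absent" for the value being looked up, which is why this tends to pay off for selective equality lookups on high-cardinality columns.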

by User16826994223 • Databricks Employee

06-17-2021 12:16:14 AM

1573 Views
1 reply
0 kudos

Delta concurrent write issue

What is the concurrency issue in Delta? If we try to write to the same Delta table at the same time, it sometimes fails. How can we mitigate that?

Latest Reply

Ryan_Chynoweth
Databricks Employee

06-17-2021 4:57:54 PM

0 kudos

Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes.
Read: Reads (if needed) the latest available version of the table to identify which files need to be modified (that is, rewritten).
Write: Stages all th...
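
The general pattern can be sketched with a toy in-memory table. This is a simplified model of optimistic concurrency control, not Delta Lake's actual commit protocol: `ToyTable`, `ConflictError`, `write_with_retry`, and the file names are hypothetical, and the commit-time check (reject a write if a later commit touched the same files) is an assumption typical of such schemes.

```python
from typing import Callable, List, Set


class ConflictError(Exception):
    """Raised when a commit overlaps with files changed since the writer's read."""


class ToyTable:
    def __init__(self) -> None:
        # commits[i] holds the files changed by version i + 1.
        self.commits: List[Set[str]] = []

    @property
    def version(self) -> int:
        return len(self.commits)

    def commit(self, read_version: int, touched: Set[str]) -> int:
        # Validate: fail if any commit after our read changed the same files.
        for later in self.commits[read_version:]:
            if later & touched:
                raise ConflictError(f"files changed since version {read_version}")
        self.commits.append(set(touched))
        return self.version


def write_with_retry(
    table: ToyTable, plan: Callable[[int], Set[str]], max_attempts: int = 3
) -> int:
    """Re-read the latest version and retry when a commit conflicts."""
    for _ in range(max_attempts):
        read_version = table.version
        try:
            return table.commit(read_version, plan(read_version))
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")


table = ToyTable()
stale = table.version                        # several writers read version 0
table.commit(stale, {"part-a.parquet"})      # first writer commits version 1
table.commit(stale, {"part-b.parquet"})      # disjoint files, so version 2 succeeds
try:
    table.commit(stale, {"part-a.parquet"})  # overlaps with version 1
    conflicted = False
except ConflictError:
    conflicted = True

final_version = write_with_retry(table, lambda version: {"part-a.parquet"})
print(conflicted, final_version)
```

The two mitigations the sketch hints at are retrying the failed write against the latest version and arranging concurrent writers so they touch disjoint files, for example by partition.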

by sajith_appukutt • Databricks Employee

06-11-2021 2:37:20 PM

1532 Views
1 reply
1 kudos

Resolved! What is the list of connections that I need to open up in my hub inspection firewall for Databricks to work in AWS?

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 4:54:14 PM

1 kudos

You'd need to open connections to:
Databricks web application
Databricks secure cluster connectivity (SCC) relay
AWS S3 global URL
AWS S3 regional URL
AWS STS global URL
AWS STS regional URL
AWS Kinesis regional URL
Table metastore RDS regional URL (by data ...

by Anonymous • Not applicable

06-04-2021 12:42:20 PM

1681 Views
2 replies
0 kudos

Resolved! Collaborative features

What do you mean by collaborative data science? What collaboration features do you support?

Latest Reply

sean_owen
Databricks Employee

06-17-2021 4:50:55 PM

0 kudos

This primarily refers to the fact that notebooks can be shared to the whole org, to groups, to users, and can be limited to read/write/execute. You could argue that MLflow is also a form of collaboration, where multiple users can share an experiment ...

by Srikanth_Gupta_ • Databricks Employee

06-17-2021 10:22:03 AM

3183 Views
2 replies
0 kudos

What are the best instance types for using Delta Lake on AWS, Azure and GCP?

Are there any recommendations for the best instance types to use with Delta? Example: i3.xlarge vs m5.2xlarge vs D3v2

Latest Reply

Mooune_DBU
Databricks Employee

06-17-2021 4:47:35 PM

0 kudos

Depending on your queries, if you're looking for Delta Cache Optimized instances, here's the list per provider:
AWS: i3.* (e.g. i3.xlarge)
Azure: Ls-types (e.g. L4sv2)
GCP: n2-highmem-*

by User16790091296 • Databricks Employee

06-04-2021 11:42:27 AM

2635 Views
1 reply
0 kudos

Why don't high concurrency clusters support Scala?

Latest Reply

sean_owen
Databricks Employee

06-17-2021 4:45:25 PM

0 kudos

Broadly, it's because high-concurrency clusters have to have much more control of user workloads in order to enforce resource sharing constraints. Scala is the lowest-level language you can access in Databricks, as you execute directly in the JVM, and...

by Anonymous • Not applicable

06-07-2021 2:53:55 PM

9442 Views
2 replies
0 kudos

Resolved! Feature Store Error message: ModuleNotFoundError: No module named 'databricks.feature_store'

How do I fix this?

Latest Reply

sean_owen
Databricks Employee

06-17-2021 4:31:06 PM

0 kudos

Use Databricks runtime 8.3 ML or later.

Databricks Community
