Data Engineering

Forum Posts

by Anonymous • Not applicable

06-17-2021 11:59:07 AM

4698 Views
2 replies
1 kudos

Resolved! Any recommendations to reduce the cluster spin up time?

Latest Reply

Srikanth_Gupta_
Valued Contributor

06-17-2021 1:15:48 PM

1 kudos

Databricks pools may help reduce cluster start-up time; see the Databricks pools documentation for details.

by Anonymous • Not applicable

06-17-2021 11:44:24 AM

8599 Views
1 replies
0 kudos

Resolved! Is there a way to list the packages and versions installed on a cluster?

Latest Reply

Mooune_DBU
Valued Contributor

06-17-2021 12:58:24 PM

0 kudos

If you're looking for Python packages, you can use the standard conda list command in a notebook cell:

%conda list
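If %conda is not available in your runtime (an assumption on my part, since it depends on how the cluster is set up), a plain-Python alternative is to ask the interpreter itself. This is only a sketch using the standard library's importlib.metadata; the helper name installed_packages is made up for illustration, and it reports what the notebook's Python process (the driver) can see, not necessarily every node.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return (name, version) pairs for every distribution this interpreter can see."""
    found = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            found[name.lower()] = (name, dist.version)
    # Sort case-insensitively by package name.
    return [found[key] for key in sorted(found)]


for name, version in installed_packages():
    print(f"{name}=={version}")
```

The output has the same name==version shape as a requirements file, so it is easy to diff between clusters.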

by User16826992666 • Valued Contributor

06-16-2021 3:28:54 PM

687 Views
1 replies
0 kudos

Resolved! Is there any file size overhead when I save models using MLflow?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 12:56:25 PM

0 kudos

There shouldn't be. Generally speaking, models are serialized in their 'native' format for well-known libraries like TensorFlow, XGBoost, sklearn, etc. Custom models are saved with pickle. The files exist on distributed storage as ar...

by User16826992666 • Valued Contributor

06-16-2021 7:44:30 PM

766 Views
1 replies
0 kudos

Resolved! What is the point of the model staging and promotion functions in MLflow?

Why not just directly deploy the model where you need it in production?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 12:55:01 PM

0 kudos

The Model Registry is mostly a workflow tool. It helps 'gate' the process, so that (for example) only authorized users can set a model to be the newest Production version of a model - that's not something just anyone should be able to do! The Registry...

by User16826992666 • Valued Contributor

06-16-2021 8:52:14 PM

1188 Views
1 replies
0 kudos

Resolved! Should I use Z Ordering on my Delta table every time I run Optimize?

I'm wondering whether it always makes sense, or whether there are situations where you might only want to run OPTIMIZE.

Latest Reply

Srikanth_Gupta_
Valued Contributor

06-17-2021 12:47:45 PM

0 kudos

It's a good idea to run OPTIMIZE at the end of each batch job to avoid a small-files situation. Z-ordering is optional and can be applied to a few non-partition columns that are used frequently in read operations. ZORDER BY -> Colocate column information in the sam...
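To make the two variants concrete, here is a small sketch that builds the statement a batch job might run last. The helper build_optimize and the table and column names are hypothetical; the only thing taken from Delta itself is the OPTIMIZE ... ZORDER BY (...) statement shape. In a notebook you would pass the resulting string to spark.sql, which this sketch does not do so that it runs anywhere.

```python
import re

# Plain or dotted identifiers only, so nothing unexpected ends up in the SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def build_optimize(table, zorder_cols=()):
    """Build an OPTIMIZE statement, adding ZORDER BY only when columns are given."""
    for name in (table, *zorder_cols):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    statement = f"OPTIMIZE {table}"
    if zorder_cols:
        statement += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return statement


# Compaction only (hypothetical table name):
print(build_optimize("sales.events"))
# Compaction plus Z-ordering on two frequently filtered, non-partition columns:
print(build_optimize("sales.events", ["user_id", "event_type"]))
# In a Databricks notebook (assumption): spark.sql(build_optimize(...))
```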

by Anonymous • Not applicable

06-17-2021 11:34:08 AM

728 Views
1 replies
0 kudos

Resolved! Is it possible to write a single Spark stream to two different Delta tables? Any recommendations around that?

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-17-2021 12:36:20 PM

0 kudos

In this scenario, the best option would be to have a single readStream reading a source Delta table. Since checkpoint logs are controlled when writing to Delta tables, you would be able to maintain separate logs for each of your writeStreams. I would...

by User16826992666 • Valued Contributor

06-16-2021 9:00:14 PM

1131 Views
1 replies
1 kudos

Resolved! If someone saves a flat file in a Databricks notebook without specifying a location, where does it go?

I ran the code block below and now I can't find the file. Where would this get saved since no location was specified?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 11:28:19 AM

1 kudos

In this case, you are just writing a file to the working directory of the driver process. It'll be under something like /home/[user] on the local file system. Anything you write to "/dbfs/..." however goes to distributed storage, even though it looks...
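A quick way to see this for yourself is to ask Python where a relative path resolves. The snippet below is only an illustration, not the code from the question: the file name example.csv is made up, and a temporary directory stands in for the driver's working directory so nothing is left behind.

```python
import os
import tempfile

original_cwd = os.getcwd()

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)  # stand-in for the driver's working directory
    try:
        # No directory given, so the file lands in the current working directory.
        with open("example.csv", "w") as handle:
            handle.write("id,value\n1,a\n")
        written_to = os.path.abspath("example.csv")
        file_existed = os.path.exists(written_to)
        landed_in_cwd = os.path.dirname(written_to) == os.getcwd()
        print("working directory:", os.getcwd())
        print("file written to:  ", written_to)
    finally:
        os.chdir(original_cwd)  # restore before the scratch directory is removed
```

Running os.getcwd() and os.path.abspath on the file name in the notebook that wrote the file should point you at the missing file.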

by User16826994223 • Honored Contributor III

06-17-2021 12:24:15 AM

617 Views
1 replies
0 kudos

Major changes in Spark 3.0

What are the major changes released in Spark 3.0?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 11:27:14 AM

0 kudos

Check out https://spark.apache.org/docs/latest/sql-migration-guide.html if you're looking for potentially breaking changes you need to be aware of, for any version. For a general overview of the new features, see https://databricks.com/blog/2020/06/18...

by User16857281869 • New Contributor II

06-17-2021 1:34:22 AM

705 Views
1 replies
0 kudos

How do I benefit from parallelisation when doing machine learning?

There are in principle four distinct ways of using parallelisation when doing machine learning. Any combination of these can speed up the whole pipeline significantly. 1) Using Spark distributed processing in feature engineering 2) When the data set...

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 11:25:11 AM

0 kudos

Good summary! Yes, those are the main strategies I can think of.

by User16826992666 • Valued Contributor

06-16-2021 4:08:38 PM

903 Views
2 replies
0 kudos

Do I have to run .cache() on my dataframe before returning aggregations like count?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 11:24:29 AM

0 kudos

You do not have to cache anything to make it work. You would decide that based on whether you want to spend memory/storage to avoid recomputing the DataFrame, like when you may use it in multiple operations afterwards.

by User16826992666 • Valued Contributor

06-17-2021 8:02:38 AM

1895 Views
1 replies
0 kudos

Resolved! What's the difference between SparkML and Spark MLlib?

I have heard people talk about SparkML, but the documentation talks about MLlib. I don't understand the difference. Could anyone help me understand this?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 11:23:47 AM

0 kudos

They're not really different. Before DataFrames in Spark, older implementations of ML algorithms were built on the RDD API. This is generally called "Spark MLlib". After DataFrames, some newer implementations were added as wrappers on top of the old ones ...

by sajith_appukutt • Honored Contributor II

06-11-2021 5:32:47 PM

2042 Views
1 replies
1 kudos

Resolved! How can I configure a custom DNS for my Databricks workspace to talk to my on-premises systems?

Latest Reply

sajith_appukutt
Honored Contributor II

06-17-2021 11:18:28 AM

1 kudos

You could set up dnsmasq to configure routing between your Databricks workspace and your on-premises network. More details here

by Anonymous • Not applicable

06-17-2021 9:16:41 AM

4470 Views
1 replies
0 kudos

Resolved! How can I download a file from DBFS to my local computer?

Latest Reply

sean_owen
Honored Contributor II

06-17-2021 11:18:00 AM

0 kudos

Use the Databricks CLI: https://docs.databricks.com/dev-tools/cli/index.html

databricks fs cp dbfs:/remote/path /local/path

by sajith_appukutt • Honored Contributor II

06-11-2021 2:45:35 PM

809 Views
1 replies
0 kudos

Resolved! How can I reduce the risk of data exfiltration while using Databricks?

Latest Reply

sajith_appukutt
Honored Contributor II

06-17-2021 10:08:14 AM

0 kudos

Databricks allows network customizations / hardening from a security point of view to reduce risks like data exfiltration. For more details, see "Data Exfiltration Protection With Databricks on AWS" and "Data Exfiltration Protection with Azure Databricks".

by User16826994223 • Honored Contributor III

06-17-2021 7:59:52 AM

638 Views
1 replies
0 kudos

Z ordering best practices

What are the best practices around Z-ordering? Should we include as many columns as possible in the Z-order, or is fewer better, and why?

Latest Reply

sajith_appukutt
Honored Contributor II

06-17-2021 10:01:41 AM

0 kudos

With Z-order and Hilbert curves, the effectiveness of clustering decreases with each column added, so you'd want to Z-order only the columns that you actually use, so that it speeds up your workloads.
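To see why each extra column dilutes the clustering, here is a toy model, not Delta's actual implementation. It assumes plain Morton (bit-interleaved) Z-ordering rather than Hilbert curves, a full grid of 4-bit values in every column, and 16 equally sized "files" cut from the sorted rows. It then counts how many files a lookup on the first column has to read when skipping on per-file min/max values. All names and sizes are made up for illustration.

```python
from itertools import product

BITS = 4        # each column takes the values 0..15
NUM_FILES = 16  # sorted rows are cut into this many equally sized "files"


def z_value(point, bits=BITS):
    """Interleave the bits of every coordinate into one Morton (Z-order) key."""
    num_cols = len(point)
    key = 0
    for bit in range(bits):
        for col, value in enumerate(point):
            key |= ((value >> bit) & 1) << (bit * num_cols + col)
    return key


def files_read_for_point_lookup(num_cols, wanted=5):
    """Count files whose min/max range on column 0 contains `wanted`."""
    rows = sorted(product(range(2 ** BITS), repeat=num_cols), key=z_value)
    per_file = len(rows) // NUM_FILES
    hits = 0
    for start in range(0, len(rows), per_file):
        first_col = [row[0] for row in rows[start:start + per_file]]
        if min(first_col) <= wanted <= max(first_col):
            hits += 1
    return hits


for num_cols in (1, 2, 4):
    print(f"{num_cols} Z-order column(s): read "
          f"{files_read_for_point_lookup(num_cols)} of {NUM_FILES} files")
```

In this toy setup the same lookup touches 1 file with one Z-order column, 4 with two, and 8 with four. The file count is fixed, so every added column takes away some of the splitting that the other columns would otherwise get.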
