Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826992666
by Valued Contributor
  • 9820 Views
  • 2 replies
  • 0 kudos

Why do Spark MLlib models only accept a vector column as input?

In other libraries I can just use the feature columns themselves as inputs. Why do I need to make a vector out of my features when I use MLlib?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Yeah, it's more a design choice. Rather than have every implementation take column(s) params, this is handled once in VectorAssembler for all of them. One way or the other, most implementations need a vector of inputs anyway. VectorAssembler can do s...

  • 0 kudos
1 More Replies
User16826992666
by Valued Contributor
  • 2430 Views
  • 1 replies
  • 0 kudos

Resolved! How does cluster autoscaling work?

What determines when the cluster autoscaling activates to add and remove workers? Also, can it be adjusted?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

> What determines when the cluster autoscaling activates to add and remove workers

During scale-down, the service removes a worker only if it is idle and does not contain any shuffle data. This allows aggressive resizing without killing tasks or recom...

  • 0 kudos
Digan_Parikh
by Valued Contributor
  • 1396 Views
  • 1 replies
  • 0 kudos

Resolved! S3 bucket mount

If you mount an S3 bucket using an AWS instance profile, does that mounted bucket become accessible to just that 1 cluster or to other clusters in that workspace as well?

Latest Reply
Digan_Parikh
Valued Contributor
  • 0 kudos

Mounts are global to all clusters, but as a best practice you can use IAM roles to prevent access to the underlying data. To take this one step further, you can use IAM credential passthrough rather than an instance profile, because instance profile can ...

  • 0 kudos
Srikanth_Gupta_
by Databricks Employee
  • 1413 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta cache is an automatic, hands-free solution that leverages the high read speeds of modern SSDs to transparently create copies of remote files in nodes’ local storage to accelerate data reads. In comparison, you have to choose what and when to cache wit...

  • 0 kudos
Digan_Parikh
by Valued Contributor
  • 1263 Views
  • 1 replies
  • 0 kudos

Resolved! Widgets - Way to validate config parameters

Can you use widgets to validate config parameters for notebooks?

Latest Reply
Digan_Parikh
Valued Contributor
  • 0 kudos

For example:

    folder = dbutils.widgets.get("Folder")
    if folder == "":
        raise Exception("Folder missing")

or to get Spark settings you can use:

    spark.conf.get("my_property")

Learn more about them here - https://docs.databricks.com/notebooks/widgets.html
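The check above can be pulled into a small reusable helper. This is only a sketch: the helper name `require_non_empty` is made up for illustration, and `dbutils` exists only inside a Databricks notebook, so the notebook call is shown as a comment.

```python
def require_non_empty(name: str, value: str) -> str:
    """Return the stripped parameter value, or raise if it is empty."""
    cleaned = (value or "").strip()
    if not cleaned:
        # Fail fast so the notebook stops before doing any work
        raise ValueError(f"{name} missing")
    return cleaned


# Inside a Databricks notebook (dbutils is provided by the runtime):
# folder = require_non_empty("Folder", dbutils.widgets.get("Folder"))
```

Raising early means a scheduled run fails at the top of the notebook with a readable message rather than partway through a job.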

  • 0 kudos
User16826992666
by Valued Contributor
  • 1551 Views
  • 1 replies
  • 0 kudos

Can you use external job scheduling tools to start and schedule Databricks jobs?

I am wondering if I have to use the Databricks jobs scheduler to kick off Databricks jobs. My company already uses another job scheduler for our workflows and it would be useful to add our Databricks jobs to that flow.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You can use external tools to schedule jobs in Databricks. Here is a blog post explaining how Databricks could be used along with Azure Data Factory. This blog explains how to use Airflow with Databricks. It is worth noting that a lot of Databricks's f...
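For a scheduler without a ready-made Databricks integration, one option is to trigger an existing job through the Jobs REST API (`POST /api/2.1/jobs/run-now`). The sketch below only builds the request; the function name, host, token, and job ID are placeholders, not values from this thread.

```python
import json
import urllib.request
from typing import Dict, Optional


def run_now_request(
    host: str,
    token: str,
    job_id: int,
    notebook_params: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a Jobs API run-now request."""
    payload = {"job_id": job_id}
    if notebook_params:
        # Optional parameters passed through to a notebook task
        payload["notebook_params"] = notebook_params
    return urllib.request.Request(
        f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# The external scheduler's task would then send it, for example:
# urllib.request.urlopen(run_now_request(HOST, TOKEN, 123))
```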

  • 0 kudos
Anonymous
by Not applicable
  • 6407 Views
  • 1 replies
  • 0 kudos

Resolved! Scheduling cluster start and stop time

I want to schedule a cluster to start in the morning and shut down by evening. How can I achieve that?

Latest Reply
Anonymous
Not applicable
  • 0 kudos

You can call the REST API to schedule cluster starts and stops from a scheduler. See https://docs.databricks.com/dev-tools/api/latest/clusters.html
Pro tip: Use code generation tools within Postman to generate scripts in the language of your choice.
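As a rough sketch of what such a scheduled script might look like, the helper below builds requests for the Clusters API 2.0 `start` and `delete` endpoints (in that API, `delete` terminates the cluster rather than permanently removing it). The function name, host, token, and cluster ID are placeholders for illustration.

```python
import json
import urllib.request


def cluster_request(host: str, token: str, action: str, cluster_id: str) -> urllib.request.Request:
    """Build (but do not send) a Clusters API request.

    action "start" -> POST /api/2.0/clusters/start
    action "stop"  -> POST /api/2.0/clusters/delete (terminates the cluster)
    """
    endpoints = {"start": "start", "stop": "delete"}
    url = f"{host.rstrip('/')}/api/2.0/clusters/{endpoints[action]}"
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# A morning cron entry could send the "start" request and an evening one "stop":
# urllib.request.urlopen(cluster_request(HOST, TOKEN, "start", CLUSTER_ID))
```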

  • 0 kudos
User16826992666
by Valued Contributor
  • 1437 Views
  • 1 replies
  • 0 kudos
Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

There shouldn't be. Generally speaking, models will be serialized according to their 'native' format for well-known libraries like TensorFlow, xgboost, sklearn, etc. Custom models will be saved with pickle. The files exist on distributed storage as ar...

  • 0 kudos
User16826992666
by Valued Contributor
  • 1385 Views
  • 1 replies
  • 0 kudos

Resolved! What is the point of the model staging and promotion functions in MLflow?

Why not just directly deploy the model where you need it in production?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

The Model Registry is mostly a workflow tool. It helps 'gate' the process, so that (for example) only authorized users can set a model to be the newest Production version of a model - that's not something just anyone should be able to do! The Registry...

  • 0 kudos
User16826992666
by Valued Contributor
  • 2156 Views
  • 1 replies
  • 0 kudos

Resolved! Should I use Z Ordering on my Delta table every time I run Optimize?

Wondering if it always makes sense, or if there are some situations where you might only want to run OPTIMIZE.

Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

It's a good idea to run OPTIMIZE at the end of each batch job to avoid a small-files problem. Z-ordering is optional and can be applied to a few non-partition columns that are used frequently in read operations. ZORDER BY -> Colocate column information in the sam...
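To make the "OPTIMIZE always, ZORDER optionally" idea concrete, here is a small sketch that builds the Delta SQL statement with or without a `ZORDER BY` clause. The helper name, table, and column names are hypothetical, and `spark` is the session a Databricks notebook provides.

```python
from typing import Optional, Sequence


def optimize_statement(table: str, zorder_cols: Optional[Sequence[str]] = None) -> str:
    """Build an OPTIMIZE statement, adding ZORDER BY only when columns are given."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        # Z-order only on non-partition columns that are often used in filters
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt


# At the end of a batch job, inside a notebook:
# spark.sql(optimize_statement("events"))                  # compaction only
# spark.sql(optimize_statement("events", ["event_type"]))  # compaction plus Z-ordering
```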

  • 0 kudos
Anonymous
by Not applicable
  • 1903 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

In this scenario, the best option would be to have a single readStream reading a source Delta table. Since checkpoint logs are controlled when writing to Delta tables, you would be able to maintain separate logs for each of your writeStreams. I would...

  • 0 kudos
User16826994223
by Honored Contributor III
  • 1017 Views
  • 1 replies
  • 0 kudos

Major changes in Spark 3.0

What are the major changes released in Spark 3.0?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Check out https://spark.apache.org/docs/latest/sql-migration-guide.html if you're looking for potentially breaking changes you need to be aware of, for any version. For a general overview of the new features, see https://databricks.com/blog/2020/06/18...

  • 0 kudos
User16857281869
by New Contributor II
  • 1253 Views
  • 1 replies
  • 0 kudos

How do I benefit from parallelisation when doing machine learning?

There are in principle four distinct ways of using parallelisation when doing machine learning. Any combination of these can speed up the whole pipeline significantly.
1) Using Spark distributed processing in feature engineering
2) When the data set...

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Good summary! Yes, those are the main strategies I can think of.

  • 0 kudos
User16826992666
by Valued Contributor
  • 1882 Views
  • 2 replies
  • 0 kudos
Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

You do not have to cache anything to make it work. You would decide that based on whether you want to spend memory/storage to avoid recomputing the DataFrame, like when you may use it in multiple operations afterwards.

  • 0 kudos
1 More Replies
