Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826992666
by Valued Contributor
  • 2567 Views
  • 1 reply
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Standard tiers are allowed to have 1000 saved jobs. Premium tiers have a higher limit of 1500. Some clouds have an enterprise tier, which has a saved job limit of 2000. A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests respon...
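
A minimal sketch of handling that 429 when triggering runs through the Jobs API, assuming a workspace URL in DATABRICKS_HOST and a personal access token in DATABRICKS_TOKEN (both environment variable names are illustrative):

import os
import time
import requests

def run_job_with_backoff(job_id, max_retries=5):
    # Trigger a job run; back off and retry while the workspace is throttling (HTTP 429).
    url = f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now"
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json={"job_id": job_id})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()  # contains the run_id on success
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError("Jobs API still throttled after retries")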

User16826992185
by Databricks Employee
  • 5228 Views
  • 1 reply
  • 0 kudos

Delta vs. Parquet

I'm curious about the benefits of using the Delta file format vs. Parquet. Is there any downside to using Delta?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Not really. You get upsides like transactions, time travel, upsert/merge/deletes. There is some cost to that, as Delta manages that by writing and managing many smaller Parquet files and has to re-read them to recreate the current or past state of th...
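
A quick sketch of the trade-off in a notebook, assuming df is an existing DataFrame and the paths are placeholders:

# Plain Parquet: fast columnar files, but no transaction log.
df.write.format("parquet").mode("overwrite").save("/tmp/demo_parquet")

# Delta: the same Parquet files underneath, plus a _delta_log that enables
# ACID transactions, MERGE/UPDATE/DELETE, and time travel.
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Time travel: read an earlier version of the Delta table.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")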

sajith_appukutt
by Honored Contributor II
  • 2175 Views
  • 1 reply
  • 0 kudos

Resolved! I have a streaming aggregation query with highly variable micro-batch processing times. Seeing a lot of GC pauses in the logs. Any pointers on how to debug?

Though the data volume is relatively even, the streaming aggregation query is showing highly variable micro-batch processing times.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

By default, the state data for a streaming aggregation query is maintained in the JVM memory of the executors, and a large number of state objects could put memory pressure on the JVM, causing high GC pauses. If you have stateful operations in your streamin...
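
One common mitigation is to move streaming state out of the JVM heap. A sketch, assuming a Databricks runtime that ships the RocksDB state store provider and that events is a placeholder streaming DataFrame with an event_time column:

from pyspark.sql.functions import window

# Keep streaming state in RocksDB on local disk instead of executor JVM memory,
# which reduces GC pressure from large numbers of state objects.
# (Provider class name as documented for Databricks runtimes; adjust to your runtime.)
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)

# A watermark also bounds how much state the aggregation has to retain.
agg = (events
       .withWatermark("event_time", "30 minutes")
       .groupBy(window("event_time", "10 minutes"), "key")
       .count())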

Anonymous
by Not applicable
  • 2191 Views
  • 1 reply
  • 0 kudos
Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

DBFS is the "Databricks File System", but really it's just a shim/wrapper on top of distributed storage that makes files in S3 or ADLS look like local files under the path /dbfs/... This can be really useful when working with libraries that do not...
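
A small example of what that shim buys you (paths are illustrative):

# Spark and dbutils address the storage through the dbfs:/ scheme...
dbutils.fs.ls("dbfs:/tmp/")

# ...while plain Python libraries that only understand local files can use /dbfs/.
with open("/dbfs/tmp/example.txt", "w") as f:
    f.write("written with ordinary file APIs, stored in cloud storage")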

User16826992666
by Valued Contributor
  • 9484 Views
  • 1 reply
  • 1 kudos

Resolved! When should I choose a different driver type on my cluster vs the worker type?

When creating a cluster, the driver type defaults to the same type as the workers, and this is what I usually choose. But in what kind of situation would I want to choose a different driver type?

Latest Reply
sean_owen
Databricks Employee
  • 1 kudos

Using the same instance type is a fine default. If you know that you need very large workers, but little happens on the driver, maybe you can save money with a smaller driver. Conversely, you may know that some parts of your notebook involve a lot of...
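
For reference, the Clusters API lets you set the driver and worker node types independently; a hedged sketch of the relevant fields (runtime version, instance types, and sizes are just examples):

cluster_spec = {
    "cluster_name": "etl-heavy-workers",
    "spark_version": "13.3.x-scala2.12",  # example runtime
    "node_type_id": "i3.2xlarge",         # large workers for the distributed work
    "driver_node_type_id": "i3.xlarge",   # smaller driver if little runs on it
    "num_workers": 8,
}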

User16826992666
by Valued Contributor
  • 2737 Views
  • 1 reply
  • 0 kudos

Resolved! Is there a limit to the number of data points displayed in notebook visualizations?

I know that when you display the results of queries in notebooks there is a limit to the number of rows that are shown. Is there a similar limit to the results that are displayed in visuals within notebooks?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Yes, still limited to 1000 rows / data points. However, when your visualization involves things like sums or averages of a Spark DataFrame's result, those will be performed on the cluster, so they may involve many more than 1000 data points, even ...
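
For example, an aggregation like the following runs on the cluster over the full dataset, and only the already-small grouped result is subject to the display limit (df and the column names are illustrative):

from pyspark.sql import functions as F

# The groupBy/avg is computed on the cluster over every row of df;
# only the aggregated result (one row per country) hits the 1000-row display limit.
display(df.groupBy("country").agg(F.avg("amount").alias("avg_amount")))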

User16826992666
by Valued Contributor
  • 9240 Views
  • 1 reply
  • 0 kudos

Resolved! When should I use single node clusters vs standard?

I see that single node is a cluster mode option that I have when creating clusters. When should I use this compared to the standard mode?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Single-node, like the name implies, is a single machine. It still has Spark, just a local cluster. This is a good choice if you are running a workload that does not use Spark, or only needs it for data access. One good example is a small deep learnin...

User16826992666
by Valued Contributor
  • 2441 Views
  • 1 reply
  • 0 kudos
Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

You don't have to. If you don't have a huge data set, there may not be much value in Spark ML over anything else. There are also other distributed modeling libraries that work on Spark like xgboost, and Horovod + TF, Keras, Pytorch. Spark ML is a goo...

User16826992666
by Valued Contributor
  • 10401 Views
  • 2 replies
  • 0 kudos

Why do Spark MLlib models only accept a vector column as input?

In other libraries I can just use the feature columns themselves as inputs, why do I need to make a vector out of my features when I use MLlib?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Yeah, it's more a design choice. Rather than have every implementation take column(s) params, this is handled once in VectorAssembler for all of them. One way or the other, most implementations need a vector of inputs anyway. VectorAssembler can do s...
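
A short sketch of that pattern, with made-up column names and a placeholder train_df:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Collapse the individual feature columns into the single vector column
# that MLlib estimators expect.
assembler = VectorAssembler(inputCols=["age", "income", "tenure"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train_df)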

1 More Replies
User16826992666
by Valued Contributor
  • 3119 Views
  • 1 reply
  • 0 kudos

Resolved! How does cluster autoscaling work?

What determines when the cluster autoscaling activates to add and remove workers? Also, can it be adjusted?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

> What determines when the cluster autoscaling activates to add and remove workers
During scale-down, the service removes a worker only if it is idle and does not contain any shuffle data. This allows aggressive resizing without killing tasks or recom...
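
As for adjusting it: the autoscaling range is part of the cluster configuration. A hedged sketch of the autoscale block in a Clusters API payload (runtime version and instance type are examples):

cluster_spec = {
    "cluster_name": "autoscaling-etl",
    "spark_version": "13.3.x-scala2.12",  # example runtime
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 10},  # bounds for scale-up/scale-down
}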

Digan_Parikh
by Valued Contributor
  • 1835 Views
  • 1 reply
  • 0 kudos

Resolved! S3 bucket mount

If you mount an S3 bucket using an AWS instance profile, does that mounted bucket become accessible to just that 1 cluster or to other clusters in that workspace as well?

Latest Reply
Digan_Parikh
Valued Contributor
  • 0 kudos

Mounts are global to all clusters but as a best practice, you can use IAM roles to prevent access to the underlying data. To take this one step further, you can use IAM credential passthrough rather than instance profile because instance profile can ...
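
For reference, a mount created like this (bucket name and mount point are placeholders) is visible from every cluster in the workspace, which is why access control has to come from IAM rather than from the mount itself:

# Mount once; any cluster in the workspace can then read /mnt/my-data,
# subject to the IAM permissions of the instance profile or credentials it runs with.
dbutils.fs.mount(
    source="s3a://my-example-bucket",
    mount_point="/mnt/my-data",
)
display(dbutils.fs.ls("/mnt/my-data"))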

Srikanth_Gupta_
by Databricks Employee
  • 1832 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta cache is an automatic hands-free solution that leverages high read speeds of modern SSDs to transparently create copies of remote files in nodes’ local storage to accelerate data reads. In comparison, you have to choose what and when to cache wit...
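
A sketch of the difference in practice (the table path is illustrative):

# Delta/disk cache: one setting, and remote Parquet/Delta reads are copied
# to the workers' local SSDs automatically.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Spark cache: you decide which DataFrame to pin and when.
df = spark.read.format("delta").load("/mnt/my-data/events")
df.cache()
df.count()  # an action materializes the Spark cache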

Digan_Parikh
by Valued Contributor
  • 1533 Views
  • 1 reply
  • 0 kudos

Resolved! Widgets - Way to validate config parameters

Can you use widgets to validate config parameters for notebooks?

Latest Reply
Digan_Parikh
Valued Contributor
  • 0 kudos

For example:
folder = dbutils.widgets.get("Folder")
if folder == "":
    raise Exception("Folder missing")
or to get spark settings you can use:
spark.conf.get("my_property")
Learn more about them here - https://docs.databricks.com/notebooks/widgets.html
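
For completeness, the "Folder" widget that the snippet reads can be created with dbutils.widgets.text:

# Create a text widget named "Folder" with an empty default,
# so the validation above fails fast when no value is supplied.
dbutils.widgets.text("Folder", "", "Folder to process")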

User16826992666
by Valued Contributor
  • 1778 Views
  • 1 reply
  • 0 kudos

Can you use external job scheduling tools to start and schedule Databricks jobs?

I am wondering if I have to use the Databricks jobs scheduler to kick off Databricks jobs. My company already uses another job scheduler for our workflows and it would be useful to add our Databricks jobs to that flow.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could use external tools to schedule jobs in Databricks. Here is a blog post explaining how Databricks could be used along with Azure Data Factory. This blog explains how to use Airflow with Databricks. It is worth noting that a lot of Databricks's f...
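
As a sketch of the Airflow route, using the Databricks provider package (the connection ID and job ID are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG("trigger_databricks_job", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Kicks off an existing Databricks job from the external scheduler.
    run_job = DatabricksRunNowOperator(
        task_id="run_databricks_job",
        databricks_conn_id="databricks_default",
        job_id=123,  # illustrative job ID
    )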

Anonymous
by Not applicable
  • 7558 Views
  • 1 reply
  • 0 kudos

Resolved! Scheduling cluster start and stop time

I want to schedule cluster to start in the morning and shut down by evening. How can I achieve that?

Latest Reply
Anonymous
Not applicable
  • 0 kudos

You can call the REST API to schedule cluster starts and stops from a scheduler. See https://docs.databricks.com/dev-tools/api/latest/clusters.html
PRO Tip: Use code generation tools within Postman to generate scripts in the language of your choice.
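
A minimal sketch of those two calls in Python (environment variable names and the cluster ID are placeholders), which an external scheduler or cron job could run in the morning and evening:

import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def start_cluster(cluster_id):
    # Morning: bring the cluster up.
    requests.post(f"{HOST}/api/2.0/clusters/start",
                  headers=HEADERS, json={"cluster_id": cluster_id}).raise_for_status()

def stop_cluster(cluster_id):
    # Evening: terminate it (clusters/delete terminates the cluster; it does not remove the config).
    requests.post(f"{HOST}/api/2.0/clusters/delete",
                  headers=HEADERS, json={"cluster_id": cluster_id}).raise_for_status()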
