Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

User16752241457
by New Contributor II
  • 1941 Views
  • 1 reply
  • 0 kudos

Saving display() plots

Is there an easy way to save the plots generated by the display() command?

Latest Reply
User16788317454
New Contributor III
  • 0 kudos

Plots generated via the display() command are automatically saved under /FileStore/plots. See the documentation for more info: https://docs.databricks.com/data/filestore.html#filestore. However, perhaps an easier approach to save/revisit plots is to u...

User16788317454
by New Contributor III
  • 1217 Views
  • 1 reply
  • 0 kudos
Latest Reply
j_weaver
New Contributor III
  • 0 kudos

If you are talking about distributed training of a single XGBoost model, there is no built-in capability in SparkML. SparkML supports gradient boosted trees, but not XGBoost specifically. However, there are third-party packages, such as XGBoost4J, that ...

j_weaver
by New Contributor III
  • 1446 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16788317454
New Contributor III
  • 0 kudos

With Spark, there are a few ways you can scale your model:
  • Training
  • Hyperparameter tuning
  • Inference
If you're looking to train one model across multiple workers, you can leverage Horovod. It's an open source project designed to simplify distributed neur...

jose_gonzalez
by Databricks Employee
  • 1191 Views
  • 2 replies
  • 0 kudos

Cluster goes unresponsive after installing a library

Right after I install a library on my cluster, the cluster becomes unresponsive and nothing runs. How can I solve this issue?

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

It is a standard cluster, and it happens with all libraries. Is there a way to debug this or show the error messages, if any?

1 More Replies
j_weaver
by New Contributor III
  • 1236 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16752246141
New Contributor III
  • 0 kudos

pandas works for single-machine computations, so any pandas code you write on Databricks will run on the driver of the cluster. PySpark and Koalas are both distributed frameworks for when you have large datasets. You can use PySpark and Koalas inte...

Joseph_B
by Databricks Employee
  • 944 Views
  • 1 reply
  • 0 kudos

When doing hyperparameter tuning with Hyperopt, when should I use SparkTrials? Does it work with both single-machine ML (like sklearn) and distributed ML (like Apache Spark ML)?

I want to know how to use Hyperopt in different situations:
  • Tuning a single-machine algorithm from scikit-learn or single-node TensorFlow
  • Tuning a distributed algorithm from Spark ML or distributed TensorFlow / Horovod

Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

The right question to ask is indeed: Is the algorithm you want to tune single-machine or distributed?
If it's a single-machine algorithm like any from scikit-learn, then you can use SparkTrials with Hyperopt to distribute hyperparameter tuning.
If it's...

FrancisLau1897
by New Contributor
  • 20393 Views
  • 7 replies
  • 0 kudos

Getting "java.lang.ClassNotFoundException: Failed to find data source: xml" error when loading XML

Both the following commands fail:
    df1 = sqlContext.read.format("xml").load(loadPath)
    df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)
with the following error message: java.lang.ClassNotFoundException: Failed to find data sour...

Latest Reply
alvaroagx
New Contributor II
  • 0 kudos

Hi, if you are getting this error, it is because the com.sun.xml.bind library is now obsolete. You need to create a library from the org.jvnet.jaxb2.maven package on Maven Central and attach it to the cluster. You will then be able to use xml...

6 More Replies
User16826988857
by Databricks Employee
  • 2609 Views
  • 0 replies
  • 0 kudos

How to allow Table deletion without requiring ownership on table?

Problem Description
In DBR 6 (and earlier), a non-admin user can delete a table that the user doesn't own, as long as the user has ownership on the table's parent database (perhaps throu...

Digan_Parikh
by Valued Contributor
  • 1689 Views
  • 0 replies
  • 0 kudos

Widgets - Way to validate config parameters

Yes, you can use the widgets API to validate the input before you pass the values to the rest of your code. For example:
    folder = dbutils.widgets.get("Folder")
    if folder == "":
        raise Exception("Folder missing")
or to get spark se...
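
The check in the post can be wrapped in a small helper so the same validation is reused for several widgets. This is only a sketch: the helper name `require_non_empty` is hypothetical, and it is written as plain Python so it runs outside a notebook. In a notebook the value would come from `dbutils.widgets.get`, as in the post.

```python
def require_non_empty(name: str, value: str) -> str:
    """Return the stripped widget value, or raise if nothing was supplied.

    `name` is only used to build the error message.
    """
    cleaned = value.strip()
    if not cleaned:
        raise ValueError(f"{name} missing")
    return cleaned


# In a notebook (assumption: a text widget named "Folder" already exists):
#   folder = require_non_empty("Folder", dbutils.widgets.get("Folder"))
```

Raising early means the notebook stops before any downstream cell runs with a blank parameter.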

Anonymous
by Not applicable
  • 10521 Views
  • 1 reply
  • 0 kudos

Resolved! Ideal number and size of partitions

By default, Spark uses 200 partitions when a transformation shuffles data. 200 partitions might be too many if a user is working with small data, which can slow down the query. Conversely, 200 partitions might be too few if the data is big. So ho...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...
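
The sample code the reply refers to is not included in this excerpt. As a rough sketch of the general idea only, one common heuristic divides the amount of data being shuffled by a target partition size. The 128 MB target and the function name `estimate_shuffle_partitions` are assumptions for illustration, not values taken from the post.

```python
import math


def estimate_shuffle_partitions(shuffle_bytes: int,
                                target_partition_bytes: int = 128 * 1024 * 1024,
                                min_partitions: int = 1) -> int:
    """Estimate a shuffle partition count from the data volume.

    Heuristic (assumption): aim for about `target_partition_bytes` per partition.
    """
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("shuffle_bytes must be >= 0 and target_partition_bytes > 0")
    # Round up so the last, partially filled partition is still counted.
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# In a Spark session the result could then be applied with:
#   spark.conf.set("spark.sql.shuffle.partitions",
#                  str(estimate_shuffle_partitions(size_in_bytes)))
```

The right target size depends on the workload and executor memory, so the default here should be treated as a starting point to tune.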

Anonymous
by Not applicable
  • 2882 Views
  • 2 replies
  • 1 kudos

Resolved! Difference between Delta Live Tables and Multitask Jobs

When should I use one over the other? There seems to be some overlap in functionality.

Latest Reply
aladda
Databricks Employee
  • 1 kudos

Delta Live Tables are targeted towards building an ETL pipeline where several Delta tables are interconnected from a flow perspective and in a single notebook. Multi-task Jobs is a more generic orchestration framework that allows you to execute various...

1 More Replies
User16783855117
by Contributor II
  • 825 Views
  • 0 replies
  • 0 kudos

Is there a way to know if Adaptive Query Execution with Spark 3 has changed my Spark plan?

From the demo notebook located here (https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html), it seems that the way to demonstrate that AQE was working was to first calculate the Spark query plan before r...

RonanStokes_DB
by Databricks Employee
  • 1130 Views
  • 1 reply
  • 0 kudos

How can I prevent users from incurring excessive costs for jobs?

If users are allowed to create clusters, how can an operations team prevent them from incurring excessive costs?

Latest Reply
RonanStokes_DB
Databricks Employee
  • 0 kudos

Cluster policies can be used to constrain the node types available to users when they create clusters, the number of nodes they can use, and the maximum DBU consumption allowed. The following resources provide further information:...
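
As an illustration only, a policy definition along the following lines could express the three constraints mentioned in the reply. The attribute names are intended to follow the cluster policy definition format and should be checked against the cluster policy documentation. The node types and limits are placeholder values, not recommendations.

```python
import json

# Hypothetical policy definition; node types and limits are placeholders.
policy_definition = {
    # Restrict which node types users may select.
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # Cap the number of workers on a fixed-size cluster.
    "num_workers": {"type": "range", "maxValue": 8},
    # Cap the estimated DBU consumption per hour.
    "dbus_per_hour": {"type": "range", "maxValue": 50},
}

# Policies are submitted as a JSON document.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```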

