Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

User16752241457
by New Contributor II
  • 1407 Views
  • 1 reply
  • 0 kudos

Saving display() plots

Is there an easy way to save the plots generated by the display() command?

Latest Reply
User16788317454
New Contributor III
  • 0 kudos

Plots generated via the display() command are automatically saved under /FileStore/plots. See the documentation for more info: https://docs.databricks.com/data/filestore.html#filestore. However, perhaps an easier approach to save/revisit plots is to u...

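Beyond the auto-saved copies under /FileStore/plots, one option is to save the figure yourself with matplotlib. A minimal sketch, assuming a standard matplotlib setup; the `save_plot` helper and the output directory are hypothetical (on Databricks you could point `out_dir` at "/dbfs/FileStore/plots", assuming the DBFS fuse mount is available):

```python
import os
import tempfile
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

def save_plot(fig, out_dir, name):
    """Save a matplotlib figure as a PNG and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}.png")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    return path

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_title("demo")

# A temp dir stands in here for "/dbfs/FileStore/plots"
saved = save_plot(fig, tempfile.mkdtemp(), "demo_plot")
```

Files written under /dbfs/FileStore are then downloadable via the /files/ URL path of the workspace.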
User16788317454
by New Contributor III
  • 872 Views
  • 1 reply
  • 0 kudos
Latest Reply
j_weaver
New Contributor III
  • 0 kudos

If you are talking about distributed training of a single XGBoost model, there is no built-in capability in SparkML. SparkML supports gradient boosted trees, but not XGBoost specifically. However, there are 3rd party packages, such as XGBoost4J that ...

j_weaver
by New Contributor III
  • 1052 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16788317454
New Contributor III
  • 0 kudos

With Spark, there are a few ways you can scale your model: training, hyperparameter tuning, and inference. If you're looking to train one model across multiple workers, you can leverage Horovod. It's an open source project designed to simplify distributed neur...

jose_gonzalez
by Moderator
  • 882 Views
  • 2 replies
  • 0 kudos

Cluster goes unresponsive after installing a library

Right after I install a library in my cluster, my cluster goes unresponsive and nothing runs. How to solve this issue?

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

It is a standard cluster, and it is happening for all libraries. Is there a way to debug or show the error messages, if any?

1 More Replies
j_weaver
by New Contributor III
  • 865 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16752246141
New Contributor III
  • 0 kudos

Pandas works for single-machine computations, so any pandas code you write on Databricks will run on the driver of the cluster. PySpark and Koalas are both distributed frameworks for when you have large datasets. You can use PySpark and Koalas inte...

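The point about API similarity can be made concrete with a small groupby. A minimal sketch in plain pandas (single machine, runs on the driver); with Koalas / pyspark.pandas the same code is nearly identical (`import pyspark.pandas as ps` in place of pandas) but executes distributed across the cluster:

```python
import pandas as pd

# Single-machine pandas: this whole DataFrame lives in driver memory.
df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10, 20, 30]})

# The same line works unchanged on a pyspark.pandas DataFrame,
# where the groupby runs as a distributed Spark job instead.
avg = df.groupby("dept")["salary"].mean()
```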
Joseph_B
by New Contributor III
  • 665 Views
  • 1 reply
  • 0 kudos

When doing hyperparameter tuning with Hyperopt, when should I use SparkTrials? Does it work with both single-machine ML (like sklearn) and distributed ML (like Apache Spark ML)?

I want to know how to use Hyperopt in different situations: tuning a single-machine algorithm from scikit-learn or single-node TensorFlow, and tuning a distributed algorithm from Spark ML or distributed TensorFlow / Horovod.

Latest Reply
Joseph_B
New Contributor III
  • 0 kudos

The right question to ask is indeed: is the algorithm you want to tune single-machine or distributed? If it's a single-machine algorithm like any from scikit-learn, then you can use SparkTrials with Hyperopt to distribute hyperparameter tuning. If it's...

FrancisLau1897
by New Contributor
  • 17057 Views
  • 7 replies
  • 0 kudos

Getting "java.lang.ClassNotFoundException: Failed to find data source: xml" error when loading XML

Both of the following commands fail: df1 = sqlContext.read.format("xml").load(loadPath) and df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath), with the following error message: java.lang.ClassNotFoundException: Failed to find data sour...

Latest Reply
alvaroagx
New Contributor II
  • 0 kudos

Hi, if you are getting this error, it is because the com.sun.xml.bind library is now obsolete. You need to download the org.jvnet.jaxb2.maven package into a library using Maven Central and attach it to a cluster. Then you will be able to use xml...

6 More Replies
User16826988857
by New Contributor
  • 1595 Views
  • 0 replies
  • 0 kudos

How to allow Table deletion without requiring ownership on table?

Problem Description: In DBR 6 (and earlier), a non-admin user can delete a table that the user doesn't own, as long as the user has ownership on the table's parent database (perhaps throu...

Anonymous
by Not applicable
  • 7125 Views
  • 1 reply
  • 0 kudos

Resolved! Ideal number and size of partitions

Spark by default uses 200 partitions when doing transformations. The 200 partitions might be too large if a user is working with small data, hence it can slow down the query. Conversely, the 200 partitions might be too small if the data is big. So ho...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is some sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...

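A minimal sketch of the kind of calculation the reply alludes to, assuming a common rule of thumb of roughly 128 MB of shuffle data per partition (the helper name and the target size are illustrative, not the reply's exact code):

```python
import math

def suggested_shuffle_partitions(total_shuffle_bytes,
                                 target_bytes=128 * 1024 * 1024):
    """Return a partition count aiming for ~target_bytes per partition."""
    return max(1, math.ceil(total_shuffle_bytes / target_bytes))

# e.g. ~10 GB of shuffle data
n = suggested_shuffle_partitions(10 * 1024**3)

# On Databricks you would then set, before the wide transformation:
# spark.conf.set("spark.sql.shuffle.partitions", n)
```

For very small data this clamps to 1 partition instead of the default 200, and for large data it scales the count up with the volume.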
Anonymous
by Not applicable
  • 2160 Views
  • 2 replies
  • 1 kudos

Resolved! Difference between Delta Live Tables and Multitask Jobs

When should I use one over the other? There seems to be an overlap of some functionality

Latest Reply
aladda
Honored Contributor II
  • 1 kudos

Delta Live Tables are targeted towards building an ETL pipeline where several Delta tables are interconnected from a flow perspective, in a single notebook. Multi-task Jobs is a more generic orchestration framework that allows you to execute various...

1 More Replies
User16783855117
by Contributor II
  • 567 Views
  • 0 replies
  • 0 kudos

Is there a way to know if Adaptive Query Execution with Spark 3 has changed my Spark plan?

From the demo notebook located here (https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html), it seems the approach to demonstrating that AQE was working was to first calculate the Spark query plan before r...

RonanStokes_DB
by New Contributor III
  • 808 Views
  • 1 reply
  • 0 kudos

How can I prevent users from consuming excessive costs for jobs?

If users are allowed to create clusters, how can an operations team prevent them from consuming excessive costs?

Latest Reply
RonanStokes_DB
New Contributor III
  • 0 kudos

Cluster policies can be used to constrain the node types available to users when creating clusters, the number of nodes they can use, and the maximum DBU consumption. The following resources provide further information:...

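A minimal sketch of what such a cluster policy definition could look like, assuming the documented policy attribute names (node_type_id, autoscale.max_workers, and the virtual dbus_per_hour attribute); the specific instance types and limits here are illustrative only:

```json
{
  "node_type_id": {
    "type": "allowlist",
    "values": ["i3.xlarge", "i3.2xlarge"]
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 10
  },
  "dbus_per_hour": {
    "type": "range",
    "maxValue": 50
  }
}
```

Users assigned this policy can only create clusters whose configuration satisfies every rule, which caps both cluster size and hourly DBU spend.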