Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Joseph_B
by Databricks Employee
  • 2312 Views
  • 1 replies
  • 0 kudos

When doing hyperparameter tuning with Hyperopt, when should I use SparkTrials? Does it work with both single-machine ML (like sklearn) and distributed ML (like Apache Spark ML)?

I want to know how to use Hyperopt in different situations:
  • Tuning a single-machine algorithm from scikit-learn or single-node TensorFlow
  • Tuning a distributed algorithm from Spark ML or distributed TensorFlow / Horovod

Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

The right question to ask is indeed: Is the algorithm you want to tune single-machine or distributed? If it's a single-machine algorithm like any from scikit-learn, then you can use SparkTrials with Hyperopt to distribute hyperparameter tuning. If it's...
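
For the single-machine case, a minimal sketch of distributing trials with SparkTrials might look like the following. The toy objective, the search range, and the names `objective` and `tune_with_spark_trials` are assumptions made for illustration; in practice the objective would fit and score a scikit-learn model. The Hyperopt import sits inside the function so the objective can be checked without a cluster.

```python
def objective(params):
    """Toy single-machine objective (hypothetical): loss is lowest at x = 3."""
    x = params["x"]
    return (x - 3.0) ** 2


def tune_with_spark_trials(max_evals=20, parallelism=4):
    """Run Hyperopt with SparkTrials so each trial runs as a Spark task.

    Assumes Hyperopt and an active Spark session are available, as on a
    Databricks ML runtime. Returns the best parameters found by fmin.
    """
    # Imported here so the objective above can be tested without Hyperopt.
    from hyperopt import SparkTrials, fmin, hp, tpe

    space = {"x": hp.uniform("x", -10.0, 10.0)}
    trials = SparkTrials(parallelism=parallelism)
    return fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
    )
```

On a cluster, calling `tune_with_spark_trials()` from a notebook would evaluate up to `parallelism` trials at a time on the workers.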

FrancisLau1897
by New Contributor
  • 24419 Views
  • 7 replies
  • 0 kudos

Getting "java.lang.ClassNotFoundException: Failed to find data source: xml" error when loading XML

Both of the following commands fail:

df1 = sqlContext.read.format("xml").load(loadPath)
df2 = sqlContext.read.format("com.databricks.spark.xml").load(loadPath)

with the following error message: java.lang.ClassNotFoundException: Failed to find data sour...

Latest Reply
alvaroagx
New Contributor II
  • 0 kudos

Hi, if you are getting this error, it is because the com.sun.xml.bind library is now obsolete. You need to install the org.jvnet.jaxb2.maven package as a library from Maven Central and attach it to the cluster. Then you are going to be able to use xml...

6 More Replies
Digan_Parikh
by Databricks Employee
  • 2571 Views
  • 0 replies
  • 0 kudos

Widgets - Way to validate config parameters

Yes, you can use the widgets API to validate the input before you pass the values to the rest of your code. For example:

folder = dbutils.widgets.get("Folder")
if folder == "":
    raise Exception("Folder missing")

or to get spark se...
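
The same check can be pulled into a small reusable helper. This is only a sketch: `require_non_empty` is a hypothetical name, and treating whitespace-only input as empty is an added assumption, not part of the original post.

```python
def require_non_empty(name, value):
    """Return the stripped widget value, or raise if it is missing or blank."""
    if value is None or value.strip() == "":
        raise ValueError(f"{name} missing")
    return value.strip()


# In a Databricks notebook, where dbutils is predefined:
#   folder = require_non_empty("Folder", dbutils.widgets.get("Folder"))
```

Keeping the validation in a plain function means it can be tested outside a notebook, with the `dbutils` call made only at the point of use.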

Anonymous
by Not applicable
  • 19139 Views
  • 1 replies
  • 0 kudos

Resolved! Ideal number and size of partitions

By default, Spark uses 200 shuffle partitions (spark.sql.shuffle.partitions) for transformations that shuffle data. 200 partitions might be too many if a user is working with small data, which can slow down the query. Conversely, 200 partitions might be too few if the data is big. So ho...

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...
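
The sample from the original reply did not survive in this listing. As a rough sketch of one common way to size the setting, the helper below divides the data volume by a target partition size. The 128 MiB target and the name `estimate_shuffle_partitions` are assumptions chosen for illustration, not values taken from the reply.

```python
import math


def estimate_shuffle_partitions(total_bytes,
                                target_partition_bytes=128 * 1024 * 1024,
                                min_partitions=1):
    """Estimate a shuffle partition count from the data volume.

    The 128 MiB default target is an assumed rule of thumb; tune it
    for your own workload.
    """
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))


# In a notebook with an active SparkSession named `spark`:
#   n = estimate_shuffle_partitions(50 * 1024**3)  # about 50 GiB shuffled
#   spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

The result is only a starting point; the actual shuffle size depends on the query, so it is worth checking the stage metrics in the Spark UI.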

Anonymous
by Not applicable
  • 4533 Views
  • 2 replies
  • 1 kudos

Resolved! Difference between Delta Live Tables and Multitask Jobs

When should I use one over the other? There seems to be an overlap of some functionality

Latest Reply
aladda
Databricks Employee
  • 1 kudos

Delta Live Tables is targeted at building an ETL pipeline in which several Delta tables are interconnected from a flow perspective, within a single notebook. Multi-task Jobs is a more generic orchestration framework that allows you to execute various...

1 More Replies
User16783855117
by Databricks Employee
  • 2075 Views
  • 0 replies
  • 0 kudos

Is there a way to know if Adaptive Query Execution with Spark 3 has changed my Spark plan?

From the demo notebook located here (https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html), it seems that the way to demonstrate that AQE was working was to first calculate the Spark query plan before r...

RonanStokes_DB
by Databricks Employee
  • 1953 Views
  • 1 replies
  • 0 kudos

How can I prevent users from incurring excessive costs for jobs?

If users are allowed to create clusters, how can an operations team prevent them from incurring excessive costs?

Latest Reply
RonanStokes_DB
Databricks Employee
  • 0 kudos

Cluster policies can be used to constrain the node types available to users when they create clusters, the number of nodes they can use, and the maximum DBU consumption. The following resources provide further information:...
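
As a hedged illustration of those three constraints, a policy definition could be built as a Python dictionary and serialized to the JSON that a cluster policy expects. The node types and limits below are placeholder values chosen for the example, not recommendations.

```python
import json

# Hypothetical policy definition; the values are placeholders.
policy_definition = {
    # Restrict which node types users may pick.
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # Cap the number of workers an autoscaling cluster may reach.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Cap the hourly DBU consumption of any cluster created under the policy.
    "dbus_per_hour": {"type": "range", "maxValue": 50},
}

# JSON text that could be pasted into the policy editor or sent to the API.
policy_json = json.dumps(policy_definition, indent=2)
```

Users who are granted only this policy, and not unrestricted cluster creation, can create clusters only within these bounds.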

User16826994223
by Databricks Employee
  • 1458 Views
  • 1 replies
  • 0 kudos

Stream is not getting started from Kafka after 2 hours of cluster start

Hi Team, I am setting up a Kafka cluster on Databricks to ingest data into Delta. The cluster has been running for the last 2 hours, but the stream has still not started, and I am not seeing any failures either.

Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

This type of issue happens if you have a firewall on your cloud account and your IP is not whitelisted. Please whitelist the IP and the issue will be resolved.

User16783853032
by Databricks Employee
  • 3178 Views
  • 1 replies
  • 0 kudos

Databricks notebook command gets cancelled: Generally when cluster is having init scripts or lib issues while starting cluster. Exact error can be look...

Databricks notebook command gets cancelled: This generally happens when the cluster has init script or library issues while starting. The exact error can be found in the driver logs.

Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

Awesome knowledge

User16826994223
by Databricks Employee
  • 2059 Views
  • 1 replies
  • 0 kudos

Azure Databricks with Storage Account as data layer and DBFS understanding

What is the difference between ADLS mounted on Databricks and DBFS? Does mounting ADLS on Databricks give any performance benefit? Does the mounted ADLS still behave as object storage, or does it become simple storage?

Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

DBFS is just an abstraction on top of cloud storage. By default, when you create a workspace, you get an instance of DBFS, the so-called DBFS root. You can also mount additional storage accounts under the /mnt folder. Data written to mount point paths (/mnt) is...

User16826994223
by Databricks Employee
  • 6880 Views
  • 1 replies
  • 0 kudos

How to convert DataFrame into JSON on Databricks?

Can I convert my JDBC DataFrame into JSON? When I tried, I got an error. I'm using the pandas DataFrame function df.to_json() in my script.

Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

df.toJSON()
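
To expand on the one-line answer: a DataFrame read over JDBC is a Spark DataFrame, so the pandas method `to_json()` is not available on it, which is presumably the cause of the error. A minimal sketch of two Spark-side options follows; the helper names, the row limit, and the overwrite mode are assumptions made for illustration.

```python
def spark_df_to_json_strings(df, limit=1000):
    """Return up to `limit` rows of a Spark DataFrame as JSON strings.

    DataFrame.toJSON() yields an RDD with one JSON document per row;
    collect() brings those strings to the driver, so keep the limit small.
    """
    return df.limit(limit).toJSON().collect()


def write_spark_df_as_json(df, path):
    """Write the whole DataFrame as JSON files under `path` (hypothetical path)."""
    df.write.mode("overwrite").json(path)
```

For small results, `df.toPandas().to_json()` is another option, at the cost of pulling all rows onto the driver.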
