Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16765131552
by Contributor III
  • 577 Views
  • 0 replies
  • 0 kudos

docs.databricks.com

Best practices for Databricks pools - Databricks Documentation
Learn best practices for configuring and using Databricks pools.
https://docs.databricks.com/clusters/instance-pools/pool-best-practices.html
Best practices for Azure Databricks pools - Azur...

User16765131552
by Contributor III
  • 429 Views
  • 0 replies
  • 0 kudos

docs.databricks.com

Best practices: Cluster configuration | Databricks on AWS
Learn best practices when creating and configuring Databricks clusters.
https://docs.databricks.com/clusters/cluster-config-best-practices.html

User16765131552
by Contributor III
  • 439 Views
  • 0 replies
  • 0 kudos

docs.gcp.databricks.com

Best practices | Databricks on Google Cloud
Learn best practices when using or administering Databricks.
https://docs.gcp.databricks.com/best-practices-index.html

User16765131552
by Contributor III
  • 413 Views
  • 0 replies
  • 0 kudos

docs.microsoft.com

Best practices - Azure Databricks
Learn best practices when using or administering Azure Databricks.
https://docs.microsoft.com/en-us/azure/databricks/best-practices-index

User16765131552
by Contributor III
  • 468 Views
  • 0 replies
  • 0 kudos

docs.databricks.com

Best practices | Databricks on AWS
Learn best practices when using or administering Databricks.
https://docs.databricks.com/best-practices-index.html

User16826994223
by Honored Contributor III
  • 916 Views
  • 0 replies
  • 0 kudos

Best practices: Hyperparameter tuning with Hyperopt

Bayesian approaches can be much more efficient than grid search and random search. Hence, with the Hyperopt Tree of Parzen Estimators (TPE) algorithm, you can explore more hyperparameters and larger ...

User16783853501
by Databricks Employee
  • 1279 Views
  • 2 replies
  • 0 kudos

What is the best way to convert a very large Parquet table to Delta, possibly without downtime?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

I vouch for Sajith's answer. The main advantage of "CONVERT TO DELTA" is that the operation is metadata-centric, which means we do not read the full data for the conversion. For any other file format conversion, it's necessary to read the data com...

1 More Replies
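
As a rough illustration of the reply above, here is a minimal sketch of how a CONVERT TO DELTA statement could be assembled and submitted from a notebook. The helper name `build_convert_statement`, the path `/mnt/data/events`, and the partition column `event_date` are assumptions made up for this example and do not come from the thread. The `PARTITIONED BY` clause is only needed when the Parquet directory is partitioned.

```python
from typing import Mapping, Optional


def build_convert_statement(
    parquet_path: str,
    partition_schema: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a CONVERT TO DELTA statement for a Parquet directory.

    `partition_schema` maps partition column names to SQL types and is
    only needed when the Parquet data is partitioned on disk.
    """
    statement = f"CONVERT TO DELTA parquet.`{parquet_path}`"
    if partition_schema:
        columns = ", ".join(
            f"{name} {sql_type}" for name, sql_type in partition_schema.items()
        )
        statement += f" PARTITIONED BY ({columns})"
    return statement


# Hypothetical path and partition column, for illustration only.
sql = build_convert_statement("/mnt/data/events", {"event_date": "DATE"})
print(sql)

# On a cluster where Delta Lake is available, the statement would be run as:
# spark.sql(sql)
```

The `spark.sql` call is left commented out because it needs a live Spark session; the conversion itself happens in place on the existing files.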
User16783853501
by Databricks Employee
  • 1681 Views
  • 0 replies
  • 1 kudos

Databricks Autoloader Best practice

Databricks Autoloader is a popular mechanism for ingesting files from cloud storage into Delta. For a very high-throughput source, what are the best practices to follow when scaling up an Autoloader-based pipeline to the tune of millions ...

User16789201666
by Databricks Employee
  • 1337 Views
  • 0 replies
  • 0 kudos

Hyperopt: how to set up the search space for categorical vs. numerical hyperparameters?

Use hp.quniform (“quantized uniform”) or hp.qloguniform to generate integers. hp.choice is the right choice when, for example, choosing among categorical options (which might in some situations even be integers, but not usually). https://databricks.com/b...
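
A small sketch of what such a search space might look like. The parameter names (`max_depth`, `n_estimators`, `criterion`) and their ranges are assumptions chosen for illustration, not taken from the post. One detail worth knowing: `hp.quniform` and `hp.qloguniform` produce whole-number values as floats (for example `5.0`), so they usually need an explicit cast before being passed to a model.

```python
import math

# Hypothetical parameter names; these are the ones sampled with
# hp.quniform / hp.qloguniform and therefore arrive as floats.
INTEGER_PARAMS = ("max_depth", "n_estimators")


def build_search_space():
    """Build an example Hyperopt search space (imports hyperopt lazily)."""
    from hyperopt import hp

    return {
        # Integer-valued: quantized uniform over 2..10 in steps of 1.
        "max_depth": hp.quniform("max_depth", 2, 10, 1),
        # Integer-valued on a log scale: bounds are given in log space.
        "n_estimators": hp.qloguniform(
            "n_estimators", math.log(10), math.log(1000), 1
        ),
        # Categorical: pick one of a fixed set of options.
        "criterion": hp.choice("criterion", ["gini", "entropy"]),
    }


def to_model_params(sample):
    """Cast the quantized (float) values in a sampled point to int."""
    params = dict(sample)
    for name in INTEGER_PARAMS:
        params[name] = int(params[name])
    return params


# A point shaped like what an objective function would receive.
print(to_model_params({"max_depth": 5.0, "n_estimators": 120.0, "criterion": "gini"}))
```

Note that `fmin` reports `hp.choice` results as an index into the option list; `hyperopt.space_eval` can map that index back to the actual value.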

aladda
by Databricks Employee
  • 37152 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Databricks Employee
  • 1 kudos

Z-Ordering is a technique to colocate related information in the same set of files. Delta Lake on Databricks automatically uses this co-locality in its data-skipping algorithms to dramatically reduce the amount of data that needs to be read. Syntax fo...

1 More Replies
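
The reply above is cut off before its syntax example, so the following is a separate sketch of the general `OPTIMIZE ... ZORDER BY` form, not a reconstruction of what the author wrote. The helper name `build_zorder_statement`, the table `events`, and the columns `event_date` and `event_type` are hypothetical. The optional `WHERE` clause is assumed to filter on a partition column.

```python
from typing import Optional, Sequence


def build_zorder_statement(
    table: str,
    zorder_columns: Sequence[str],
    where: Optional[str] = None,
) -> str:
    """Return an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    if not zorder_columns:
        raise ValueError("at least one Z-order column is required")
    statement = f"OPTIMIZE {table}"
    if where:
        # Restricts the rewrite to a subset of the table.
        statement += f" WHERE {where}"
    statement += f" ZORDER BY ({', '.join(zorder_columns)})"
    return statement


# Hypothetical table and column names, for illustration only.
sql = build_zorder_statement(
    "events", ["event_type"], where="event_date >= '2021-01-01'"
)
print(sql)

# On Databricks the statement would be run as:
# spark.sql(sql)
```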
Srikanth_Gupta_
by Valued Contributor
  • 1288 Views
  • 2 replies
  • 1 kudos

What are best practices for Spark streaming in Databricks?

  • Is it a good idea to consume multiple topics in one streaming job?
  • Is autoscaling recommended for Spark streaming?
  • How many worker nodes should we choose for a streaming job?
  • When should we run OPTIMIZE...

Latest Reply
craig_ng
New Contributor III
  • 1 kudos

See our docs for other considerations when deploying a production streaming job.

1 More Replies
User16752240150
by New Contributor II
  • 1305 Views
  • 1 replies
  • 0 kudos

What's the best way to use hyperopt to train a spark.ml model and track automatically with mlflow?

I've read this article, which covers:
  • Using CrossValidator or TrainValidationSplit to track hyperparameter tuning (no hyperopt). Only random/grid search
  • parallel "single-machine" model training with hyperopt using hyperopt.SparkTrials (not spark.ml)
  • "Di...

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

It's actually pretty simple: use hyperopt, but use "Trials" not "SparkTrials". You get parallelism from Spark, not from the tuning process.
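
One way this advice could look in code, as a sketch only. `train_and_score` is a hypothetical placeholder for whatever fits a spark.ml model with the given parameters and returns a validation metric where higher is better; it is not defined in the thread. The search itself runs on the driver with plain `Trials`, and each spark.ml fit inside `train_and_score` is distributed by Spark.

```python
def make_objective(train_and_score):
    """Wrap a training function into a Hyperopt objective.

    `train_and_score(params)` is assumed to fit a spark.ml model and
    return a metric to maximize; Hyperopt minimizes, so it is negated.
    """
    def objective(params):
        metric = train_and_score(params)
        # "ok" is the value of hyperopt.STATUS_OK.
        return {"loss": -metric, "status": "ok"}

    return objective


def run_search(train_and_score, space, max_evals=16):
    """Run a sequential TPE search with Trials (not SparkTrials)."""
    from hyperopt import Trials, fmin, tpe

    return fmin(
        fn=make_objective(train_and_score),
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=Trials(),
    )
```

MLflow logging is not shown here; any tracking calls would go inside `train_and_score` or the objective.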

User16826994223
by Honored Contributor III
  • 969 Views
  • 1 replies
  • 0 kudos

Z-ordering best practices

What are the best practices around Z-ordering? Should we include as many columns as possible in the Z-order, or is fewer better, and why?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

With Z-order and Hilbert curves, the effectiveness of clustering decreases with each column added, so you'd want to Z-order only the columns that you actually use, so that it speeds up your workloads.
