Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826992666
by Valued Contributor
  • 3908 Views
  • 1 replies
  • 0 kudos

Resolved! What's the difference between SparkML and Spark MLlib?

I have heard people talk about SparkML, but the documentation talks about MLlib. I don't understand the difference. Could anyone help me understand this?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

They're not really different. Before DataFrames in Spark, older implementations of ML algorithms were built on the RDD API. This is generally called "Spark MLlib". After DataFrames, some newer implementations were added as wrappers on top of the old ones ...

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 3979 Views
  • 1 replies
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

You could set up dnsmasq to configure routing between your Databricks workspace and your on-premise network. More details here

  • 1 kudos
sajith_appukutt
by Honored Contributor II
  • 1608 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Databricks allows network customizations and hardening from a security point of view to reduce risks such as data exfiltration. For more details, see:
  • Data Exfiltration Protection With Databricks on AWS
  • Data Exfiltration Protection with Azure Databricks

  • 0 kudos
User16826994223
by Honored Contributor III
  • 1044 Views
  • 1 replies
  • 0 kudos

Z ordering best practices

What are the best practices around Z-ordering? Should we include as many columns as possible in the Z-order, or is fewer better, and why?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

With Z-order and Hilbert curves, the effectiveness of clustering decreases with each column added, so you'd want to Z-order only the columns that you actually use, so that it speeds up your workloads.
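The trade-off described in this reply can be illustrated with a small, self-contained Python sketch. It is a toy model and not how Delta Lake implements Z-ordering: the helper names (`morton_key`, `split_into_files`, `files_scanned`), the 4x4 grid, and the 4-rows-per-file layout are all assumptions made for illustration. It counts how many "files" a min/max-based filter would have to read when the data is sorted on one column versus Z-ordered on two.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def morton_key(x: int, y: int, bits: int = 2) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


def split_into_files(rows: List[Point], rows_per_file: int) -> List[List[Point]]:
    """Cut an ordered list of rows into fixed-size chunks standing in for data files."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_scanned(files: List[List[Point]], column: int, value: int) -> int:
    """Count files whose min/max range on `column` could contain `value`."""
    count = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Hypothetical data set: every (x, y) pair on a 4x4 grid, stored 4 rows per file.
grid = [(x, y) for x in range(4) for y in range(4)]
sorted_by_x = split_into_files(sorted(grid, key=lambda p: p[0]), 4)
z_ordered = split_into_files(sorted(grid, key=lambda p: morton_key(*p)), 4)

print("x == 0, sorted by x:", files_scanned(sorted_by_x, 0, 0))
print("y == 0, sorted by x:", files_scanned(sorted_by_x, 1, 0))
print("x == 0, Z-ordered  :", files_scanned(z_ordered, 0, 0))
print("y == 0, Z-ordered  :", files_scanned(z_ordered, 1, 0))
```

In this toy layout, adding `y` to the ordering helps filters on `y` (4 files down to 2) but makes filters on `x` less selective (1 file up to 2), which is the reason to include only the columns that queries actually filter on.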

  • 0 kudos
Srikanth_Gupta_
by Databricks Employee
  • 1125 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

coalesce avoids a full shuffle and could be used to decrease the number of partitions.
repartition results in a full shuffle and could be used to increase or decrease the number of partitions.
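The difference can be modelled in plain Python. This is only a conceptual sketch, not Spark's actual partition-assignment logic: the `coalesce` and `repartition` functions below are hypothetical stand-ins that operate on lists of lists, and the round-robin redistribution is an assumed simplification of a full shuffle.

```python
from typing import List

Partitions = List[List[int]]


def coalesce(partitions: Partitions, n: int) -> Partitions:
    """Merge whole partitions into n groups; no existing partition is split."""
    if n >= len(partitions):
        # Like Spark's coalesce, asking for more partitions keeps the current count.
        return [list(p) for p in partitions]
    merged: Partitions = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: Partitions, n: int) -> Partitions:
    """Redistribute every record across n new partitions (models a full shuffle)."""
    out: Partitions = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # neighbouring partitions are glued together
print(repartition(parts, 2))  # every record may move to a new partition
print(len(coalesce(parts, 6)), len(repartition(parts, 6)))
```

In the model, `coalesce` only concatenates existing partitions, so records that started together stay together, while `repartition` touches every record and is the only one of the two that can raise the partition count.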

  • 0 kudos
User16776430979
by New Contributor III
  • 2462 Views
  • 1 replies
  • 0 kudos

Repos branch control – how can we configure a job to run a specific branch?

For example, how can we ensure our jobs always run off the main/master branch?

Latest Reply
User16781336501
Databricks Employee
  • 0 kudos

We recommend having a top level folder to run jobs against. Best practice detailed here: https://docs.databricks.com/repos.html#best-practices-for-integrating-repos-with-cicd-workflows

  • 0 kudos
User16830818469
by New Contributor
  • 1788 Views
  • 2 replies
  • 0 kudos

Repos integration

Does Repos work with on-prem/enterprise Bitbucket?

Latest Reply
User16781336501
Databricks Employee
  • 0 kudos

If you have a private Git server (e.g. behind a VPN or an IP whitelist), you will need to be enrolled in the Git proxy private preview to use Repos. Please contact your account team.

  • 0 kudos
1 More Replies
User16826994223
by Honored Contributor III
  • 920 Views
  • 1 replies
  • 0 kudos

Timestamp changes in Spark SQL

Hi Team, is there a way to change the current timestamp from the current time zone to a different time zone?

Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes
inputDF.select(
  unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
  from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "UTC").alias("UTC"),
  from_utc_tim...
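Since the snippet above is cut off, here is a small plain-Python sketch of the idea behind `from_utc_timestamp`: treat a zone-less timestamp as UTC and render it as wall-clock time in another zone. It does not use Spark. The `from_utc` helper and the sample timestamp are hypothetical, and fixed UTC offsets are an assumption made to keep the sketch dependency-free (they ignore daylight saving time, which named zones such as "America/Los_Angeles" handle).

```python
from datetime import datetime, timedelta, timezone


def from_utc(naive_utc: datetime, offset: timedelta) -> datetime:
    """Interpret a zone-less timestamp as UTC and return the local wall-clock time."""
    aware = naive_utc.replace(tzinfo=timezone.utc)          # attach UTC
    local = aware.astimezone(timezone(offset))              # shift to the target offset
    return local.replace(tzinfo=None)                       # drop the zone again


ts = datetime(2021, 6, 1, 12, 0, 0)                         # hypothetical UTC timestamp
india = from_utc(ts, timedelta(hours=5, minutes=30))        # UTC+05:30
pacific_std = from_utc(ts, timedelta(hours=-8))             # UTC-08:00, no DST applied

print(india)
print(pacific_std)
```

In Spark itself the second argument would be a time zone name or offset string passed to `from_utc_timestamp`, as in the reply above.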

  • 0 kudos
User16826994223
by Honored Contributor III
  • 1093 Views
  • 1 replies
  • 0 kudos

I understand Spark Streaming uses micro-batching. Does this increase latency?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

While Spark does use a micro-batch execution model, this does not have much impact on applications, because the batches can be as short as 0.5 seconds. In most applications of streaming big data, the analytics is done over a larger window (say 10 min...

  • 0 kudos
User16826994223
by Honored Contributor III
  • 580 Views
  • 0 replies
  • 0 kudos

Why Unity Catalog?

Fine-grained permissions: Unity Catalog can enforce permissions for data at the row, column or view level instead of the file level, so that you can always share just part of your data with a new user without copying it.
An open, ...

User16826994223
by Honored Contributor III
  • 1727 Views
  • 1 replies
  • 0 kudos

Resolved! What are the join hints available in Spark 3.0, and how do they help compared to previous Spark versions?

Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

There are 4 types of join hints in Spark 3.0:
  • BROADCAST
  • MERGE
  • SHUFFLE_HASH
  • SHUFFLE_REPLICATE_NL
It may be a good idea to enable Adaptive Query Execution, which speeds up Spark SQL joins at run time.
In Spark 3.0, Adaptive Query Execution comes with the features below:
Dynamica...

  • 0 kudos
User16826994223
by Honored Contributor III
  • 1514 Views
  • 1 replies
  • 0 kudos

How is the Photon engine different from the Catalyst optimizer?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

I got this question from some customers and I want to clarify here too.
I think we are conflating two things:
The Catalyst optimizer is about coming up with the "steps to take to execute the query". For example, the optimizer will decide how and when to do the join...

  • 0 kudos
