Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Anonymous
by Not applicable
  • 2320 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

In this scenario, the best option would be a single readStream reading from a source Delta table. Because checkpoint logs are controlled on the write side when writing to Delta tables, you would be able to maintain a separate log for each of your writeStreams. I would...

User16826994223
by Honored Contributor III
  • 1157 Views
  • 1 replies
  • 0 kudos

Major changes in Spark 3.0

What are the major changes released in Spark 3.0?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Check out https://spark.apache.org/docs/latest/sql-migration-guide.html if you're looking for potentially breaking changes you need to be aware of, for any version. For a general overview of the new features, see https://databricks.com/blog/2020/06/18...

User16857281869
by New Contributor II
  • 1513 Views
  • 1 replies
  • 0 kudos

How do I benefit from parallelisation when doing machine learning?

There are in principle four distinct ways of using parallelisation when doing machine learning. Any combination of these can speed up the whole pipeline significantly.
1) Using Spark distributed processing in feature engineering
2) When the data set...

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

Good summary! Yes, those are the main strategies I can think of.

User16826992666
by Valued Contributor
  • 2247 Views
  • 2 replies
  • 0 kudos
Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

You do not have to cache anything to make it work. You would decide that based on whether you want to spend memory/storage to avoid recomputing the DataFrame, like when you may use it in multiple operations afterwards.

1 More Replies
User16826992666
by Valued Contributor
  • 4762 Views
  • 1 replies
  • 0 kudos

Resolved! What's the difference between SparkML and Spark MLlib?

I have heard people talk about SparkML, but the documentation talks about MLlib. I don't understand the difference. Could anyone help me understand this?

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

They're not really different. Before DataFrames in Spark, older implementations of ML algorithms were built on the RDD API. This is generally called "Spark MLlib". After DataFrames, some newer implementations were added as wrappers on top of the old ones ...

sajith_appukutt
by Honored Contributor II
  • 4751 Views
  • 1 replies
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

You could set up dnsmasq to configure routing between your Databricks workspace and your on-premises network. More details here

sajith_appukutt
by Honored Contributor II
  • 1979 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Databricks allows network customizations / hardening from a security point of view to reduce risks like data exfiltration. For more details, see:
Data Exfiltration Protection With Databricks on AWS
Data Exfiltration Protection with Azure Databricks

User16826994223
by Honored Contributor III
  • 1897 Views
  • 1 replies
  • 0 kudos

Z ordering best practices

What are the best practices around Z-ordering? Should we include as many columns as possible in the Z-order, or is fewer better, and why?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

With Z-order and Hilbert curves, the effectiveness of clustering decreases with each column added, so you'd want to Z-order only on the columns that you actually use, so that it speeds up your workloads.
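
One way to see the effect is a toy model in plain Python (not Spark, and not Delta Lake's actual implementation). It sorts a small grid of rows by an interleaved-bit (Morton) key, splits the sorted rows into equal-sized "files", and counts how many files contain a given value of the first column. The grid size, file count, and function names are assumptions chosen for illustration.

```python
from itertools import product


def interleave_bits(values, bits):
    """Morton (Z-order) key: interleave the low `bits` bits of each value."""
    z = 0
    num_dims = len(values)
    for bit in range(bits):
        for dim, v in enumerate(values):
            # Bit `bit` of column `dim` lands at position bit * num_dims + dim.
            z |= ((v >> bit) & 1) << (bit * num_dims + dim)
    return z


def files_touched(num_cols, bits=4, num_files=16, target=5):
    """Count files holding at least one row where column 0 == target.

    Rows are every combination of values in [0, 2**bits) over num_cols
    columns, sorted by their Z-order key and cut into equal-sized files.
    """
    rows = sorted(
        product(range(2 ** bits), repeat=num_cols),
        key=lambda row: interleave_bits(row, bits),
    )
    rows_per_file = len(rows) // num_files
    touched = set()
    for index, row in enumerate(rows):
        if row[0] == target:
            touched.add(index // rows_per_file)
    return len(touched)
```

Under these assumptions, a filter on the first column has to read 1 of the 16 files when the data is ordered on that column alone (`files_touched(1)`), 4 files when ordered on two columns, and 8 files when ordered on three. Each added column leaves fewer key bits for the others, so fewer files can be skipped.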

Srikanth_Gupta_
by Databricks Employee
  • 1380 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

coalesce avoids a full shuffle and could be used to decrease the number of partitions.
repartition results in a full shuffle and could be used to increase or decrease the number of partitions.
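
The difference can be sketched with a toy model in plain Python, where a dataset is a list of partitions and each partition is a list of rows. This is not Spark code: the function names mirror the Spark methods only for readability, and the grouping rules (contiguous merge, round-robin redistribution) are simplifying assumptions.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups without moving rows one by one."""
    count = len(partitions)
    if n >= count:
        # Without a shuffle the partition count cannot grow.
        return [list(part) for part in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are glued together as whole units.
        merged[i * n // count].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row round-robin across n new partitions."""
    out = [[] for _ in range(n)]
    all_rows = (row for part in partitions for row in part)
    for i, row in enumerate(all_rows):
        # Every row is reassigned individually: the "full shuffle".
        out[i % n].append(row)
    return out
```

In this model, coalescing `[[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]` to 2 partitions keeps the original skew (sizes 4 and 6), while repartitioning to 2 evens the sizes out (5 and 5) at the cost of touching every row.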

User16776430979
by New Contributor III
  • 2832 Views
  • 1 replies
  • 0 kudos

Repos branch control – how can we configure a job to run a specific branch?

For example, how can we ensure our jobs always run off the main/master branch?

Latest Reply
User16781336501
Databricks Employee
  • 0 kudos

We recommend having a top-level folder to run jobs against. The best practice is detailed here: https://docs.databricks.com/repos.html#best-practices-for-integrating-repos-with-cicd-workflows

User16830818469
by New Contributor
  • 2093 Views
  • 2 replies
  • 0 kudos

Repos integration

Does Repos work with on-prem/enterprise Bitbucket?

Latest Reply
User16781336501
Databricks Employee
  • 0 kudos

If you have a private Git server (e.g. behind VPN, IP whitelist), you will need to be enrolled in the Git proxy private preview to use Repos. Please contact your account team.

1 More Replies
User16826994223
by Honored Contributor III
  • 1066 Views
  • 1 replies
  • 0 kudos

Timestamp changes in Spark SQL

Hi team, is there a way to convert the current timestamp from the current time zone to a different time zone?

Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

import sqlContext.implicits._
import org.apache.spark.sql.functions._
inputDF.select(
  unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
  from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "UTC").alias("UTC"),
  from_utc_tim...
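
The reply above is cut off, and the following does not attempt to complete it. As a separate illustration of the underlying idea (interpret a timestamp as UTC, then render it in another zone), here is a minimal plain-Python sketch. The helper name `from_utc` and the use of fixed hour offsets are assumptions made to keep the example self-contained; named zones with daylight-saving rules would need Python's `zoneinfo` module instead.

```python
from datetime import datetime, timedelta, timezone


def from_utc(ts_utc: datetime, offset_hours: float) -> datetime:
    """Treat a naive timestamp as UTC and convert it to a fixed-offset zone.

    `offset_hours` is hypothetical input for this sketch, e.g. 5.5 for
    UTC+05:30 or -8 for UTC-08:00. Daylight saving time is not modelled.
    """
    target_zone = timezone(timedelta(hours=offset_hours))
    # Attach UTC to the naive value, then shift the wall-clock time.
    return ts_utc.replace(tzinfo=timezone.utc).astimezone(target_zone)


noon_utc = datetime(2021, 6, 1, 12, 0)
in_india = from_utc(noon_utc, 5.5)   # wall clock moves forward by 5h30m
in_pacific = from_utc(noon_utc, -8)  # wall clock moves back by 8h
```

Both results refer to the same instant as the input; only the wall-clock representation and the attached offset change.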

