Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826994223
by Honored Contributor III
  • 1099 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

1. We want a venue in which we can rapidly iterate and make new releases. The overhead of making a release as a separate project is minuscule (on the order of minutes); a release on Spark takes a lot longer (on the order of days).
2. Koalas takes a dif...

brickster_2018
by Databricks Employee
  • 2677 Views
  • 1 reply
  • 0 kudos

Resolved! Is it recommended to turn on Spark speculative execution permanently?

I had a job where the last step would get stuck forever. Turning on Spark speculative execution did magic and resolved the issue. Is it safe to turn on Spark speculative execution permanently?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

It's not recommended to turn on Spark speculative execution permanently. For jobs where tasks are running slow or stuck because of transient network or storage issues, speculative execution can be very handy. However, it suppresses the actual problem...
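
For reference, a minimal sketch of scoping speculation to a single job rather than enabling it cluster-wide; the config keys are standard Spark settings, while the app name and values here are illustrative:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("speculative-job")                     # hypothetical job name
    .config("spark.speculation", "true")            # re-launch suspiciously slow tasks
    .config("spark.speculation.multiplier", "1.5")  # "slow" means 1.5x the median task duration
    .config("spark.speculation.quantile", "0.75")   # check only after 75% of tasks have finished
    .getOrCreate())

Scoping it this way gets the unblocking effect for the one problematic job without masking slow tasks everywhere else.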

brickster_2018
by Databricks Employee
  • 1497 Views
  • 1 reply
  • 2 kudos

A few things you should not do in Databricks!


Latest Reply
brickster_2018
Databricks Employee
  • 2 kudos

Compared to OSS Spark, these are a few things users don't have to worry about when running the same job on Databricks. Memory management: Databricks uses an internal formula to allocate the driver and executor heap based on the size of the instance....

brickster_2018
by Databricks Employee
  • 9290 Views
  • 1 reply
  • 0 kudos

Resolved! Why do I always see "Executor heartbeat timed out" messages in the Spark driver logs?

Often, I see "Executor heartbeat timed out" messages in the Spark driver logs. Sometimes the job fails with this error. Will increasing "spark.executor.heartbeatInterval" help to mitigate the issue?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

It is a common misconception that increasing "spark.executor.heartbeatInterval" will help to mitigate or resolve heartbeat issues. In fact, increasing spark.executor.heartbeatInterval will increase the chance of the error and worsen the situ...
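
For context (standard Spark behavior, not taken from the truncated reply): the driver marks an executor as lost when no heartbeat arrives within spark.network.timeout, so the interval must stay well below that timeout. A minimal sketch of the usual tuning direction, with illustrative values:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.heartbeatInterval", "10s")  # the default; raising it only delays detection
    .config("spark.network.timeout", "300s")            # raise this instead; keep it well above the interval
    .getOrCreate())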

User15787040559
by Databricks Employee
  • 1742 Views
  • 1 reply
  • 0 kudos

How to translate Apache Pig FOREACH GENERATE statement to Spark?

If you have the following Apache Pig FOREACH GENERATE statement:

XBCUD_Y_TMP1 = FOREACH (FILTER XBCUD BY act_ind == 'Y') GENERATE cust_hash_key, CONCAT(brd_abbr_cd, ctry_cd) as brdCtry:chararray, updt_dt_hash_key;

the equivalent code in Apache Spark is: XB...

Latest Reply
User15725630784
Databricks Employee
  • 0 kudos

the equivalent code in Apache Spark is:

XBCUD_Y_TMP1_DF = (XBCUD_DF
    .filter(col("act_ind") == "Y")
    .select(col("cust_hash_key"),
            concat(col("brd_abbr_cd"), col("ctry_cd")).alias("brdCtry"),
            col("updt_dt_hash_key")))
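
For completeness, a self-contained sketch of the same translation on invented rows (the column names follow the thread; the sample data is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat

spark = SparkSession.builder.appName("pig-to-spark").getOrCreate()

XBCUD_DF = spark.createDataFrame(
    [("k1", "AB", "US", "h1", "Y"),
     ("k2", "CD", "CA", "h2", "N")],
    ["cust_hash_key", "brd_abbr_cd", "ctry_cd", "updt_dt_hash_key", "act_ind"])

XBCUD_Y_TMP1_DF = (XBCUD_DF
    .filter(col("act_ind") == "Y")          # FILTER XBCUD BY act_ind == 'Y'
    .select(col("cust_hash_key"),           # GENERATE cust_hash_key, ...
            concat(col("brd_abbr_cd"), col("ctry_cd")).alias("brdCtry"),
            col("updt_dt_hash_key")))

XBCUD_Y_TMP1_DF.show()  # one row survives: k1, "ABUS", h1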

Srikanth_Gupta_
by Databricks Employee
  • 1823 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta cache is an automatic, hands-free solution that leverages the high read speeds of modern SSDs to transparently create copies of remote files in nodes’ local storage to accelerate data reads. In comparison, you have to choose what and when to cache wit...
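
A minimal sketch of that contrast, assuming a Databricks runtime (the cache toggle is Databricks-specific; the table path is hypothetical):

# Delta cache: switched on once, then works automatically for remote reads
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Spark cache: you explicitly choose what to cache and when
df = spark.read.format("delta").load("/mnt/example/table")  # hypothetical path
df.cache()   # marks the DataFrame for caching...
df.count()   # ...but it only materializes once an action runs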

Joseph_B
by Databricks Employee
  • 2713 Views
  • 1 reply
  • 0 kudos
Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

You can implement custom algorithms for GraphFrames using either the Scala/Java or Python APIs. GraphFrames provides some structures to simplify writing graph algorithms; the three primary options are as follows, with the best option first: Pregel: This i...
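
To make the Pregel option concrete, here is a sketch of PageRank on a toy four-node cycle, adapted from the pattern in the GraphFrames Pregel documentation (assumes GraphFrames 0.8+; the graph, constants, and names are illustrative):

from graphframes import GraphFrame
from graphframes.lib import Pregel
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, lit, sum as sql_sum

spark = SparkSession.builder.appName("pregel-sketch").getOrCreate()

v = spark.createDataFrame([(i,) for i in range(4)], ["id"])
e = spark.createDataFrame([(0, 1), (1, 2), (2, 3), (3, 0)], ["src", "dst"])
g = GraphFrame(GraphFrame(v, e).outDegrees, e)  # vertices need an outDegree column

alpha, n = 0.15, 4.0
ranks = (g.pregel
    .setMaxIter(5)
    .withVertexColumn("rank", lit(1.0 / n),                      # initial rank
        coalesce(Pregel.msg(), lit(0.0)) * lit(1.0 - alpha) + lit(alpha / n))  # update after each round
    .sendMsgToDst(Pregel.src("rank") / Pregel.src("outDegree"))  # message sent along every edge
    .aggMsgs(sql_sum(Pregel.msg()))                              # combine messages arriving at a vertex
    .run())
ranks.show()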

dheeraj
by New Contributor II
  • 6229 Views
  • 3 replies
  • 0 kudos

How to calculate the percentile of a column in a DataFrame in Spark?

I am trying to calculate the percentile of a column in a DataFrame. I can't find any percentile_approx function among Spark's aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select per...

Latest Reply
amandaphy
New Contributor II
  • 0 kudos

You can try using:

df.registerTempTable("tmp_tbl")
val newDF = sql(/* do something with tmp_tbl */)
// and continue using newDF
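
The question predates built-in support; newer Spark versions expose this directly. A hedged sketch of two more direct routes, assuming Spark 2.0+ for approxQuantile and Spark 3.1+ for the percentile_approx function:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("percentiles").getOrCreate()
df = spark.range(100).withColumnRenamed("id", "val")  # toy data: 0..99

# DataFrame method (Spark 2.0+): returns plain Python floats
p50 = df.approxQuantile("val", [0.5], 0.01)[0]

# Aggregate function (Spark 3.1+): usable inside select/agg, or in SQL as in the question
df.agg(F.percentile_approx("val", 0.5).alias("p50")).show()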

2 More Replies