Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Esteemed Contributor
  • 2059 Views
  • 1 reply
  • 1 kudos

Resolved! How to run commands on the executor

Using %sh, I am able to run commands on the notebook and get output. How can I run a command on the executor and get the output? I want to avoid using the Spark APIs.

Latest Reply
brickster_2018
Esteemed Contributor
  • 1 kudos

It's not possible to use %sh to run commands on the executor. The code below can be used to run a command on each executor and capture the output: var res = sc.runOnEachExecutor[String]({ () => import sys.process._; var cmd_Result = Seq("bash", "-c", "h...

  • 1 kudos
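A rough PySpark sketch of the same idea (the reply's snippet is Scala and references a runOnEachExecutor helper that is not part of the open-source Spark API): run the shell command inside tasks via mapPartitions so it executes on the executors rather than the driver. The hostname command and the oversubscribed partition count are illustrative choices, and Spark does not guarantee that every executor receives a task.

    import subprocess
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def run_cmd(_):
        # Executes on whichever executor processes this partition, not on the driver.
        out = subprocess.run(["bash", "-c", "hostname"], capture_output=True, text=True)
        yield out.stdout.strip()

    # Oversubscribe the partition count to improve the odds of hitting every executor.
    num_tasks = sc.defaultParallelism * 4
    results = sc.parallelize(range(num_tasks), num_tasks).mapPartitions(run_cmd).collect()
    print(set(results))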
brickster_2018
by Esteemed Contributor
  • 3152 Views
  • 1 reply
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Off-heap memory is managed outside the executor JVM. Spark has native support for using off-heap memory; because it is managed by Spark and not controlled by the executor JVM, GC cycles on the executor do not clean up off-heap memory. Databr...

  • 0 kudos
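For reference, off-heap storage in open-source Spark is switched on with the two standard properties shown in this minimal sketch; the 2g size is purely an illustrative value, not a recommendation.

    from pyspark.sql import SparkSession

    # Off-heap memory is allocated by Spark outside the executor JVM heap,
    # so it is not subject to executor GC cycles.
    spark = (
        SparkSession.builder
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "2g")  # illustrative size
        .getOrCreate()
    )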
User16826994223
by Honored Contributor III
  • 749 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

1. We want a venue in which we can rapidly iterate and make new releases. The overhead of making a release as a separate project is minuscule (in the order of minutes). A release on Spark takes a lot longer (in the order of days). 2. Koalas takes a dif...

  • 0 kudos
brickster_2018
by Esteemed Contributor
  • 1628 Views
  • 1 reply
  • 0 kudos

Resolved! Is it recommended to turn on Spark speculative execution permanently

I had a job where the last step would get stuck forever. Turning on Spark speculative execution worked like magic and resolved the issue. Is it safe to turn on Spark speculative execution permanently?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

It's not recommended to turn on Spark speculative execution permanently. For jobs where tasks are running slowly or are stuck because of transient network or storage issues, speculative execution can be very handy. However, it suppresses the actual problem...

  • 0 kudos
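As a sketch of the "only when needed" approach the reply suggests, speculation can be enabled for a single job run at session creation. The property names are the standard Spark ones; the threshold values shown are illustrative.

    from pyspark.sql import SparkSession

    # Enable speculative execution for this job only, rather than cluster-wide.
    spark = (
        SparkSession.builder
        .config("spark.speculation", "true")
        .config("spark.speculation.quantile", "0.75")   # fraction of tasks that must finish before speculating
        .config("spark.speculation.multiplier", "1.5")  # how much slower than the median a task must be
        .getOrCreate()
    )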
brickster_2018
by Esteemed Contributor
  • 950 Views
  • 1 reply
  • 2 kudos

Few things you should not do in Databricks!

Few things you should not do in Databricks!

Latest Reply
brickster_2018
Esteemed Contributor
  • 2 kudos

Compared to OSS Spark, these are a few things users don't have to worry about when running the same job on Databricks. Memory management: Databricks uses an internal formula to allocate the driver and executor heap based on the size of the instance....

  • 2 kudos
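For contrast, a hand-tuned OSS Spark submission typically carries explicit memory sizing like the sketch below; on Databricks these values are derived from the instance type and are normally left unset. All numbers are illustrative.

    from pyspark.sql import SparkSession

    # On OSS Spark/YARN these are chosen by hand; note that driver memory
    # must be set before the driver JVM starts (e.g. via spark-submit --conf).
    spark = (
        SparkSession.builder
        .config("spark.driver.memory", "8g")
        .config("spark.executor.memory", "16g")
        .config("spark.executor.memoryOverhead", "4g")
        .getOrCreate()
    )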
brickster_2018
by Esteemed Contributor
  • 4301 Views
  • 1 reply
  • 2 kudos

Resolved! Databricks Spark vs. Spark on YARN

I am moving my Spark workloads from an EMR/on-premises Spark cluster to Databricks. I understand that Databricks Spark is different from YARN. How is the Databricks architecture different from YARN?

Latest Reply
brickster_2018
Esteemed Contributor
  • 2 kudos

Users often compare a Databricks cluster with a YARN cluster. It's not an apples-to-apples comparison. A Databricks cluster should be compared to a Spark application that is submitted on YARN. A Spark application on YARN will have a driver container and exe...

  • 2 kudos
brickster_2018
by Esteemed Contributor
  • 5499 Views
  • 1 reply
  • 0 kudos

Resolved! Why do I always see "Executor heartbeat timed out" messages in the Spark Driver logs

Often, I see "Executor heartbeat timed out" messages in the Spark driver logs. Sometimes the job fails with this error. Will increasing "spark.executor.heartbeatInterval" help mitigate the issue?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

It is a common misconception that increasing "spark.executor.heartbeatInterval" will help to mitigate or resolve heartbeat issues. In fact, increasing spark.executor.heartbeatInterval will increase the chance of the error and worsen the situ...

  • 0 kudos
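The constraint behind that advice is that the heartbeat interval must stay well below the network timeout. The sketch below shows the two related Spark properties with their default values, for reference rather than as a fix.

    from pyspark.sql import SparkSession

    # spark.executor.heartbeatInterval must be significantly smaller than
    # spark.network.timeout; raising only the former makes lost-executor
    # errors more likely, not less.
    spark = (
        SparkSession.builder
        .config("spark.executor.heartbeatInterval", "10s")  # Spark default
        .config("spark.network.timeout", "120s")            # Spark default
        .getOrCreate()
    )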
User15787040559
by New Contributor III
  • 1187 Views
  • 1 reply
  • 0 kudos

How to translate Apache Pig FOREACH GENERATE statement to Spark?

If you have the following Apache Pig FOREACH GENERATE statement: XBCUD_Y_TMP1 = FOREACH (FILTER XBCUD BY act_ind == 'Y') GENERATE cust_hash_key, CONCAT(brd_abbr_cd, ctry_cd) as brdCtry:chararray, updt_dt_hash_key; the equivalent code in Apache Spark is: XB...

Latest Reply
User15725630784
New Contributor II
  • 0 kudos

The equivalent code in Apache Spark is: XBCUD_Y_TMP1_DF = (XBCUD_DF.filter(col("act_ind") == "Y").select(col("cust_hash_key"), concat(col("brd_abbr_cd"), col("ctry_cd")).alias("brdCtry"), col("updt_dt_hash_key")))

  • 0 kudos
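Spelled out with the imports it relies on, the reply's translation looks roughly like the sketch below; the source table name is a hypothetical stand-in, while the column names come from the original Pig statement.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat

    spark = SparkSession.builder.getOrCreate()
    XBCUD_DF = spark.table("xbcud")  # hypothetical source for the Pig relation XBCUD

    XBCUD_Y_TMP1_DF = (
        XBCUD_DF
        .filter(col("act_ind") == "Y")  # FILTER XBCUD BY act_ind == 'Y'
        .select(
            col("cust_hash_key"),
            concat(col("brd_abbr_cd"), col("ctry_cd")).alias("brdCtry"),  # CONCAT(...) as brdCtry
            col("updt_dt_hash_key"),
        )
    )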
Srikanth_Gupta_
by Valued Contributor
  • 1066 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta cache is an automatic, hands-free solution that leverages the high read speeds of modern SSDs to transparently create copies of remote files in the nodes' local storage to accelerate data reads. In comparison, you have to choose what and when to cache wit...

  • 0 kudos
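A minimal sketch of the contrast, assuming a Delta table at a hypothetical path: the disk (Delta) cache is a cluster/session setting that needs no code changes, while the Spark cache is invoked and released explicitly.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Disk (Delta) cache: once enabled, reads of remote files are cached on the
    # nodes' local SSDs transparently.
    spark.conf.set("spark.databricks.io.cache.enabled", "true")
    df = spark.read.format("delta").load("/mnt/data/events")  # hypothetical path

    # Spark cache: you decide what to cache and when to release it, and the data
    # is only materialised after an action runs.
    df.cache()
    df.count()      # triggers the actual caching
    df.unpersist()  # must be released manually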
Joseph_B
by New Contributor III
  • 1797 Views
  • 1 reply
  • 0 kudos
Latest Reply
Joseph_B
New Contributor III
  • 0 kudos

You can implement custom algorithms for GraphFrames using either the Scala/Java or Python APIs. GraphFrames provides some structures to simplify writing graph algorithms; the three primary options are as follows, with the best option first: Pregel: This i...

  • 0 kudos
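Besides Pregel, GraphFrames also exposes aggregateMessages as a building block for custom algorithms. Below is a minimal sketch of one message-passing round, assuming a GraphFrame g whose vertices carry a numeric "age" column (as in the GraphFrames user guide examples).

    from graphframes.lib import AggregateMessages as AM
    from pyspark.sql import functions as F

    # One round of message passing: every vertex receives its neighbours' "age"
    # values and sums the incoming messages.
    agg = g.aggregateMessages(
        F.sum(AM.msg).alias("summed_neighbour_ages"),
        sendToSrc=AM.dst["age"],  # each edge sends the destination's age to its source
        sendToDst=AM.src["age"],  # and the source's age to its destination
    )
    agg.show()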
dheeraj
by New Contributor II
  • 5004 Views
  • 3 replies
  • 0 kudos

How to calculate the percentile of a column in a DataFrame in Spark?

I am trying to calculate the percentile of a column in a DataFrame. I can't find any percentile_approx function among Spark's aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select per...

Latest Reply
amandaphy
New Contributor II
  • 0 kudos

You can try using: df.registerTempTable("tmp_tbl"); val newDF = sql(/* do something with tmp_tbl */) // and continue using newDF

  • 0 kudos
2 More Replies
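In current Spark versions the temp-table detour is unnecessary: percentile_approx is available as a SQL expression, and DataFrame.approxQuantile covers the same need from the DataFrame API. The "amount" column and toy data below are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "amount")  # toy data

    # Same function Hive exposes, used through a SQL expression:
    p50 = df.select(F.expr("percentile_approx(amount, 0.5)").alias("p50")).first()["p50"]

    # DataFrame API alternative; relativeError trades accuracy for speed.
    p50, p90 = df.approxQuantile("amount", [0.5, 0.9], relativeError=0.01)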