<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Making HTTP post requests on Spark using foreachPartition in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/making-http-post-requests-on-spark-using-foreachpartition/m-p/27662#M19523</link>
    <description>&lt;P&gt;Following for answers; I have a similar doubt.&lt;/P&gt;</description>
    <pubDate>Sun, 27 Oct 2019 22:14:08 GMT</pubDate>
    <dc:creator>melo08</dc:creator>
    <dc:date>2019-10-27T22:14:08Z</dc:date>
    <item>
      <title>Making HTTP post requests on Spark using foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/making-http-post-requests-on-spark-using-foreachpartition/m-p/27661#M19522</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt; I need some help understanding the behaviour of the code below in Spark (using Scala on Databricks).&lt;/P&gt;
&lt;P&gt; I have a dataframe (read from S3, if that matters) and want to send its data via HTTP POST requests in batches of at most 1000 records. So I repartitioned the dataframe to ensure each partition holds no more than 1000 records, and I also created a JSON column for each row (so later I only need to wrap the rows in an array).&lt;/P&gt;
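The batching step described above can be sketched in plain Scala, independent of Spark: group the pre-serialized JSON rows into chunks of at most 1000 and wrap each chunk in a JSON array envelope. Note that `buildPayload`, `batchSize`, and the sample rows below are illustrative names, not taken from the actual job.

```scala
// Sketch of the batching step only (no Spark, no HTTP).
// buildPayload and batchSize are illustrative names.
def buildPayload(rows: Seq[String]): String =
  """{"people": [""" + rows.mkString(",") + "]}"

val batchSize = 1000
val rows = (1 to 2500).map("""{"id":""" + _ + "}")

// grouped() yields chunks of at most batchSize elements: 1000 + 1000 + 500 here
val payloads = rows.grouped(batchSize).map(buildPayload).toList
```

In the Spark job itself, `repartition` plus `foreachPartition` plays the role of `grouped` here, with each partition becoming one payload.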
&lt;P&gt; The trouble is in making the requests. I created a serializable object with the following code:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import org.apache.spark.sql.{DataFrame, Row}
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.HttpHeaders
import org.apache.http.entity.StringEntity
import org.apache.commons.io.IOUtils
object postObject extends Serializable {
  val client = HttpClientBuilder.create().build()
  val post = new HttpPost("https://my-cool-api-endpoint")
  post.addHeader(HttpHeaders.CONTENT_TYPE, "application/json")
  def makeHttpCall(row: Iterator[Row]) = {
      val json_str = """{"people": [""" + row.toSeq.map(x =&amp;gt; x.getAs[String]("json")).mkString(",") + "]}"      
      post.setEntity(new StringEntity(json_str))
      val response = client.execute(post)
      val entity = response.getEntity()
      println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
      println(IOUtils.toString(entity.getContent()))
  }
}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt; Now when I try the following:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;postObject.makeHttpCall(data.head(2).toIterator)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt; It works like a charm. The requests go through, there is some output on the screen, and my API gets that data.&lt;/P&gt;
&lt;P&gt; But when I try to call it from foreachPartition:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;data.foreachPartition { x =&amp;gt; 
  postObject.makeHttpCall(x)
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;
    Nothing happens: no output on screen, and nothing arrives at my API. If I rerun it, almost all stages are simply skipped. I believe that, for some reason, Spark is only lazily evaluating my requests and never actually performing them. I don't understand why, or how to force execution.&lt;/P&gt;
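Two things commonly explain the silence here. First, `foreachPartition` is an action, so it does run eagerly; but `println` inside it executes on the executors, so its output lands in the executor logs (visible via the Spark UI), not in the notebook. Second, a client built once in a serialized object can fail or behave unexpectedly when the object is re-initialized on each executor, so if nothing reaches the API at all, check the executor logs for exceptions. The usual pattern is to create the client inside the partition function. Below is a Spark-free sketch of that control flow, with partitions modeled as plain iterators and the HTTP client replaced by a stub (`FakeClient`, `sendPartition`, and the sample rows are hypothetical):

```scala
import scala.collection.mutable.ArrayBuffer

// Stand-in for an HTTP client; records payloads instead of sending them.
final class FakeClient {
  val sent = ArrayBuffer.empty[String]
  def post(payload: String): Int = { sent += payload; 200 }
}

val received = ArrayBuffer.empty[String]

// Mirrors the body you'd pass to foreachPartition: the client is created
// inside the function, once per partition, so nothing non-serializable
// has to be shipped from the driver.
def sendPartition(rows: Iterator[String]): Unit = {
  val client = new FakeClient()
  val payload = """{"people": [""" + rows.mkString(",") + "]}"
  client.post(payload)
  received ++= client.sent
}

// Two "partitions" of pre-serialized JSON rows, as foreachPartition would supply them:
val partitions = Seq(Iterator("""{"id":1}""", """{"id":2}"""), Iterator("""{"id":3}"""))
partitions.foreach(sendPartition)
```

In the real job the same shape applies: data.foreachPartition { rows => val client = HttpClientBuilder.create().build(); ... ; client.close() }, where closing the client per partition also avoids leaking connections.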
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Oct 2019 18:10:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/making-http-post-requests-on-spark-using-foreachpartition/m-p/27661#M19522</guid>
      <dc:creator>CaioIshizaka_Co</dc:creator>
      <dc:date>2019-10-23T18:10:46Z</dc:date>
    </item>
    <item>
      <title>Re: Making HTTP post requests on Spark using foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/making-http-post-requests-on-spark-using-foreachpartition/m-p/27662#M19523</link>
      <description>&lt;P&gt;Following for answers; I have a similar doubt.&lt;/P&gt;</description>
      <pubDate>Sun, 27 Oct 2019 22:14:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/making-http-post-requests-on-spark-using-foreachpartition/m-p/27662#M19523</guid>
      <dc:creator>melo08</dc:creator>
      <dc:date>2019-10-27T22:14:08Z</dc:date>
    </item>
  </channel>
</rss>

