<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Unable to write csv files to Azure BLOB using pandas to_csv() in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13857#M8444</link>
    <description>&lt;P&gt;Remove the write from the foreach. Instead, build a dataframe, return that dataframe, and write it only once.&lt;/P&gt;&lt;P&gt;Right now you do a write in each iteration.&lt;/P&gt;</description>
    <pubDate>Mon, 11 Oct 2021 15:03:15 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-10-11T15:03:15Z</dc:date>
    <item>
      <title>Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13854#M8441</link>
      <description>&lt;P&gt;I am using a Python function to read some data from a GET endpoint and write it as a CSV file to an Azure Blob location.&lt;/P&gt;&lt;P&gt;My GET endpoint takes two query parameters, param1 and param2. So initially I have a dataframe paramDf that has two columns, param1 and param2.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;param1   param2
12        25
45        95

Schema:    paramDF:pyspark.sql.dataframe.DataFrame
           param1:string
           param2:string&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Now I write a function as below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def executeRestApi(w):
    dlist = []
    try:
        response = requests.get(DataUrl.format(token=TOKEN, oid=w.param1, wid=w.param2))
        if response.status_code == 200:
            metrics = response.json()['data']['metrics']
            dic = {}
            dic['metric1'] = metrics['metric1']
            dic['metric2'] = metrics['metric2']
            dlist.append(dic)
        pandas.DataFrame(dlist).to_csv(
            "../../dbfs/mnt/raw/Important/MetricData/listofmetrics_{}_{}.csv".format(w.param1, w.param2),
            header=True, index=False)
        return "Success"
    except Exception as e:
        return "Failure"&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Finally, I invoke the method as:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;paramDf.foreach(executeRestApi)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;So, theoretically, the function executeRestApi should be executed for each row in the dataframe, and within the function I extract the required data and write it to an ADLS location as a CSV file.&lt;/P&gt;&lt;P&gt;All works fine, except that the file is never written when I execute the foreach command on a multi-node cluster.&lt;/P&gt;&lt;P&gt;However, the same operation &lt;B&gt;works well on a single-node cluster&lt;/B&gt;. I am unable to figure out the difference between the two approaches.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What could I be doing wrong here?&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 08:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13854#M8441</guid>
      <dc:creator>halfwind22</dc:creator>
      <dc:date>2021-10-11T08:42:37Z</dc:date>
    </item>
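The fix suggested in the replies can be sketched driver-side in plain pandas: accumulate one dict per row and write a single CSV at the end, instead of writing inside the loop. The `fetch` callable and the stub payload below are hypothetical stand-ins for the real `requests.get(DataUrl...)` call, so the sketch runs without a network.

```python
import pandas as pd

def fetch_metrics(fetch, params):
    """Call the endpoint once per (param1, param2) pair and collect the
    results in a list; nothing is written until the very end, so there
    is a single write instead of one per row."""
    rows = []
    for param1, param2 in params:
        payload = fetch(param1, param2)  # stand-in for requests.get(...).json()
        metrics = payload["data"]["metrics"]
        rows.append({
            "param1": param1,
            "param2": param2,
            "metric1": metrics["metric1"],
            "metric2": metrics["metric2"],
        })
    return pd.DataFrame(rows)

# Stubbed endpoint so the sketch is self-contained (assumption: the real
# response shape matches the question's response.json()['data']['metrics']).
def stub_fetch(param1, param2):
    return {"data": {"metrics": {"metric1": int(param1), "metric2": int(param2)}}}

df = fetch_metrics(stub_fetch, [("12", "25"), ("45", "95")])
# One write at the end, e.g.: df.to_csv("/dbfs/mnt/raw/.../listofmetrics.csv", index=False)
```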
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13856#M8443</link>
      <description>&lt;P&gt;@Kaniz Fatma&amp;nbsp;Thanks, eagerly awaiting your response.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 14:54:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13856#M8443</guid>
      <dc:creator>halfwind22</dc:creator>
      <dc:date>2021-10-11T14:54:22Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13857#M8444</link>
      <description>&lt;P&gt;Remove the write from the foreach. Instead, build a dataframe, return that dataframe, and write it only once.&lt;/P&gt;&lt;P&gt;Right now you do a write in each iteration.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 15:03:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13857#M8444</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-11T15:03:15Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13859#M8446</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;But foreach doesn't return anything, right?&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 16:44:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13859#M8446</guid>
      <dc:creator>halfwind22</dc:creator>
      <dc:date>2021-10-11T16:44:05Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13860#M8447</link>
      <description>&lt;P&gt;foreach itself indeed does not return anything.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now, to solve your issue in a distributed manner, we need to think in a slightly different way.&lt;/P&gt;&lt;P&gt;Looping probably is not the best way.&lt;/P&gt;&lt;P&gt;What you are trying to do is, starting from a DataFrame, call a REST API for each record and combine the DataFrame with the result of the REST call.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;There are several ways to do this, but the map() function or a UDF seem appropriate:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/64191614/how-to-use-map-to-make-rest-api-calls-in-pyspark" target="_blank"&gt;https://stackoverflow.com/questions/64191614/how-to-use-map-to-make-rest-api-calls-in-pyspark&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78" target="_blank"&gt;https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/67204599/parallel-rest-api-request-using-sparkdatabricks" target="_blank"&gt;https://stackoverflow.com/questions/67204599/parallel-rest-api-request-using-sparkdatabricks&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I hope that puts you in the right direction.&lt;/P&gt;&lt;P&gt;If you want to use a loop after all, try adding a collect() before the foreach so all data gets sent to the driver. But that defeats the purpose of having multiple nodes.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 18:51:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13860#M8447</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-11T18:51:52Z</dc:date>
    </item>
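A minimal sketch of the UDF route suggested above: keep the row-level logic as a pure function that returns a value (JSON here) rather than writing a file, then wrap it in a `udf` and do one distributed write at the end. The injected `get` callable and the stubbed response shape are assumptions so the sketch runs without a network; the PySpark registration is left in comments since it needs a cluster.

```python
import json

def call_api(param1, param2, get):
    """Row-level worker logic as a pure function: return the result instead of
    writing a file. `get` is injected so this sketch runs without a network;
    the real version would call requests.get(...) and check response.status_code."""
    response = get(param1, param2)
    if response["status_code"] != 200:
        return None
    metrics = response["body"]["data"]["metrics"]
    return json.dumps({"metric1": metrics["metric1"],
                       "metric2": metrics["metric2"]})

# On a cluster this would be registered roughly as (assumed, not run here):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   api_udf = udf(lambda p1, p2: call_api(p1, p2, real_get), StringType())
#   result = paramDf.withColumn("metrics", api_udf("param1", "param2"))
#   result.write.csv("dbfs:/mnt/raw/Important/MetricData/", header=True)

# Self-contained demo with a stubbed GET response:
stub = lambda p1, p2: {"status_code": 200,
                       "body": {"data": {"metrics": {"metric1": 1, "metric2": 2}}}}
out = call_api("12", "25", stub)
```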
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13861#M8448</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;I don't want to use a loop at all. I tried the same thing using a UDF too, but again the write-to-blob part never happened inside the UDF.&lt;/P&gt;&lt;P&gt;I will try out the strategy you have suggested, and I am also looking at Python multithreading.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 05:36:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13861#M8448</guid>
      <dc:creator>halfwind22</dc:creator>
      <dc:date>2021-10-12T05:36:52Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13862#M8449</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;But I still have a question. Why is the write operation I am attempting not working in the first place? Is it something to do with how things are designed?&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 05:38:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13862#M8449</guid>
      <dc:creator>halfwind22</dc:creator>
      <dc:date>2021-10-12T05:38:13Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13863#M8450</link>
      <description>&lt;P&gt;OK, let's see.&lt;/P&gt;&lt;P&gt;You have a Spark dataframe, which is distributed by nature.&lt;/P&gt;&lt;P&gt;On the other hand, you use pandas, which is not distributed.&lt;/P&gt;&lt;P&gt;Also, your function is plain Python, not PySpark.&lt;/P&gt;&lt;P&gt;This will be processed by the driver, so it is not distributed.&lt;/P&gt;&lt;P&gt;However, the dataframe itself will be processed by the workers.&lt;/P&gt;&lt;P&gt;So the workers want to do something, but the code runs on the driver. It could be that. I am not sure, but the fact that it works on a single node makes me think your code is not executed in a distributed environment.&lt;/P&gt;&lt;P&gt;That is why I mentioned collect(). It will collect the Spark dataframe on the driver, so the workers are bypassed in your case.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Moving from Python to PySpark takes some time to understand.&lt;/P&gt;&lt;P&gt;This blog explains some interesting topics:&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/hashmapinc/5-steps-to-converting-python-jobs-to-pyspark-4b9988ad027a" target="_blank"&gt;https://medium.com/hashmapinc/5-steps-to-converting-python-jobs-to-pyspark-4b9988ad027a&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Using Koalas instead of pandas, for example.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 08:03:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13863#M8450</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-12T08:03:16Z</dc:date>
    </item>
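The collect()-then-loop fallback described above, sketched with stand-in rows (a `namedtuple` plays the role of `pyspark.sql.Row` here, since no cluster is assumed):

```python
from collections import namedtuple

# Stand-in for pyspark Rows; on a real cluster this list would come from
# paramDf.collect(), which pulls every row to the driver (no parallelism).
Row = namedtuple("Row", ["param1", "param2"])
rows = [Row("12", "25"), Row("45", "95")]

# Driver-side loop: each iteration runs in the driver's Python process, so
# local-filesystem writes behave exactly as they do on a single-node cluster.
processed = [(w.param1, w.param2) for w in rows]
```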
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13864#M8451</link>
      <description>&lt;P&gt;Use a Spark DataFrame instead.&lt;/P&gt;&lt;P&gt;It is also safer to use the DBFS path "dbfs:/mnt/raw/Important/MetricData...."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 13:36:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13864#M8451</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-12T13:36:22Z</dc:date>
    </item>
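For reference, the two path spellings involved here: Spark APIs take the `dbfs:/` URI form, while pandas and other local-file APIs on the driver go through the `/dbfs` FUSE mount. A small hypothetical helper to translate between them:

```python
# Spark APIs address DBFS with a URI scheme; local-file libraries such as
# pandas see the same storage through the /dbfs FUSE mount on the driver.
SPARK_PATH = "dbfs:/mnt/raw/Important/MetricData/listofmetrics.csv"
LOCAL_PATH = "/dbfs/mnt/raw/Important/MetricData/listofmetrics.csv"

def to_local(path: str) -> str:
    """Translate a dbfs:/ URI to the driver-local FUSE path (hypothetical helper)."""
    return path.replace("dbfs:/", "/dbfs/", 1)

# e.g. sparkDf.write.csv(SPARK_PATH, header=True)   # Spark write, distributed
#      pandasDf.to_csv(to_local(SPARK_PATH))        # pandas write, driver-only
```

Note that the question's relative path `../../dbfs/mnt/...` depends on the working directory, which is one more thing that differs between cluster modes; the absolute `/dbfs/...` form avoids that.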
    <item>
      <title>Re: Unable to write csv files to Azure BLOB using pandas to_csv()</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13865#M8452</link>
      <description>&lt;P&gt;@Hubert Dudek&amp;nbsp;I can't issue a Spark command on an executor node; it throws an error, because foreach distributes the processing.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 13:38:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-write-csv-files-to-azure-blob-using-pandas-to-csv/m-p/13865#M8452</guid>
      <dc:creator>halfwind22</dc:creator>
      <dc:date>2021-10-12T13:38:33Z</dc:date>
    </item>
  </channel>
</rss>

