Unable to write CSV files to Azure Blob using pandas to_csv()

halfwind22 · ‎10-11-2021

I am using a Python function to read data from a GET endpoint and write it as a CSV file to an Azure Blob location.

My GET endpoint takes two query parameters, param1 and param2. So initially I have a dataframe, paramDf, that has two columns, param1 and param2.

param1   param2
12        25
45        95
 
Schema:    paramDf:pyspark.sql.dataframe.DataFrame
           param1:string
           param2:string

I then define a function as below:

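import pandas
import requests

# DataUrl and TOKEN are assumed to be defined earlier in the notebook.
def executeRestApi(w):
    dlist = []
    try:
        # Call the GET endpoint with the two parameters from the current row.
        response = requests.get(DataUrl.format(token=TOKEN, oid=w.param1, wid=w.param2))
        if response.status_code == 200:
            metrics = response.json()['data']['metrics']
            dic = {}
            dic['metric1'] = metrics['metric1']
            dic['metric2'] = metrics['metric2']
            dlist.append(dic)
        # Write the collected metrics for this row to a CSV file on the mount.
        out_path = "../../dbfs/mnt/raw/Important/MetricData/listofmetrics_{}_{}.csv".format(w.param1, w.param2)
        pandas.DataFrame(dlist).to_csv(out_path, header=True, index=False)
        return "Success"
    except Exception as e:
        return "Failure"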

Finally, I invoke the function as:

paramDf.foreach(executeRestApi)

So, in theory, the function executeRestApi should be executed for each row in the dataframe, and within the function I extract the required data and write it to an ADLS location as a CSV file.

Everything works, except that the file is never written when I execute the foreach command on a multi-node cluster.

However, the same operation works well on a single-node cluster. I am unable to figure out the difference between the two setups.
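One difference that may be worth checking is how the relative path in to_csv resolves. The function passed to foreach runs in the executor processes, and a relative path is resolved against the working directory of whichever process runs the code. The sketch below only illustrates that mechanism. It assumes the driver runs in /databricks/driver, and the executor directory shown is hypothetical and made up for illustration; the helper name resolve is also not from the original code.

```python
import posixpath

# The relative directory used in the to_csv call of the question.
RELATIVE_TARGET = "../../dbfs/mnt/raw/Important/MetricData"

def resolve(cwd, relative=RELATIVE_TARGET):
    """Return the absolute path that `relative` points to when a process runs in `cwd`."""
    return posixpath.normpath(posixpath.join(cwd, relative))

# Assumption: the driver process runs in /databricks/driver.
driver_cwd = "/databricks/driver"
# Hypothetical executor working directory, for illustration only.
executor_cwd = "/local_disk0/hypothetical/executor/work"

print(resolve(driver_cwd))    # /dbfs/mnt/raw/Important/MetricData
print(resolve(executor_cwd))  # /local_disk0/hypothetical/dbfs/mnt/raw/Important/MetricData
```

If the working directories really do differ between driver and workers, an absolute path such as one starting with /dbfs/mnt/ would remove that dependency. Printing os.getcwd() from inside the function on each cluster type would confirm or rule this out.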

What could I be doing wrong here?
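A possible way to narrow this down: DataFrame.foreach does not return the values produced by the function, so the "Success" and "Failure" strings are never seen, and the broad except hides whatever error occurred on the workers. The sketch below is a minimal illustration, not the original code. It uses a hypothetical helper, write_metrics, that insists on an absolute output directory and lets any write error propagate. It also swaps pandas for the standard csv module purely to keep the example self-contained, and a temporary directory stands in for the mounted location.

```python
import csv
import os
import tempfile

def write_metrics(rows, out_dir, param1, param2):
    """Write `rows` (a list of dicts) to a CSV file under `out_dir` and return its path.

    Errors are deliberately not caught, so a failed write raises instead of
    being turned into a "Failure" string that foreach would discard.
    """
    if not os.path.isabs(out_dir):
        raise ValueError("out_dir must be absolute, got: {}".format(out_dir))
    path = os.path.join(out_dir, "listofmetrics_{}_{}.csv".format(param1, param2))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["metric1", "metric2"])
        writer.writeheader()
        writer.writerows(rows)
    return path

# Local demonstration: a temporary directory stands in for the mount.
demo_dir = tempfile.mkdtemp()
demo_path = write_metrics([{"metric1": 1, "metric2": 2}], demo_dir, "12", "25")
print(demo_path)
```

With the exception no longer swallowed, a failing write inside foreach would fail the Spark task, and the underlying error should then show up in the job's error output rather than disappearing silently.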
