- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-11-2021 01:42 AM
I am using a Py function to read some data from a GET endpoint and write them as a CSV file to a Azure BLOB location.
My GET endpoint takes 2 query parameters,param1 and param2. So initially, I have a dataframe paramDf that has two columns param1 and param2.
param1 param2
12 25
45 95
Schema: paramDF:pyspark.sql.dataframe.DataFrame
param1:string
param2:stringNow I write a function as below:
def executeRestApi(w):
dlist=[]
try:
response=requests.get(DataUrl.format(token=TOKEN, oid=w.param1,wid=w.param2))
if(response.status_code==200):
metrics=response.json()['data']['metrics']
dic={}
dic['metric1'] = metrics['metric1']
dic['metric2'] = metrics['metric2']
dlist.append(dic)
pandas.DataFrame(dlist).to_csv("../../dbfs/mnt/raw/Important/MetricData/listofmetrics_{}_{}.csv".format(param1,param2),header=True,index=False)
return "Success"
except Exception as e:
return "Failure"
Finally, invoke the method as:
paramDf.foreach(executeRestApi)So ,theoretically ,the function executeRestApi must be executed foe each row in the dataframe, and, within the function ,I extract the required data and write it to a ADLS location as a csv file.
All works good ,except that the file is never written when I execute the foreach command on a multi node cluster.
However the same operation works well on a single node cluster. I am unable to figure out the difference between the two approaches.
what could i be doing wrong here?
- Labels:
-
Adlsgen2
-
Azureblob
-
Foreachpartition