Hi all,
I am building a web scraper to get the prices of certain EANs from the Amazon website. For this I use Selenium to get the product links. I wrote the following function to get the product link for a given EAN:
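
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def getProductLinkAmazonPY(EAN):
    # `driver` is a Selenium WebDriver created elsewhere in the script.
    startURL = 'https://www.amazon.nl'
    driver.get(startURL)
    # Type the EAN into the search box and submit the search.
    element = driver.find_element(By.ID, 'twotabsearchtextbox')
    element.send_keys(EAN)
    element.send_keys(Keys.RETURN)
    # Collect the links of all search results.
    productPage = [elem.get_attribute("href") for elem in driver.find_elements(By.XPATH, "//*[@class='a-link-normal a-text-normal']")]
    # Keep only the first hit; an empty list means nothing was found.
    if productPage:
        productPage = productPage[0]
    return [productPage, EAN]
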
Does anybody know how to run this function in parallel in PySpark using a UDF?
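
To make the question concrete, here is a rough sketch of the direction I am considering. It is not a tested solution. It uses mapPartitions rather than a UDF, on the assumption that a Selenium driver cannot be serialized and shipped to the executors, so each partition would have to create its own. The names scrape_partition, make_driver, lookup and run_on_spark are made up for illustration; lookup stands for a hypothetical variant of my function that takes the driver as an argument instead of using a global one.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def scrape_partition(
    eans: Iterable[str],
    make_driver: Callable[[], object],
    lookup: Callable[[object, str], object],
) -> Iterator[Tuple[str, object]]:
    """Look up every EAN of one partition with a single shared driver."""
    # Assumption: the driver is created on the worker, once per partition,
    # because a live browser session cannot be sent over from the Spark driver.
    driver = make_driver()
    try:
        for ean in eans:
            yield ean, lookup(driver, ean)
    finally:
        # Close the browser once the partition is done or an error occurs.
        driver.quit()


def run_on_spark(
    eans: List[str],
    make_driver: Callable[[], object],
    lookup: Callable[[object, str], object],
    num_partitions: int = 4,
) -> List[Tuple[str, object]]:
    """Distribute the EANs over partitions and scrape each one separately."""
    # Imported here so scrape_partition can be tried without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(eans, num_partitions)
    return rdd.mapPartitions(
        lambda part: scrape_partition(part, make_driver, lookup)
    ).collect()
```

Here make_driver would be a small function that starts a browser on the worker, which means Selenium and a browser would have to be installed on every executor. I do not know whether that is practical, or whether a UDF would be the better fit.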
Thanks