PySpark: How to run Selenium in a UDF

DievanB · ‎07-22-2021

Hi all,

I am building a web scraper to get the prices of certain EANs from the Amazon website, and I use Selenium to get the product links. I wrote the following function to get the product link for a given EAN:

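from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Assumption: the post does not show how driver is created; Chrome is used here as an example.
driver = webdriver.Chrome()

def getProductLinkAmazonPY(EAN):
  # Search amazon.nl for the EAN and return [first product link, EAN], or None if nothing is found.
  startURL = 'https://www.amazon.nl'
  driver.get(startURL)
  element = driver.find_element(By.ID, 'twotabsearchtextbox')
  element.send_keys(EAN)
  element.send_keys(Keys.RETURN)
  productPage = [elem.get_attribute("href") for elem in driver.find_elements(By.XPATH, "//*[@class='a-link-normal a-text-normal']")]
  if productPage:
    return [productPage[0], EAN]
  return None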

Does anybody know how to run this function in parallel in PySpark using a UDF?

Thanks

Hubert-Dudek · ‎11-27-2021

UDFs are serialized and then executed on the executors, and a Selenium WebDriver object holds a live browser session that cannot be serialized along with them. I don't think it will be possible with Selenium.
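To illustrate the serialization point, here is a minimal sketch that needs neither Spark nor Selenium. PySpark ships UDFs to executors with pickle-based serialization, and objects that own a live OS resource generally cannot be pickled. `FakeDriver` and `can_be_pickled` are hypothetical names made up for this example; the unconnected socket only stands in for a WebDriver's browser connection, and the EAN value is a placeholder.

```python
import pickle
import socket


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver: it owns a live OS resource."""

    def __init__(self):
        # An unconnected socket plays the role of the browser connection.
        self.connection = socket.socket()

    def close(self):
        self.connection.close()


def can_be_pickled(obj):
    """Return True if pickle can serialize obj, False if pickling raises."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


fake = FakeDriver()
driver_ok = can_be_pickled(fake)           # the object holding the socket
ean_ok = can_be_pickled("0000000000000")   # a plain string such as an EAN
fake.close()

print("driver-like object picklable:", driver_ok)
print("EAN string picklable:", ean_ok)
```

Plain values such as EAN strings travel to the executors without trouble; it is the driver object captured by the function that gets in the way.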
