pyspark: How to run selenium in UDF

DievanB — Thu, 22 Jul 2021 07:10:48 GMT

Hi all,

I am building a web scraper to get the prices of certain EANs from the Amazon website. I use Selenium to get the product links, and I wrote the following function to get the product link for a given EAN:

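from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # assumption: the post does not show how driver is created

def getProductLinkAmazonPY(EAN):
  startURL = 'https://www.amazon.nl'
  driver.get(startURL)
  # Type the EAN into the search box and submit the search
  element = driver.find_element(By.ID, 'twotabsearchtextbox')
  element.send_keys(EAN)
  element.send_keys(Keys.RETURN)
  # Collect the links of all search results
  productPage = [elem.get_attribute("href") for elem in driver.find_elements(By.XPATH, "//*[@class='a-link-normal a-text-normal']")]
  # Return the first hit; None when the search finds nothing
  if productPage:
    return [productPage[0], EAN]
  return None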

Does anybody know how to run this function in parallel in PySpark using a UDF?

Thanks

Re: pyspark: How to run selenium in UDF

Hubert-Dudek — Sat, 27 Nov 2021 17:03:16 GMT

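UDFs are serialized and then executed on the executors. A Selenium driver is bound to a live browser session on the machine where it was created, so I don't think it can be shipped to the executors along with the function.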
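
To illustrate the serialization point, here is a minimal sketch using only the Python standard library. FakeDriver, can_pickle and lookup_partition are hypothetical names made up for this example; FakeDriver merely stands in for an object that owns a live resource (a lock here, a browser session in the real case), and the returned URL is a placeholder. It shows that such an object cannot be pickled, while a function that creates the object itself, on the worker side, never needs to ship it.

```python
import pickle
import threading


class FakeDriver:
    """Hypothetical stand-in for a browser driver that owns a live resource."""

    def __init__(self):
        # A lock is used here only because, like a live session, it cannot be pickled.
        self._lock = threading.Lock()

    def lookup(self, ean):
        # Placeholder result; a real driver would load a page here.
        with self._lock:
            return f"https://example.invalid/product/{ean}"


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


def lookup_partition(eans):
    """Create the driver inside the function, once per batch of EANs,
    so that only plain strings cross the serialization boundary."""
    driver = FakeDriver()
    for ean in eans:
        yield (ean, driver.lookup(ean))


if __name__ == "__main__":
    print(can_pickle(FakeDriver()))            # the driver object itself
    print(can_pickle("4006381333931"))         # plain input data
    print(list(lookup_partition(["4006381333931"])))
```

In Spark, a generator function of this shape is what one would normally pass to rdd.mapPartitions. Whether a real Selenium driver can be started that way is an assumption not confirmed in this thread: it would require Selenium and a browser binary to be installed on every executor.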
