
pyspark: How to run selenium in UDF

DievanB
New Contributor

Hi all,

I am building a web scraper to get the prices of certain EANs from the Amazon website. To do that, I use Selenium to get the product links. I wrote the following function to get the product link for a given EAN:

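from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# `driver` is assumed to be an already created Selenium WebDriver instance.
def getProductLinkAmazonPY(EAN):
  startURL = 'https://www.amazon.nl'
  driver.get(startURL)
  # Type the EAN into the search box and submit the search.
  element = driver.find_element(By.ID, 'twotabsearchtextbox')
  element.send_keys(EAN)
  element.send_keys(Keys.RETURN)
  # Collect the href of every result link on the results page.
  productPage = [elem.get_attribute("href") for elem in driver.find_elements(By.XPATH, "//*[@class='a-link-normal a-text-normal']")]
  # Return the first match together with its EAN; None when nothing was found.
  if productPage:
    return [productPage[0], EAN]
  return None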

Does anybody know how to run this function in parallel in PySpark using a UDF?

Thanks

1 REPLY

Hubert-Dudek
Esteemed Contributor III

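UDFs are serialized and then executed on the executors. I don't think that will be possible with Selenium, because the driver object your function relies on is a live browser session, and a live session cannot be serialized and shipped to the executors.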
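
To make the serialization point concrete, here is a minimal sketch that uses only the standard library. `FakeDriver` is a hypothetical stand-in for a Selenium WebDriver (it just holds a lock, which cannot be pickled, in place of a real browser connection), and `lookup_partition`, `can_pickle` and the sample EAN are made-up names and values for illustration. The sketch shows that a live, driver-like object does not survive pickling, whereas plain input data does, and that creating the driver inside the function avoids having to ship it at all.

```python
import pickle
import threading


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver."""

    def __init__(self):
        # A real WebDriver holds a connection to a browser process.
        # A lock plays that role here: a simple object that cannot be pickled.
        self.connection = threading.Lock()
        self.closed = False

    def quit(self):
        self.closed = True


def can_pickle(obj):
    """Return True if obj survives pickling, False if pickling raises."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


def lookup_partition(eans, make_driver=FakeDriver):
    """Create the driver inside the function, once per batch of EANs.

    Takes an iterator and yields results, so no driver object ever has
    to be serialized; only the input values are passed in.
    """
    driver = make_driver()
    try:
        for ean in eans:
            # Placeholder for the real lookup (e.g. getProductLinkAmazonPY).
            yield (ean, type(driver).__name__)
    finally:
        driver.quit()


live_driver = FakeDriver()
driver_ok = can_pickle(live_driver)        # a live session cannot be shipped
inputs_ok = can_pickle(["1234567890123"])  # plain data can
results = list(lookup_partition(["1234567890123"]))
```

A function shaped like `lookup_partition` is the form that `RDD.mapPartitions` accepts, so one possible direction is to start a headless browser inside such a function on each executor. That would assume Selenium and a browser binary are installed on every worker node, which is not something this thread confirms.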
