PySpark: How to run Selenium in a UDF
07-22-2021 12:10 AM
Hi all,
I am building a web scraper to get the prices of certain EANs from the Amazon website, and I use Selenium to get the product links. I wrote the following function to get the product link for a given EAN:
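from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def getProductLinkAmazonPY(EAN):
    # `driver` is a Selenium WebDriver created elsewhere (not shown in this post)
    startURL = 'https://www.amazon.nl'
    driver.get(startURL)
    # Type the EAN into the search box and submit the search
    element = driver.find_element(By.ID, 'twotabsearchtextbox')
    element.send_keys(EAN)
    element.send_keys(Keys.RETURN)
    # Collect the href of every search-result link
    productPage = [elem.get_attribute("href") for elem in driver.find_elements(By.XPATH, "//*[@class='a-link-normal a-text-normal']")]
    # Keep only the first hit; an empty list means nothing was found
    if productPage != []:
        productPage = productPage[0]
    return [productPage, EAN]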
Does anybody know how to run this function in parallel in PySpark using a UDF?
Thanks
11-27-2021 09:03 AM
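UDFs are serialized and then executed on the executors. I don't think that will be possible with Selenium, because the `driver` your function relies on is a live browser session that cannot be serialized along with it.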
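As a rough illustration of the serialization point, here is a minimal sketch using only Python's standard `pickle` module (Spark ships Python functions to the executors with a pickle-based serializer). `FakeDriver` and `can_be_pickled` are hypothetical names made up for this example, and the socket is only a stand-in for the connection a real Selenium driver keeps to its browser. This is not Selenium or Spark code, just a demonstration that an object owning a live OS handle does not survive pickling, while plain data such as a link and an EAN does.

```python
import pickle
import socket


class FakeDriver:
    """Hypothetical stand-in for a browser driver: it owns a live OS handle."""

    def __init__(self):
        # An unconnected socket is enough to illustrate the point;
        # no network traffic is sent.
        self.connection = socket.socket()

    def quit(self):
        self.connection.close()


def can_be_pickled(obj):
    """Return True if pickle can serialize obj, False if it refuses."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


driver = FakeDriver()
driver_ok = can_be_pickled(driver)  # the socket inside blocks pickling
driver.quit()

# Plain data (a link and an EAN, both made up here) pickles without trouble.
ean_ok = can_be_pickled(["https://www.amazon.nl", "1234567890123"])

print(driver_ok, ean_ok)
```

If this reasoning carries over to a real WebDriver, the obstacle is the driver object itself and not the strings the function takes and returns.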
My blog: https://databrickster.medium.com/