
Errors Using Selenium/Chromedriver in DataBricks

Gray
Contributor

Hello,

I’m working in a notebook and trying to use the Python library Selenium to automate Chrome/chromedriver. I’ve successfully installed Selenium using

%sh
 pip install selenium

I then attempt the following code, which results in the WebdriverException, copied below.

from selenium import webdriver
driver = webdriver.Chrome()

Error:

WebdriverException: Message: ‘chromedriver’ executable needs to be in PATH. Please see https://chromedriver.chromium.org/home
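One way to confirm what this error is complaining about is to check whether a `chromedriver` executable is visible on the search path at all. The following is only a minimal sketch using the standard library; the helper name `find_chromedriver` and its `search_path` parameter are made up for illustration.

```python
import shutil


def find_chromedriver(search_path=None):
    """Return the full path of a chromedriver executable, or None.

    search_path is an os.pathsep-separated string of directories;
    when it is None, the PATH environment variable is used.
    """
    return shutil.which("chromedriver", path=search_path)


location = find_chromedriver()
if location is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {location}")
```

If this reports that nothing was found, Selenium has no driver binary to launch, which is consistent with the message above.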

After troubleshooting the error, I attempted instead to use webdriver-manager to install the instance of chromedriver as follows, whilst also running it headless.

%sh
pip install webdriver-manager

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

This time, I got the following error:

WebdriverException: Message: Service /root/.wdm/drivers/chromedriver/linux64/107.0.5304/chromedriver unexpectedly exited. Status code was: 127

I’ve roamed the internet for a solution, but no matter what I try, my code ends up throwing one of the two WebDriverException errors above. 
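On Linux, an exit status of 127 usually means that a command, or a shared library it depends on, could not be found. Assuming that is what is happening here, running `ldd` on the downloaded binary and looking for "not found" entries may show which libraries are missing. This is a rough sketch; the function names are invented for illustration.

```python
import subprocess


def parse_missing_libs(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like: "\tlibnss3.so => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libs_for(binary_path):
    """Run ldd on a binary and list the shared libraries it cannot resolve."""
    result = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    return parse_missing_libs(result.stdout)
```

Calling `missing_libs_for()` with the chromedriver path from the error message would then list any unresolved libraries, which could be installed with the system package manager.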

Does anybody know how I can get selenium running on DataBricks in order to automate Chrome/chromedriver?

Thanks!

26 REPLIES

keithkifo
New Contributor II

Hi Gray, I was looking for your script, but it seems there is no longer any file attached to your reply. I would really appreciate your help on this!

Kaizen
Contributor III

The attached source file seems to be missing.

Also, what cluster access type are you running? Shared doesn't let us access the file system because it is protected, which results in errors like:
WebDriverException: Message: Can not connect to the Service /databricks/.pyenv/bin/chromedriver


dungruoc
New Contributor III

Hi @Gray ,

I cannot find your attached source file. It would be helpful, as I am facing the same issue.

Thanks,

What worked for me was this solution: https://stackoverflow.com/a/76515841/22103209

dungruoc
New Contributor III

Thank you!

Which Databricks runtime version did you use?

I am facing some trouble with apt-get, due to security I think, so it still fails at that step.

I used 10.4 ML.

dungruoc
New Contributor III

@Evan_MCK , @Kaizen : thank you! I could make it work with my Community Databricks account's cluster, from Evan's stackoverflow link.

Btw, I am facing the issue of security with my production cluster. Will need to work with that, but it should work in principle.

cheers,

This is what I'm currently using to install:

 

#!/bin/bash

# Script: selenium_init.sh
# Description: Installs Chrome and Chromedriver with path assignment for Selenium.

set -x

sudo apt update
sudo apt upgrade -y 2>upgrade_errors.log

# install required Chrome libraries (-y so the script never waits for a prompt)
sudo apt-get update
sudo apt-get install fonts-liberation -y
sudo apt-get install libgbm1 -y
sudo apt-get install libu2f-udev -y

# create directory
TMP_DIR="tmp"
CHROME_DIR="${TMP_DIR}/chrome"
CHROMEDRIVER_DIR="${TMP_DIR}/chromedriver"

# do a clean install - delete all old drivers
rm -rf ${CHROME_DIR}
rm -rf ${CHROMEDRIVER_DIR}

mkdir -p ${CHROME_DIR}
mkdir -p ${CHROMEDRIVER_DIR}

cd ${CHROME_DIR}

# alternative google chrome download below is better for supporting latest version
# sudo dpkg -i google-chrome-stable_current_amd64.deb
# sudo apt-get install -f -y
# sudo apt-get update --fix-missing

# download chrome from same source as chromedriver
# NOTE: chrome-linux64.zip must already be present in this directory
unzip chrome-linux64.zip
./chrome-linux64/chrome

google-chrome --version || echo "look into google-chrome install"

# install chromedriver -> find all driver files here: https://googlechromelabs.github.io/chrome-for-testing/
cd ..
cd ..
cd ${CHROMEDRIVER_DIR}

# download chromedriver
# NOTE: chromedriver-linux64.zip must already be present in this directory
unzip chromedriver-linux64.zip
sudo chown root:root /databricks/driver/tmp/chromedriver
sudo chmod +x /databricks/driver/tmp/chromedriver

# TODO: path assignment is not persistent -> likely a permissions issue or a Databricks compute engine limitation
# Current workaround -> move to preassigned system path
# sudo export PATH=$PATH:/databricks/driver/tmp/chromedriver/chromedriver-linux64/

# copy file to a location defined in path
pwd
sudo cp -r chromedriver-linux64/chromedriver /databricks/.pyenv/bin
cd ..
sudo cp -r chrome/chrome-linux64 /databricks/.pyenv/bin

echo "copied file over- new file dir"
cd /databricks/.pyenv/bin
ls

echo "Script execution completed successfully"
echo "Chrome version:" $(google-chrome --version)
echo "Chromedriver version:" $(chromedriver --version)
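The two version lines printed at the end of the script are worth comparing, because chromedriver generally expects a Chrome build with the same major version. A small sketch of that comparison, assuming the usual `--version` output formats (for example `Google Chrome 120.0.6099.109` and `ChromeDriver 120.0.6099.109 (...)`); the function names are hypothetical.

```python
import re


def major_version(version_output):
    """Extract the major version number from `--version` output."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, chromedriver_output):
    """True when Chrome and chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(chromedriver_output)
```

In a notebook, the two strings could be collected with `subprocess.run([...], capture_output=True, text=True).stdout` and passed to `versions_match()` before starting a Selenium session.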

Kaizen
Contributor III

dungruoc
New Contributor III

@Kaizen , @Evan_MCK : I refactored a notebook here with the elements collected from your posts. It works.

# imports needed for notebook
from datetime import datetime
import dateutil.relativedelta
import os
import time
import urllib.request, json 

def get_latest_driver_url():
  with urllib.request.urlopen("https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json") as url:
      data = json.load(url)
      print(data['channels']['Stable']['version'])
      url = data['channels']['Stable']['downloads']['chromedriver'][0]['url']
      # print(url)
      # set the url as environment variable to use in scripting 
      # os.environ['latest_chromedriver_url']= url
      return url
    
latest_chromedriver_url = get_latest_driver_url()
print(latest_chromedriver_url)
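The function above takes the first entry in the chromedriver download list, which relies on linux64 being listed first. A slightly more defensive variant could select the entry by its `platform` field instead. This is a sketch only; `pick_download_url` and `fetch_versions` are names invented here, and the JSON layout is assumed to be the one the function above already relies on.

```python
import json
import urllib.request

VERSIONS_URL = (
    "https://googlechromelabs.github.io/chrome-for-testing/"
    "last-known-good-versions-with-downloads.json"
)


def pick_download_url(data, artifact="chromedriver", platform="linux64", channel="Stable"):
    """Return the download URL for one artifact and platform from the parsed JSON."""
    for entry in data["channels"][channel]["downloads"][artifact]:
        if entry["platform"] == platform:
            return entry["url"]
    raise LookupError(f"no {artifact} download listed for platform {platform!r}")


def fetch_versions(url=VERSIONS_URL):
    """Download and parse the version listing."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

Usage would be `pick_download_url(fetch_versions())`. Passing `artifact="chrome"` should return the matching Chrome build from the same listing, assuming that key is present.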

Make an init script

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
dbutils.fs.put("/databricks/scripts/selenium-install.sh",f"""#!/bin/bash

latest_chromedriver_url="{latest_chromedriver_url}"
wget -N $latest_chromedriver_url  -O /tmp/chromedriver_linux64.zip
rm -rf /tmp/chromedriver/
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/

sudo apt-get clean && sudo apt-get update --fix-missing -y

curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list > /dev/null
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
""", True)

display(dbutils.fs.ls("dbfs:/databricks/scripts/"))
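Building the script text in a small helper makes it easy to inspect before writing it to DBFS, for instance to check that the shebang is the very first line of the file. The sketch below covers only the chromedriver download part of the script; `build_install_script` is a hypothetical name, and the result would still need to be passed to `dbutils.fs.put` as above.

```python
import textwrap


def build_install_script(chromedriver_url):
    """Return shell script text that downloads and unpacks chromedriver."""
    return textwrap.dedent(f"""\
        #!/bin/bash
        set -e
        wget -q "{chromedriver_url}" -O /tmp/chromedriver_linux64.zip
        rm -rf /tmp/chromedriver/
        unzip -o /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
        """)


script_text = build_install_script("https://example.com/chromedriver-linux64.zip")
print(script_text)
```

The example URL is a placeholder; in the notebook it would be the value of `latest_chromedriver_url`.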

Run it (or put in cluster init script config to automatically run it at cluster start)

%sh
/dbfs/databricks/scripts/selenium-install.sh

Install Selenium and restart the Python kernel (or add it as a PyPI package to install at cluster start)

%pip install selenium
dbutils.library.restartPython()

Init the driver

# imports needed for notebook
from datetime import datetime
import dateutil.relativedelta
import os
import time
import urllib.request, json 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait

def init_chrome_browser(download_path, chrome_driver_path,  url):
     
    options = Options()
    prefs = {
        'download.default_directory': download_path,
        'profile.default_content_setting_values.automatic_downloads': 1,
        'download.prompt_for_download': False,
        'download.directory_upgrade': True,
        'safebrowsing.enabled': True,
        'translate_whitelists': {'vi': 'en'},
        'translate': {'enabled': 'true'},
    }
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')    # required on Databricks: there is no display for a browser window
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('--window-size=2560,1440')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--ignore-ssl-errors')
    options.add_argument('--lang=en')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Browser ready to use.")
    return browser

driver = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver-linux64/chromedriver",
    url= "https://www.google.com"
)

Test it

from selenium.webdriver.common.by import By

driver.find_element(By.CSS_SELECTOR, "img").get_attribute("alt")

Close the driver

driver.quit()
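If a cell fails before `driver.quit()` is reached, the Chrome process can be left running on the driver node. One possible way to guard against that is a small context manager that always calls `quit()`. This is a sketch with a made-up helper name, `managed_browser`; it works with any factory that returns an object with a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory, *args, **kwargs):
    """Create a browser with factory(*args, **kwargs) and always quit it."""
    browser = factory(*args, **kwargs)
    try:
        yield browser
    finally:
        # Runs on normal exit and when the body raises.
        browser.quit()


# Intended usage with the function defined earlier in this notebook:
#
# with managed_browser(
#     init_chrome_browser,
#     download_path="/tmp/downloads",
#     chrome_driver_path="/tmp/chromedriver/chromedriver-linux64/chromedriver",
#     url="https://www.google.com",
# ) as driver:
#     print(driver.title)
```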

 

aa_204
New Contributor II

I also tried the script and am getting a similar error. Can anyone please suggest a resolution?

The errors are "Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/s/systemd/udev_245.4-4ubuntu3.18_amd64.deb" and "Unable to fetch some archives".

EB613
New Contributor II

I had the same issue. Try this, as I answered in a previous question:

from this post

%sh
sudo rm -r /var/lib/apt/lists/* 
sudo apt clean && 
   sudo apt update --fix-missing -y