Databricks Community

Gray · ‎11-01-2022

Hello,

I’m programming in a notebook and attempting to use the Python library Selenium to automate Chrome/chromedriver. I’ve successfully installed Selenium using

%sh
 pip install selenium

I then run the following code, which raises the WebDriverException copied below.

from selenium import webdriver
driver = webdriver.Chrome()

Error:

WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home
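The message says Selenium could not find a chromedriver executable on the PATH of the Python process. A minimal sketch for checking what is actually visible from the notebook, using only the standard library (the helper name `find_browser_binaries` and the list of binary names are illustrative assumptions):

```python
import shutil

def find_browser_binaries(names=("chromedriver", "google-chrome", "chrome")):
    """Return {name: absolute path or None} for each executable looked up on PATH."""
    return {name: shutil.which(name) for name in names}

# Print what the current Python process can see
for name, path in find_browser_binaries().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If chromedriver shows up as not found, the driver either is not installed or lives in a directory that is not on PATH, in which case its full path has to be passed to Selenium explicitly.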

After troubleshooting the error, I attempted instead to use webdriver-manager to install the instance of chromedriver as follows, whilst also running it headless.

%sh
pip install webdriver-manager

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument("--headless")
 
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

This time, I got the following error:

WebDriverException: Message: Service /root/.wdm/drivers/chromedriver/linux64/107.0.5304/chromedriver unexpectedly exited. Status code was: 127
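Exit status 127 from a freshly downloaded binary often means the dynamic loader could not find a shared library the binary depends on. That is an assumption about this cluster, not something the error message confirms. A minimal sketch for checking it: run `ldd` on the driver and collect the libraries reported as "not found". The helper names are hypothetical, and the sample text is illustrative rather than output from the cluster in question.

```python
import subprocess

def missing_shared_libs(ldd_output: str) -> list:
    """Extract the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing

def check_driver(path: str) -> list:
    """Run ldd on the binary at `path` and return its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_shared_libs(result.stdout)

# Illustrative ldd-style text, not real output from the poster's cluster
sample = (
    "\tlibnss3.so => not found\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000000000)\n"
)
print(missing_shared_libs(sample))
```

On the cluster this would be called as `check_driver("/root/.wdm/drivers/chromedriver/linux64/107.0.5304/chromedriver")`. Anything it lists would need to be installed with apt-get before the driver can start.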

I’ve roamed the internet for a solution, but no matter what I try, my code ends up throwing one of the two WebDriverException errors above.

Does anybody know how I can get Selenium running on Databricks in order to automate Chrome/chromedriver?

Thanks!

dungruoc · ‎03-13-2024

Hi @Gray ,

I can't find your attached source file. It would be helpful, as I am facing the same issue.

Thanks,

Evan_MCK · ‎03-13-2024

what worked for me was this solution: https://stackoverflow.com/a/76515841/22103209

dungruoc · ‎03-13-2024

Thank you!

Which Databricks runtime version did you use?

I am facing some trouble with apt-get, due to security I think, so it still fails at that step.

Evan_MCK · ‎03-14-2024

I used 10.4 ML.

dungruoc · ‎03-13-2024

@Evan_MCK, @Kaizen: thank you! I got it working on my Community Databricks account's cluster, using the approach from Evan's Stack Overflow link.

Btw, I am still running into security restrictions on my production cluster. I will need to work through that, but the approach should work in principle.

cheers,

Kaizen · ‎03-13-2024

This is what I'm currently using to install it:

#!/bin/bash
# Script: selenium_init.sh
# Description: Installs Chrome and Chromedriver with path assignment for Selenium.

set -x

sudo apt update
sudo apt upgrade -y 2>upgrade_errors.log

# install required chrome libraries
sudo apt-get update
sudo apt-get install fonts-liberation -y
sudo apt-get install libgbm1 -y
sudo apt-get install libu2f-udev -y

# create directories (absolute path, so the cd commands below work from any starting directory)
TMP_DIR="/databricks/driver/tmp"
CHROME_DIR="${TMP_DIR}/chrome"
CHROMEDRIVER_DIR="${TMP_DIR}/chromedriver"

# do a clean install - delete all old drivers
rm -rf "${CHROME_DIR}"
rm -rf "${CHROMEDRIVER_DIR}"
mkdir -p "${CHROME_DIR}"
mkdir -p "${CHROMEDRIVER_DIR}"

# install chrome -> ref https://www.baeldung.com/linux/chrome-installation-terminal
cd "${CHROME_DIR}"

# alternative google chrome download below is better for supporting latest version
# wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
# sudo dpkg -i google-chrome-stable_current_amd64.deb
# sudo apt-get install -f -y
# sudo apt-get update --fix-missing

# download chrome from same source as chromedriver
wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chrome-linux64.zi...
unzip chrome-linux64.zip
./chrome-linux64/chrome --version
google-chrome --version || echo "look into google-chrome install"

# install chromedriver -> find all driver files here: https://googlechromelabs.github.io/chrome-for-testing/
cd "${CHROMEDRIVER_DIR}"

# download chromedriver
curl -SL https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chromedriver-linu... -o chromedriver-linux64.zip
unzip chromedriver-linux64.zip
sudo chown -R root:root "${CHROMEDRIVER_DIR}"
sudo chmod +x "${CHROMEDRIVER_DIR}/chromedriver-linux64/chromedriver"

# TODO: path assignment is not persistent -> Likely permissions issue or databricks compute engine limitation
# Current workaround -> move to preassigned system path
# sudo export PATH=$PATH:/databricks/driver/tmp/chromedriver/chromedriver-linux64/

# copy file to a location defined in path
pwd
sudo cp -r chromedriver-linux64/chromedriver /databricks/.pyenv/bin
cd ..
sudo cp -r chrome/chrome-linux64 /databricks/.pyenv/bin
echo "copied file over- new file dir"
cd /databricks/.pyenv/bin
ls

echo "Script execution completed successfully"
echo "Chrome version:" $(google-chrome --version)
echo "Chromedriver version:" $(chromedriver --version)
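The two version lines printed at the end are worth comparing, because chromedriver generally expects a Chrome build with the same major version. A minimal sketch of that comparison in Python, assuming the `--version` output contains a four-part version number such as `120.0.6099.109` (the function names are hypothetical):

```python
import re

def major_version(version_output: str) -> int:
    """Pull the major version out of text such as 'Google Chrome 120.0.6099.109'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))

def versions_compatible(chrome_output: str, driver_output: str) -> bool:
    """True when Chrome and chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)

print(versions_compatible("Google Chrome 120.0.6099.109", "ChromeDriver 120.0.6099.109"))
```

In a notebook, the two input strings could come from running `chromedriver --version` and the Chrome binary with `--version` through `subprocess.run`.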

Kaizen · ‎03-13-2024

Also check out Playwright; it's a lot easier to install.

https://community.databricks.com/t5/community-discussions/using-python-rpa-library-on-databricks/td-...

dungruoc · ‎03-15-2024

@Kaizen, @Evan_MCK: I put together a notebook from the pieces collected from your posts. It works.

# imports needed for notebook
from datetime import datetime
import dateutil.relativedelta
import os
import time
import urllib.request, json 

def get_latest_driver_url():
    # Chrome for Testing feed listing the last known good build per release channel
    feed_url = "https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json"
    with urllib.request.urlopen(feed_url) as response:
        data = json.load(response)
    print(data['channels']['Stable']['version'])
    # the first chromedriver entry is used as-is
    url = data['channels']['Stable']['downloads']['chromedriver'][0]['url']
    # print(url)
    # set the url as environment variable to use in scripting
    # os.environ['latest_chromedriver_url'] = url
    return url
    
latest_chromedriver_url = get_latest_driver_url()
print(latest_chromedriver_url)
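Taking element `[0]` relies on the Linux build being listed first in the feed. A minimal sketch of selecting the build by platform name instead, assuming each download entry carries a `platform` key next to `url` (the `platform` key, the helper name `pick_download_url` and the placeholder URLs are assumptions for illustration):

```python
def pick_download_url(data: dict, binary: str = "chromedriver",
                      platform: str = "linux64", channel: str = "Stable") -> str:
    """Return the download URL for `binary` on `platform` from the parsed feed."""
    for entry in data["channels"][channel]["downloads"][binary]:
        if entry["platform"] == platform:
            return entry["url"]
    raise LookupError(f"no {binary} download for {platform} in channel {channel}")

# Trimmed-down stand-in for the feed, with placeholder URLs
sample = {"channels": {"Stable": {"version": "0.0.0.0", "downloads": {"chromedriver": [
    {"platform": "mac-x64", "url": "https://example.invalid/mac.zip"},
    {"platform": "linux64", "url": "https://example.invalid/linux.zip"},
]}}}}
print(pick_download_url(sample))
```

This way a change in the order of the feed entries raises no silent mismatch: either the linux64 URL is returned or a `LookupError` is raised.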

Make an init script

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
# the shebang has to be the very first line of the file, so the string starts with it
dbutils.fs.put("dbfs:/databricks/scripts/selenium-install.sh", f"""#!/bin/bash

latest_chromedriver_url="{latest_chromedriver_url}"
wget "$latest_chromedriver_url" -O /tmp/chromedriver_linux64.zip
rm -rf /tmp/chromedriver/
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/

sudo apt-get clean && sudo apt-get update --fix-missing -y

# sudo must apply to the command that writes (apt-key, tee), not to curl or echo
curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
""", True)

display(dbutils.fs.ls("dbfs:/databricks/scripts/"))

Run it (or add it to the cluster's init script configuration so it runs automatically at cluster start)

%sh
/dbfs/databricks/scripts/selenium-install.sh

Install Selenium and restart the Python kernel (or add it as a PyPI library on the cluster so it is installed at cluster start)

%pip install selenium

dbutils.library.restartPython()

Init the driver

# imports needed for notebook
from datetime import datetime
import dateutil.relativedelta
import os
import time
import urllib.request, json 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait

def init_chrome_browser(download_path, chrome_driver_path, url):
    options = Options()
    prefs = {
        "download.default_directory": download_path,
        "profile.default_content_setting_values.automatic_downloads": 1,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
        "translate_whitelists": {"vi": "en"},
        "translate": {"enabled": "true"},
    }
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')    # required on Databricks: there is no display to show a browser window
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('--window-size=2560,1440')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--ignore-ssl-errors')
    options.add_argument('--lang=en')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Browser ready to use.")
    return browser

driver = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver-linux64/chromedriver",
    url= "https://www.google.com"
)

Test it

from selenium.webdriver.common.by import By

driver.find_element(By.CSS_SELECTOR, "img").get_attribute("alt")

Close the driver

driver.quit()

aa_204 · ‎12-15-2022

I also tried the script and am getting a similar error. Can anyone please suggest a fix?

The error is: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/s/systemd/udev_245.4-4ubuntu3.18_amd64.deb and Unable to fetch some archives

Anonymous · ‎06-27-2023

I had the same issue. Try this, which I also posted as an answer to a previous question:

%sh
sudo rm -r /var/lib/apt/lists/* 
sudo apt clean && 
   sudo apt update --fix-missing -y


Errors Using Selenium/Chromedriver in DataBricks
