Errors Using Selenium/Chromedriver in DataBricks

Gray
Contributor

Hello,

I’m programming in a notebook and attempting to use the Python library Selenium to automate Chrome/chromedriver. I’ve successfully installed Selenium using

%sh
pip install selenium

I then ran the following code, which raises the WebDriverException copied below.

from selenium import webdriver
driver = webdriver.Chrome()

Error:

WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home

After troubleshooting the error, I instead tried using webdriver-manager to install chromedriver, while also running Chrome headless.

%sh
pip install webdriver-manager

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

This time, I got the following error:

WebDriverException: Message: Service /root/.wdm/drivers/chromedriver/linux64/107.0.5304/chromedriver unexpectedly exited. Status code was: 127

I’ve roamed the internet for a solution, but no matter what I try, my code ends up throwing one of the two WebDriverException errors above. 
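
A note on the second error: on Linux, exit status 127 often means the binary could not be started at all, for example because shared libraries it depends on are missing from the machine. The sketch below is one way to check for that, assuming `ldd` is available on the driver node. The helper names `missing_shared_libs` and `check_binary` are made up for illustration, and the path in the final comment is simply copied from the error message.

```python
import subprocess
from typing import List


def missing_shared_libs(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved libraries as: "libfoo.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_shared_libs(result.stdout)


# Hypothetical usage (adjust the path to your own install):
# print(check_binary("/root/.wdm/drivers/chromedriver/linux64/107.0.5304/chromedriver"))
```

If the list is non-empty, installing the packages that provide those libraries would be the next thing to try. An empty list does not rule out other causes.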

Does anybody know how I can get selenium running on DataBricks in order to automate Chrome/chromedriver?

Thanks!

1 ACCEPTED SOLUTION

Gray
Contributor

@Kaniz Fatma @Vidula Khanna @Hubert Dudek

My colleague and I were finally able to get Selenium running in a notebook. Although I can't explain in detail why this solution works, I have attached the source file below.

Hopefully this might help somebody in the future!

Cheers

24 REPLIES

Hubert-Dudek
Esteemed Contributor III

Maybe my guide on running Selenium on Databricks will help:

In the cluster's Libraries tab, install the PyPI package chromedriver-binary==83.0 (or higher; the version in the script can probably be updated as well).

Run the script below from a notebook to create the file "/databricks/scripts/selenium-install.sh".

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
dbutils.fs.put("/databricks/scripts/selenium-install.sh","""
#!/bin/bash
apt-get update
apt-get install chromium-browser=91.0.4472.101-0ubuntu0.18.04.1 --yes
wget https://chromedriver.storage.googleapis.com/91.0.4472.101/chromedriver_linux64.zip -O /tmp/chromedriver.zip
mkdir /tmp/chromedriver
unzip /tmp/chromedriver.zip -d /tmp/chromedriver/
""", True)
display(dbutils.fs.ls("dbfs:/databricks/scripts/"))

Please add "/databricks/scripts/selenium-install.sh" as an init script in the cluster configuration.

Later in the notebook, you can use Chrome as in the example below.

from selenium import webdriver
chrome_driver = '/tmp/chromedriver/chromedriver'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
# chrome_options.add_argument('--disable-dev-shm-usage') 
chrome_options.add_argument('--homedir=/dbfs/tmp')
chrome_options.add_argument('--user-data-dir=/dbfs/selenium')
# prefs = {"download.default_directory":"/dbfs/tmp",
#          "download.prompt_for_download":False
# }
# chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(executable_path=chrome_driver, options=chrome_options)
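
For readers on newer Selenium releases: the `executable_path` keyword was deprecated in Selenium 4 and removed in later 4.x releases, in favour of passing a `Service` object. Below is a rough sketch of the same setup in that style, assuming Selenium 4 is installed and the driver was unpacked to /tmp/chromedriver by the init script. `CHROME_ARGS` and `build_driver` are illustrative names, not part of Selenium.

```python
# Chrome flags copied from the example above.
CHROME_ARGS = [
    "--no-sandbox",
    "--headless",
    "--homedir=/dbfs/tmp",
    "--user-data-dir=/dbfs/selenium",
]


def build_driver(driver_path="/tmp/chromedriver/chromedriver"):
    """Start Chrome through a Service object (Selenium 4 style)."""
    # Imported here so the module still loads where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for arg in CHROME_ARGS:
        options.add_argument(arg)
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


# Hypothetical usage in a notebook cell:
# driver = build_driver()
# driver.get("https://example.com")
# driver.quit()
```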

Hi Hubert,

Thank you for your quick response! I've copied your code across to my notebook. However, when I run the following code

%sh
/dbfs/databricks/scripts/selenium-install.sh

I get the following output

Hit:1 https://repos.azul.com/zulu/deb stable InRelease
Hit:2 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Version '91.0.4472.101-0ubuntu0.18.04.1' for 'chromium-browser' was not found
/dbfs/databricks/scripts/selenium-install.sh: line 5: --yes: command not found
--2022-11-03 13:02:23--  https://chromedriver.storage.googleapis.com/91.0.4472.101/
Resolving chromedriver.storage.googleapis.com (chromedriver.storage.googleapis.com)... 209.85.202.128, 2a00:1450:400b:c01::80
Connecting to chromedriver.storage.googleapis.com (chromedriver.storage.googleapis.com)|209.85.202.128|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-11-03 13:02:24 ERROR 404: Not Found.
 
/dbfs/databricks/scripts/selenium-install.sh: line 7: chromedriver_linux64.zip: command not found
mkdir: invalid option -- 'd'
Try 'mkdir --help' for more information.

And consequently, when I run this code block:

from selenium import webdriver
chrome_driver = '/tmp/chromedriver/chromedriver'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
# chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--homedir=/dbfs/tmp')
chrome_options.add_argument('--user-data-dir=/dbfs/selenium')
# prefs = {"download.default_directory":"/dbfs/tmp",
#          "download.prompt_for_download":False
# }
# chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(executable_path=chrome_driver, options=chrome_options)

I receive the following error:

WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home

Is this something you can shed some light on for me please?

Thank you for your help!
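
The install log above suggests that neither the download nor the unzip step succeeded, so /tmp/chromedriver/chromedriver was probably never created. That would be consistent with the "needs to be in PATH" error. A small pre-flight check makes this visible before Selenium is invoked. This is only a sketch, and `driver_status` is a hypothetical helper name.

```python
import os


def driver_status(path: str) -> str:
    """Describe whether a chromedriver binary at `path` looks usable."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a file"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


# Path taken from the notebook code above.
print(driver_status("/tmp/chromedriver/chromedriver"))
```

Anything other than "ok" points at the install script rather than at the Selenium code.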

Anonymous
Not applicable

Hi @Henry Gray

Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Hubert-Dudek
Esteemed Contributor III

Hi @Henry Gray. I've created a new version of my guide to running Selenium on Databricks. Please look here: https://community.databricks.com/s/feed/0D58Y00009SWgVuSAL

luck_az
New Contributor III

Hi @Henry Gray, there is one command in your script that runs forever: [xvfb-run java -Dwebdriver.chrome.driver=/usr/bin/chromedriver -jar selenium-server.jar]. If I skip that command, my chromedriver does not work. Can you please suggest how to proceed?

Hi,

My colleague and I also found that this line ran indefinitely. We tinkered with the code and made the following changes to get it working.

1) Remove the following two portions of code:

%sh
wget https://github.com/SeleniumHQ/selenium/releases/download/selenium-4.1.0/selenium-server-4.1.2.jar
mv selenium-server-4.1.2.jar selenium-server.jar

%sh
sudo apt install xvfb
xvfb-run java -Dwebdriver.chrome.driver=/usr/bin/chromedriver -jar selenium-server.jar

2) Add the following code to the beginning:

%sh
sudo rm -r /var/lib/apt/lists/* 
sudo apt clean && 
  sudo apt update --fix-missing -y &&
  sudo apt install -y  libmysqlclient21
sudo apt install -y gdal-bin

Additionally, FYI, our Databricks Runtime version is 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).

I'm not sure why this works, but hopefully it will fix your issues.

Cheers!

luck_az
New Contributor III

Thanks, it worked. Great work.

luck_az
New Contributor III

Hi @Henry Gray, I want to access a VPN using Selenium in Databricks. Do you have any idea how we can do that?

acristinar
New Contributor II

This solution saved my life! Thank you so much for posting it!

SShiv
New Contributor II

I tried this script but got the following response. How do I fix this?

[Screenshot: databricks_snip]

Anonymous
Not applicable

I had the same issue. Try this, from this post:

%sh
sudo rm -r /var/lib/apt/lists/* 
sudo apt clean && 
   sudo apt update --fix-missing -y

 

keithkifo
New Contributor II

Hi Gray, I was looking for your script, but I don't think there is a file attached to your reply any longer. Would really love your help on this!

Kaizen
Valued Contributor

The attached source file seems to be missing.

Also, which cluster access mode are you running? Shared does not let us access the file system because it is protected, resulting in errors like:
WebDriverException: Message: Can not connect to the Service /databricks/.pyenv/bin/chromedriver

[Screenshot: Kaizen_0-1705602857683.png]

 

 

 

 
