
Hubert-Dudek
Esteemed Contributor III

Selenium chrome driver on databricks driver

On the Databricks community, I see repeated problems with installing Selenium on the Databricks driver. Installing Selenium on Databricks may seem surprising, but sometimes we need to grab datasets that sit behind fancy authentication, and Selenium is the most accessible tool for that. Of course, always check the simplest alternatives first. For example, if we only need to download an HTML file, we can use SparkContext.addFile() or just the requests library. If we need to parse HTML without simulating user actions or downloading complicated pages, we can use BeautifulSoup. Please remember that Selenium runs on the driver only (workers are not utilized), so for the Selenium part a single-node cluster is the preferred setting.
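
As a rough illustration of the "check simpler alternatives first" advice, the sketch below extracts link texts from a static HTML string using only Python's standard-library html.parser. It is a stand-in for BeautifulSoup, which would do the same in fewer lines. The LinkCollector class, the extract_links helper, and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


sample = (
    '<ul><li><a href="/post/1">First post</a></li>'
    '<li><a href="/post/2">Second <b>post</b></a></li></ul>'
)
print(extract_links(sample))
```

If a page can be handled this way, no browser is needed on the cluster at all.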

Installation

The easiest solution is to use apt-get to install Ubuntu packages, but the version in the Ubuntu repo is often outdated. Recently that solution stopped working for me, so I decided to take a different approach and get the driver and binaries from chromium-browser-snapshots: https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html The script below downloads the newest version of the browser binaries and the driver. Everything is saved to the /tmp/chrome directory. We must also set the Chrome home directory to /tmp/chrome/chrome-user-data-dir. Sometimes Chromium complains about missing libraries, which is why we also install libgbm-dev. The code below creates a bash file implementing these steps.

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
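dbutils.fs.put("/databricks/scripts/selenium-install.sh","""#!/bin/bash
# The shebang has to be the very first line of the file, so it follows the opening quotes directly.
LAST_VERSION="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2FLAST_CHANGE?alt=media"
VERSION=$(curl -s -S "$LAST_VERSION")
# Skip the download when this snapshot is already unpacked under /tmp/chrome.
if [ -d "/tmp/chrome/$VERSION" ] ; then
  echo "version already installed"
  exit
fi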
 
rm -rf /tmp/chrome/$VERSION
mkdir -p /tmp/chrome/$VERSION
 
URL="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$VERSION%2Fchrome-linux.zip?alt=media"
ZIP="${VERSION}-chrome-linux.zip"
 
curl -# $URL > /tmp/chrome/$ZIP
unzip /tmp/chrome/$ZIP -d /tmp/chrome/$VERSION
 
URL="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$VERSION%2Fchromedriver_linux64.zip?alt=media"
ZIP="${VERSION}-chromedriver_linux64.zip"
 
curl -# $URL > /tmp/chrome/$ZIP
unzip /tmp/chrome/$ZIP -d /tmp/chrome/$VERSION
 
mkdir -p /tmp/chrome/chrome-user-data-dir
 
rm -f /tmp/chrome/latest
ln -s /tmp/chrome/$VERSION /tmp/chrome/latest
 
# to avoid errors about missing libraries
sudo apt-get update
sudo apt-get install -y libgbm-dev
""", True)
display(dbutils.fs.ls("dbfs:/databricks/scripts/"))

The script was saved to DBFS storage as /dbfs/databricks/scripts/selenium-install.sh. We can set it as an init script for the cluster. Click your cluster in "compute" -> click "Edit" -> "configuration" tab -> scroll down to "Advanced options" -> click "Init Scripts" -> select "DBFS" and set "Init script path" as "/dbfs/databricks/scripts/selenium-install.sh" -> click "add".
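
If the cluster is defined through the Clusters API or a job definition instead of the UI, the same setting is usually expressed as an init_scripts entry. The fragment below is only a sketch under that assumption: the script path comes from this post, while the surrounding field names should be checked against the API version you use.

```python
import json

# Assumed shape of a cluster spec fragment; only the script path comes from the post.
cluster_spec_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/scripts/selenium-install.sh"}}
    ]
}

print(json.dumps(cluster_spec_fragment, indent=2))
```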

If you haven't set the init script, please run the command below.

%sh
/dbfs/databricks/scripts/selenium-install.sh
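
Once the script has run (as an init script or manually), a quick check that the files landed where the rest of this post expects them can save some debugging. This is a minimal sketch: the two relative paths are the ones used in the webdriver setup in this post, and missing_files is a made-up helper name.

```python
from pathlib import Path

# Paths relative to /tmp/chrome/latest, as used by the webdriver setup in this post.
EXPECTED_FILES = (
    "chrome-linux/chrome",
    "chromedriver_linux64/chromedriver",
)


def missing_files(base_dir="/tmp/chrome/latest"):
    """Return the expected files that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing:", missing)
else:
    print("Chromium and chromedriver are in place")
```

An empty list means both the browser binary and the driver were unpacked and the "latest" symlink resolves.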

Now we can install selenium. Click your cluster in "compute" -> click "Libraries" -> click "Install new" -> click "PyPI" -> set "Package" as "selenium" -> click "install".

Alternatively (which is less convenient), you can install it every time in your notebook by running the command below.

%pip install selenium

So let's start the webdriver. Service and binary_location point to the driver and the browser binaries that our script downloaded and unpacked.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
s = Service('/tmp/chrome/latest/chromedriver_linux64/chromedriver')
options = webdriver.ChromeOptions()
options.binary_location = "/tmp/chrome/latest/chrome-linux/chrome"
options.add_argument('headless')
options.add_argument('--disable-infobars')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--homedir=/tmp/chrome/chrome-user-data-dir')
options.add_argument('--user-data-dir=/tmp/chrome/chrome-user-data-dir')
prefs = {"download.default_directory":"/tmp/chrome/chrome-user-data-dir",
         "download.prompt_for_download":False
}
options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(service=s, options=options)
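
The same directory appears in three places in the setup above (two flags and the download preference), so it is easy for them to drift apart. One way to avoid that, sketched here with made-up helper names (chrome_arguments, chrome_prefs, build_options), is to derive all three from a single value. The flags themselves are copied from the code above.

```python
CHROME_HOME = "/tmp/chrome/chrome-user-data-dir"


def chrome_arguments(home=CHROME_HOME):
    """Flags from the post, with both directory flags built from one value."""
    return [
        "headless",
        "--disable-infobars",
        "--disable-dev-shm-usage",
        "--no-sandbox",
        "--remote-debugging-port=9222",
        f"--homedir={home}",
        f"--user-data-dir={home}",
    ]


def chrome_prefs(home=CHROME_HOME):
    """Download preferences pointing at the same directory as the flags."""
    return {
        "download.default_directory": home,
        "download.prompt_for_download": False,
    }


def build_options(home=CHROME_HOME):
    # Imported lazily so the helpers above work even where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.binary_location = "/tmp/chrome/latest/chrome-linux/chrome"
    for arg in chrome_arguments(home):
        options.add_argument(arg)
    options.add_experimental_option("prefs", chrome_prefs(home))
    return options
```

With this in place, the driver can be created as before with webdriver.Chrome(service=s, options=build_options()).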

Let's test webdriver. We will take the last posts from the databricks community and convert them to a dataframe.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get('https://community.databricks.com/s/discussions?page=1&filter=All')
date = [elem.text for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "lightning-formatted-date-time")))]
title = [elem.text for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p[class='Sub-heaading1']")))]
from pyspark.sql.types import StringType, StructType, StructField
 
schema = StructType([
    StructField("date", StringType()),
    StructField("title", StringType())
])
df = spark.createDataFrame(list(zip(date, title)), schema=schema)
display(df)
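
One caveat with the code above: zip pairs the two lists by position and silently stops at the shorter one, so if one selector matches fewer elements than the other, rows get dropped or misaligned without any error. A small guard, sketched here with a made-up helper name (pair_rows) and made-up sample values, makes that failure visible.

```python
def pair_rows(dates, titles):
    """Zip two scraped columns, refusing to pair them if their lengths differ."""
    if len(dates) != len(titles):
        raise ValueError(
            f"column length mismatch: {len(dates)} dates vs {len(titles)} titles"
        )
    return list(zip(dates, titles))


rows = pair_rows(["2022-05-01", "2022-05-02"], ["First post", "Second post"])
print(rows)
```

The result of pair_rows(date, title) can be passed to spark.createDataFrame in place of list(zip(date, title)).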

We can see the latest posts in our dataframe. Now we can quit the driver.

driver.quit()
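
If a cell fails before driver.quit() is reached, the Chromium process keeps running on the cluster driver. A possible way to guard against that is a small context manager that always quits. This is a sketch: managed_driver is a made-up name, and FakeDriver exists only to demonstrate the behavior without launching a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on the way out."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in used only to demonstrate the helper without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


try:
    with managed_driver(FakeDriver) as d:
        raise RuntimeError("scrape failed")
except RuntimeError:
    pass

print(d.closed)  # True: quit() ran even though the body raised
```

With the real driver, the factory would be something like lambda: webdriver.Chrome(service=s, options=options).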

A ready-to-run notebook version of this article is available at: https://github.com/hubert-dudek/databricks-hubert/blob/main/projects/selenium/chromedriver.py

To import that notebook into Databricks, go to the folder in your "workspace" -> from the arrow menu, select "import" -> select "URL" -> put https://raw.githubusercontent.com/hubert-dudek/databricks-hubert/main/projects/selenium/chromedriver... as the URL -> click "import".


10 REPLIES

swrd
New Contributor III

I followed your article but got this error message: [screenshot: selenium_not_working]

How do I resolve?

fishjhu
New Contributor II

I am getting the error below. @S W have you solved yours? [screenshot: Capture]

fishjhu
New Contributor II

swrd
New Contributor III

@Fisseha Berhane Thanks, this worked for me!

However I can't get the browser to open - that would be vital so I can extract the relevant web elements for the automation script to work:

[screenshot: selenium_not_working_v.3.0]

Any ideas on how to get that done?

swrd
New Contributor III

@Fisseha Berhane I managed to get past the error message by using the web-driver module - the next challenge is opening the browser using the "driver.get()" method...

Databricks executes the command "successfully" without opening the requested URL -

[screenshot: selenium_not_working_v.2.0]

Does anyone know how to get that to work?
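
Because the setup in the original post passes the headless flag, and the cluster driver has no display, no browser window will appear even when driver.get() succeeds. One way to see what the browser actually loaded is to dump the page source and a screenshot to files. The sketch below assumes a driver created as in the post. dump_page and the output directory are made-up names, while title, page_source, and save_screenshot are standard Selenium WebDriver members.

```python
from pathlib import Path


def dump_page(driver, out_dir="/tmp/chrome/debug"):
    """Write the current page's HTML and a screenshot to out_dir for inspection."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    html_path = out / "page.html"
    png_path = out / "page.png"
    html_path.write_text(driver.page_source, encoding="utf-8")
    driver.save_screenshot(str(png_path))
    return driver.title, html_path, png_path
```

Calling dump_page(driver) right after driver.get(...) returns the page title and the two file paths, which can then be opened or searched for the elements the automation script needs.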

datascientistms
New Contributor II

I followed these instructions in an AWS backed Databricks platform and can't get past this error every time I run the below code:

Partial Error:

Could not connect to security.ubuntu.com:80

Code:

%sh
/dbfs/databricks/scripts/selenium-install.sh

I have provided the full error at the bottom of this post. Is there anything that I am doing wrong? I looked at the Network ACLs and Security Groups defaulted in the AWS account, and it looks like I should have access to inbound/outbound HTTP (80) ports, but I am not an AWS expert. I added a new Security Group for outbound port 80 access to try to troubleshoot, but it didn't work and is probably redundant. I could use some help troubleshooting.

I tried running the below as the full error suggested and am getting similar error messages:

Suggested Code:

%sh
 sudo apt-get update

The full error can be found on my StackOverflow post (too long to post here).

I was able to get this fixed by working with our IT department. Port 80 is required by the apt-get calls in the %sh command, and our firewall configuration was blocking port 80 on that particular cloud platform.
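
For anyone hitting the same "Could not connect to security.ubuntu.com:80" message, a quick connectivity probe from a notebook cell can confirm whether outbound port 80 is blocked before involving the network team. This is a minimal sketch using the standard-library socket module. can_connect is a made-up helper name, and the host is the one from the error above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


print("security.ubuntu.com:80 reachable:", can_connect("security.ubuntu.com", 80))
```

A False result here points at the network path (firewall, security group, or DNS), not at the install script itself.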

I have a new issue though. When trying to run the first command after the pip install selenium command, I am getting this error.

WebDriverException: Message: unknown error: unable to discover open pages

@Hubert Dudek Any ideas?

aa_204
New Contributor II

@Hubert Dudek: I am trying to run the above script, but my chrome driver installation is failing intermittently. Can you please suggest a solution?

[screenshot]

Hubert-Dudek
Esteemed Contributor III

Hi, I will test it again on runtime 12 and also using @Henry Gray's discoveries in a few weeks.

@Hubert-Dudek  Hi, thanks for the detailed tutorial. With slight tweaks to the init script I was able to make Selenium work on single-node cluster. However, I haven't had much luck with shared clusters in DB Runtime 14.0. Btw, I'm using Volumes to store both chrome 114 debian package & chromebinary executable.

[screenshot: Haiyangl104_0-1696426137682.png]

See attached for the previous steps.
