<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks... in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23097#M15911</link>
    <description>&lt;P&gt;Hi, I will test it again on runtime 12, also using @Henry Gray's&amp;nbsp;discoveries, in a few weeks.&lt;/P&gt;</description>
    <pubDate>Thu, 22 Dec 2022 20:56:14 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-12-22T20:56:14Z</dc:date>
    <item>
      <title>Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23088#M15902</link>
      <description>&lt;P&gt;&lt;B&gt;Selenium chrome driver on databricks driver&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;On the Databricks community, I see repeated problems with installing Selenium on the Databricks driver. Running Selenium on Databricks may seem surprising, but sometimes we need to grab datasets that sit behind complex authentication, and Selenium is the most accessible tool for that. Of course, always check the simpler alternatives first. For example, if we only need to download an HTML file, we can use SparkContext.addFile() or just the requests library. If we need to parse HTML without simulating user actions or loading complicated pages, we can use BeautifulSoup. Please remember that Selenium runs on the driver only (workers are not utilized), so for the Selenium part a single-node cluster is the preferred setting.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Installation&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The easiest solution is to use apt-get to install Ubuntu packages, but the version in the Ubuntu repo is often outdated. Recently that solution stopped working for me, so I decided to take a different approach and get the driver and binaries from chromium-browser-snapshots (&lt;A href="https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html" alt="https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html" target="_blank"&gt;https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html&lt;/A&gt;). The script below downloads the newest version of the browser binaries and the driver. Everything is saved to the /tmp/chrome directory. We must also set the Chrome home directory to /tmp/chrome/chrome-user-data-dir. Sometimes Chromium complains about missing libraries, which is why we also install libgbm-dev.
The code below creates a bash file implementing these steps.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
dbutils.fs.put("/databricks/scripts/selenium-install.sh","""#!/bin/bash
# Downloads the newest Chromium snapshot and its matching chromedriver into /tmp/chrome
LAST_VERSION="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2FLAST_CHANGE?alt=media"
VERSION=$(curl -s -S "$LAST_VERSION")
# skip the download when this snapshot has already been unpacked
if [ -d "/tmp/chrome/$VERSION" ] ; then
  echo "version already installed"
  exit
fi

rm -rf /tmp/chrome/$VERSION
mkdir -p /tmp/chrome/$VERSION

# browser binaries
URL="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$VERSION%2Fchrome-linux.zip?alt=media"
ZIP="${VERSION}-chrome-linux.zip"

curl -# "$URL" &amp;gt; /tmp/chrome/$ZIP
unzip /tmp/chrome/$ZIP -d /tmp/chrome/$VERSION

# chromedriver built from the same snapshot
URL="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$VERSION%2Fchromedriver_linux64.zip?alt=media"
ZIP="${VERSION}-chromedriver_linux64.zip"

curl -# "$URL" &amp;gt; /tmp/chrome/$ZIP
unzip /tmp/chrome/$ZIP -d /tmp/chrome/$VERSION

mkdir -p /tmp/chrome/chrome-user-data-dir

# point /tmp/chrome/latest at the version just installed
rm -f /tmp/chrome/latest
ln -s /tmp/chrome/$VERSION /tmp/chrome/latest

# to avoid errors about missing libraries
sudo apt-get update
sudo apt-get install -y libgbm-dev
""", True)
display(dbutils.fs.ls("dbfs:/databricks/scripts/"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The script is saved to DBFS storage as /dbfs/databricks/scripts/selenium-install.sh. We can set it as an init script for the cluster. Click your cluster in "compute" -&amp;gt; click "Edit" -&amp;gt; "configuration" tab -&amp;gt; scroll down to "Advanced options" -&amp;gt; click "Init Scripts" -&amp;gt; select "DBFS" and set "Init script path" as "/dbfs/databricks/scripts/selenium-install.sh" -&amp;gt; click "add".&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="init"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1230i453C27A6A7FC3CF6/image-size/large?v=v2&amp;amp;px=999" role="button" title="init" alt="init" /&gt;&lt;/span&gt;If you haven't set the init script, please run the command below.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh
/dbfs/databricks/scripts/selenium-install.sh&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Now we can install Selenium. Click your cluster in "compute" -&amp;gt; click "Libraries" -&amp;gt; click "Install new" -&amp;gt; click "PyPI" -&amp;gt; set "Package" as "selenium" -&amp;gt; click "install".&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="install_library"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1243i1245B0B6DB162FCC/image-size/large?v=v2&amp;amp;px=999" role="button" title="install_library" alt="install_library" /&gt;&lt;/span&gt;Alternatively (though less convenient), you can install it in each notebook by running the command below.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%pip install selenium&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Now let's start the webdriver. Service and binary_location point to the driver and the browser binaries that our script downloaded and unpacked.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from selenium import webdriver
from selenium.webdriver.chrome.service import Service
s = Service('/tmp/chrome/latest/chromedriver_linux64/chromedriver')
options = webdriver.ChromeOptions()
options.binary_location = "/tmp/chrome/latest/chrome-linux/chrome"
options.add_argument('headless')
options.add_argument('--disable-infobars')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--homedir=/tmp/chrome/chrome-user-data-dir')
options.add_argument('--user-data-dir=/tmp/chrome/chrome-user-data-dir')
prefs = {"download.default_directory":"/tmp/chrome/chrome-user-data-dir",
         "download.prompt_for_download":False
}
options.add_experimental_option("prefs",prefs)
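# Optional sanity check, not part of the original article: report early if the
# install script has not unpacked the driver and browser on this cluster.
# The helper name missing_paths is hypothetical; the two paths are the ones used above.
import os

def missing_paths(paths):
    """Return the entries of paths that do not exist on the driver node."""
    return [p for p in paths if not os.path.exists(p)]

missing = missing_paths([
    "/tmp/chrome/latest/chromedriver_linux64/chromedriver",
    "/tmp/chrome/latest/chrome-linux/chrome",
])
if missing:
    print("Not found, run selenium-install.sh first:", missing)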
driver = webdriver.Chrome(service=s, options=options)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Let's test webdriver. We will take the last posts from the databricks community and convert them to a dataframe.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# open the community discussions page
driver.get('https://community.databricks.com/s/discussions?page=1&amp;amp;filter=All')
date = [elem.text for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "lightning-formatted-date-time")))]
title = [elem.text for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p[class='Sub-heaading1']")))]&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.types import StringType, StructType, StructField
schema = StructType([
    StructField("date", StringType()),
    StructField("title", StringType())
])
df = spark.createDataFrame(list(zip(date, title)), schema=schema)
display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="results"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1220i87A432A7E072B396/image-size/large?v=v2&amp;amp;px=999" role="button" title="results" alt="results" /&gt;&lt;/span&gt;We can see the latest posts in our dataframe. Now we can quit the driver.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;driver.quit()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;A ready-to-run notebook version of this article is available at: &lt;A href="https://github.com/hubert-dudek/databricks-hubert/blob/main/projects/selenium/chromedriver.py" alt="https://github.com/hubert-dudek/databricks-hubert/blob/main/projects/selenium/chromedriver.py" target="_blank"&gt;https://github.com/hubert-dudek/databricks-hubert/blob/main/projects/selenium/chromedriver.py&lt;/A&gt;&lt;/P&gt;&lt;P&gt;To import that notebook into Databricks, go to the folder in your "workspace" -&amp;gt; from the arrow menu, select "URL" -&amp;gt; click "import" -&amp;gt; put &lt;A href="https://raw.githubusercontent.com/hubert-dudek/databricks-hubert/main/projects/selenium/chromedriver.py" alt="https://raw.githubusercontent.com/hubert-dudek/databricks-hubert/main/projects/selenium/chromedriver.py" target="_blank"&gt;https://raw.githubusercontent.com/hubert-dudek/databricks-hubert/main/projects/selenium/chromedriver.py&lt;/A&gt;&amp;nbsp;as URL.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="import"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1224iC06B88AAA6DA6AB0/image-size/large?v=v2&amp;amp;px=999" role="button" title="import" alt="import" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Nov 2022 14:12:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23088#M15902</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-11-09T14:12:33Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23089#M15903</link>
      <description>&lt;P&gt;I followed your article but got this error message:&lt;span class="lia-inline-image-display-wrapper" image-alt="selenium_not_working"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1223iB8AD8FA263634A99/image-size/large?v=v2&amp;amp;px=999" role="button" title="selenium_not_working" alt="selenium_not_working" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How do I resolve this?&lt;/P&gt;</description>
      <pubDate>Fri, 11 Nov 2022 02:29:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23089#M15903</guid>
      <dc:creator>swrd</dc:creator>
      <dc:date>2022-11-11T02:29:55Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23090#M15904</link>
      <description>&lt;P&gt;I am getting the error below. @S W​&amp;nbsp; have you solved yours?&lt;span class="lia-inline-image-display-wrapper" image-alt="Capture"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1214iBFC51A8D3D515A21/image-size/large?v=v2&amp;amp;px=999" role="button" title="Capture" alt="Capture" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 14 Nov 2022 20:20:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23090#M15904</guid>
      <dc:creator>fishjhu</dc:creator>
      <dc:date>2022-11-14T20:20:43Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23091#M15905</link>
      <description>&lt;P&gt;Gray's script from the link below worked for me.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/s/question/0D58Y00009PlBaaSAF/errors-using-seleniumchromedriver-in-databricks" target="test_blank"&gt;https://community.databricks.com/s/question/0D58Y00009PlBaaSAF/errors-using-seleniumchromedriver-in-databricks&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 02:06:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23091#M15905</guid>
      <dc:creator>fishjhu</dc:creator>
      <dc:date>2022-11-15T02:06:41Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23092#M15906</link>
      <description>&lt;P&gt;@Fisseha Berhane​ Thanks, this worked for me! &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However I can't get the browser to open - that would be vital so I can extract the relevant web elements for the automation script to work:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="selenium_not_working_v.3.0"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1239i89AB02BEAB1F2AD2/image-size/large?v=v2&amp;amp;px=999" role="button" title="selenium_not_working_v.3.0" alt="selenium_not_working_v.3.0" /&gt;&lt;/span&gt;Any ideas on how to get that done?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2022 12:51:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23092#M15906</guid>
      <dc:creator>swrd</dc:creator>
      <dc:date>2022-11-18T12:51:45Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23093#M15907</link>
      <description>&lt;P&gt;@Fisseha Berhane&amp;nbsp;I managed to get past the error message by using the web-driver module. The next challenge is opening the browser using the "driver.get()" method...&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Databricks executes the command "successfully" &lt;B&gt;without &lt;/B&gt;opening the requested URL:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="selenium_not_working_v.2.0"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1222iA23AE464DDB1C08D/image-size/large?v=v2&amp;amp;px=999" role="button" title="selenium_not_working_v.2.0" alt="selenium_not_working_v.2.0" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does anyone know how to get that to work?&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2022 13:14:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23093#M15907</guid>
      <dc:creator>swrd</dc:creator>
      <dc:date>2022-11-18T13:14:06Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23096#M15910</link>
      <description>&lt;P&gt;@Hubert Dudek&amp;nbsp;: I am trying to run the above script, but my Chrome driver installation is failing intermittently. Can you please suggest a solution?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1227iAB6AB806A24F1CC3/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 15 Dec 2022 11:56:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23096#M15910</guid>
      <dc:creator>aa_204</dc:creator>
      <dc:date>2022-12-15T11:56:45Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems regarding the selenium installation on the databricks...</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23097#M15911</link>
      <description>&lt;P&gt;Hi, I will test it again on runtime 12, also using @Henry Gray's&amp;nbsp;discoveries, in a few weeks.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Dec 2022 20:56:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/23097#M15911</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-12-22T20:56:14Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/48147#M28254</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/25346"&gt;@Hubert-Dudek&lt;/a&gt;&amp;nbsp; Hi, thanks for the detailed tutorial. With slight tweaks to the init script I was able to make Selenium work on a single-node cluster. However, I haven't had much luck with shared clusters in DB Runtime 14.0. Btw, I'm using Volumes to store both the chrome 114 debian package &amp;amp; the chromebinary executable.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Haiyangl104_0-1696426137682.png" style="width: 728px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/4235i27697992C27A8159/image-dimensions/728x191/is-moderation-mode/true?v=v2" width="728" height="191" role="button" title="Haiyangl104_0-1696426137682.png" alt="Haiyangl104_0-1696426137682.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;See attached for the previous steps.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Oct 2023 13:39:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/48147#M28254</guid>
      <dc:creator>Haiyangl104</dc:creator>
      <dc:date>2023-10-04T13:39:16Z</dc:date>
    </item>
    <item>
      <title>Re: Selenium chrome driver on databricks driver On the databricks community, I see repeated problems</title>
      <link>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/91562#M38201</link>
      <description>&lt;P&gt;Hi Hubert-Dudek,&lt;/P&gt;&lt;P&gt;Are there any updates to your article?&amp;nbsp; I have been struggling to get Databricks to recognise a Seleniumbase driver. I think it might actually be a permissions problem, as the error is:&lt;BR /&gt;&lt;SPAN class=""&gt;WebDriverException: &lt;/SPAN&gt;&lt;SPAN&gt;Message: Can not connect to the Service /local_disk0/.ephemeral_nfs/envs/pythonEnv-0000-xxx.../lib/python3.11/site-packages/seleniumbase/drivers/uc_driver&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 24 Sep 2024 11:19:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/selenium-chrome-driver-on-databricks-driver-on-the-databricks/m-p/91562#M38201</guid>
      <dc:creator>iSinnerman</dc:creator>
      <dc:date>2024-09-24T11:19:37Z</dc:date>
    </item>
  </channel>
</rss>

