Data Engineering

Installing linux packages on cluster

TX-Aggie-00
New Contributor III

Hey everyone!  We need to use LibreOffice in one of our automated tasks via a notebook.  I have tried to install it via an init script that I attach to the cluster, but sometimes the program gets installed and sometimes it doesn't.  For obvious reasons, I need to guarantee that these tasks run successfully.  Here is my init script, which resides in my workspace as "install_libreoffice.sh":

#!/bin/bash
echo "----------------INIT SCRIPT---------------"
apt-get update
echo "----------------Installing libreoffice---------------"
apt-get install -y libreoffice
echo "----------------Installing python3-uno---------------"
apt-get install -y python3-uno
echo "----------------Installing poppler-utils---------------"
apt-get install -y poppler-utils
echo "----------INIT SCRIPT COMPLETE------------"

When it is successfully installed I can run the following cell and it returns expected results:

import subprocess
import os

result = subprocess.run(["libreoffice", "--version"], capture_output=True, text=True)
print(result.stdout)

It will work, and then I will terminate the cluster and start it again and it won't work.  I have viewed the init_script logs and everything looks good, but the program will not be installed and %sh ps will not show the process running.

I have tried to install it in a %sh cell, but that doesn't work either.  What is the best way to get this consistently installed on a cluster?
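One way to make a missing install surface immediately, rather than partway through a task, is to check for the binary before using it. The following is a minimal sketch that is not part of the original post: the helper names `require_binary` and `binary_version` are hypothetical, and only the Python standard library is used.

```python
import shutil
import subprocess


def require_binary(name: str) -> str:
    """Return the full path of `name` on PATH, or raise if it is missing."""
    path = shutil.which(name)
    if path is None:
        # Hypothetical error message; adjust to however failures are reported.
        raise RuntimeError(f"{name!r} was not found on PATH; the install may have failed")
    return path


def binary_version(name: str) -> str:
    """Run `<name> --version` and return its output, failing loudly on errors."""
    path = require_binary(name)
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Calling `print(binary_version("libreoffice"))` in the first cell of the notebook would then stop the run with a clear error when the program is absent, instead of printing an empty result.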

Thanks,
Scott

6 REPLIES

TX-Aggie-00
New Contributor III

Forgot to mention a few items of interest:

  • The cluster is a single node, so I would think installing via "%sh" would be sufficient
  • DRV - 12.2 LTS
  • Node - Standard D8s_vs (Azure)

TX-Aggie-00
New Contributor III

I think I determined the issue, I'm just not sure how best to fix it.  It seems the apt-get repository doesn't always work.  I noticed that when the notebook fails, the init_scripts logs show a lot of 404 errors when downloading the packages.  When the notebook is successful, there are no errors and I can see the packages get installed.

I updated the runtime to 15.4 LTS and that seems to be working consistently for now, but I am a bit nervous that this issue will pop up again.
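Since the failures described here are intermittent download errors, one option is to retry the install command a few times and report clearly when every attempt fails. The sketch below is an illustration only: `run_with_retries` is a hypothetical helper, and the thread does not establish whether retrying actually resolves these particular 404 errors.

```python
import subprocess
import time


def run_with_retries(cmd: list[str], attempts: int = 3, delay_seconds: float = 5.0) -> bool:
    """Run `cmd` up to `attempts` times; return True on the first zero exit code."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed with exit code {result.returncode}")
        if attempt < attempts:
            # Wait before the next try; the delay is an arbitrary assumption.
            time.sleep(delay_seconds)
    return False
```

A caller could run `run_with_retries(["apt-get", "install", "-y", "libreoffice"])` and raise an exception when it returns `False`, so that a failed install is not mistaken for a successful one.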

Alberto_Umana
Databricks Employee

Hello @TX-Aggie-00,

To ensure that LibreOffice is consistently installed on your Databricks cluster without relying on internet access (which can fail sometimes), you can manually download the necessary packages and store them in a Unity Catalog volume or a workspace location. Here’s a step-by-step guide:

  1. Download the Packages:
    • On a local machine, download the .deb packages for LibreOffice, python3-uno, and poppler-utils from a reliable source such as the official repositories or a trusted mirror.
  2. Upload the Packages to Unity Catalog or Workspace:
    • Upload the downloaded .deb files to a Unity Catalog volume or a workspace location (DBFS). You can use the Databricks UI or the Databricks CLI to upload these files. For example, you can use the following CLI command to upload to a Unity Catalog volume:

      databricks fs cp local_path_to_deb_file /Volumes/your_catalog/your_schema/your_volume/
  3. Modify the Init Script:
    • Update your init script to install the packages from the local volume instead of downloading them from the internet. Here’s an example of how your init script might look:

      #!/bin/bash
      echo "----------------INIT SCRIPT---------------"
      echo "----------------Installing libreoffice---------------"
      dpkg -i /dbfs/Volumes/your_catalog/your_schema/your_volume/libreoffice.deb
      echo "----------------Installing python3-uno---------------"
      dpkg -i /dbfs/Volumes/your_catalog/your_schema/your_volume/python3-uno.deb
      echo "----------------Installing poppler-utils---------------"
      dpkg -i /dbfs/Volumes/your_catalog/your_schema/your_volume/poppler-utils.deb
      echo "----------INIT SCRIPT COMPLETE------------"
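The download may produce more .deb files than the three named above, for example when dependencies are fetched as well. A single `dpkg -i` call accepts several files, so the command can be built from whatever is in the folder. This is a rough sketch under that assumption; `build_dpkg_command` is a hypothetical helper and the directory passed to it is a placeholder.

```python
from pathlib import Path


def build_dpkg_command(deb_dir: str) -> list[str]:
    """Return a `dpkg -i` command covering every .deb file in `deb_dir`."""
    deb_files = sorted(str(p) for p in Path(deb_dir).glob("*.deb"))
    if not deb_files:
        # Fail early instead of running dpkg with nothing to install.
        raise FileNotFoundError(f"no .deb files found in {deb_dir}")
    return ["dpkg", "-i", *deb_files]
```

The result can be passed to `subprocess.run(..., check=True)`, which raises if `dpkg` exits with a non-zero status.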

TX-Aggie-00
New Contributor III

Thanks Alberto!  There were 42 deb files, so I just changed my script to:

sudo dpkg -i /dbfs/Volumes/your_catalog/your_schema/your_volume/*.deb

The init_script log shows that it unpacks everything, sets the packages up, and processes the triggers, but the package is not actually installed and I cannot find where it would have been installed to.  I do like this alternative if I can figure out how to get it to work.

Thanks,
Scott

I followed your command and it worked; the only problem is that it runs under the `libreoffice24.8` command and not `libreoffice`. I ran `which libreoffice24.8` and then created a link: `sudo ln -s /usr/local/bin/libreoffice24.8 /usr/local/bin/libreoffice`, and it is working now when I use `libreoffice`.
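For anyone who cannot find the installed program, the same lookup and link can be sketched in Python. This is an illustration, not the poster's exact method: `find_versioned_binary` and `link_plain_name` are hypothetical helpers, and the directories to search are an assumption (the reply above found the binary in /usr/local/bin).

```python
import os
from pathlib import Path


def find_versioned_binary(prefix: str, search_dirs: list[str]) -> list[str]:
    """Return executables in `search_dirs` whose file name starts with `prefix`."""
    matches = []
    for directory in search_dirs:
        if not Path(directory).is_dir():
            continue  # skip PATH entries that do not exist
        for candidate in sorted(Path(directory).glob(f"{prefix}*")):
            if candidate.is_file() and os.access(candidate, os.X_OK):
                matches.append(str(candidate))
    return matches


def link_plain_name(target: str, link_path: str) -> None:
    """Create `link_path` as a symlink to `target` unless it already exists."""
    if not os.path.lexists(link_path):
        os.symlink(target, link_path)
```

For example, `find_versioned_binary("libreoffice", os.environ["PATH"].split(os.pathsep))` lists candidates such as a versioned `libreoffice24.8`, and `link_plain_name` can then expose one of them under the plain `libreoffice` name, given write permission to the link's directory.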

Thanks for posting your solution!  Hopefully it helps someone else with the same issue.
