Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best way to install and manage a private Python package that has a continuously updating Wheel

darthdickhead
New Contributor III

I'm trying to set up a Workspace Library that is used internally within our organization. It is a Python package whose source lives in a private GitHub repository and is not available on PyPI or the public internet.

I managed to clone the private GitHub repository by adding the GitHub developer token to my user settings, but when prompted to add a library, it appears I can only upload it to S3 or DBFS as a `Wheel` or `Egg` file (with Egg files soon to be deprecated). The Python package in question is updated regularly: a git pull + pip install is required at least once a day, and sometimes happens multiple times in 24 hours.

Is the only way to use this package within Databricks to keep uploading newly generated Wheel files to DBFS or S3? Is there some way to quickly synchronize the repository and install the package?

4 REPLIES

Anonymous
Not applicable

@Eshwaran Venkat

You can use the Databricks CLI to automate the process of cloning the private GitHub repository and building/uploading the Python package to DBFS as a wheel file. You can schedule this process to run periodically, such as once a day, using a cron job or a similar scheduling mechanism.

  1. Install and configure the Databricks CLI on your local machine or a separate server.
  2. Create a Python script that clones the private GitHub repository, builds the Python package, and uploads it to DBFS as a wheel file. You can use the git command and the setuptools package to perform these tasks.
  3. Add the script to a cron job or a similar scheduling mechanism to run it periodically, such as once a day.
  4. In your Databricks notebooks, install the Python package from the uploaded wheel file in DBFS.

This approach allows you to synchronize the private GitHub repository and install the Python package in Databricks with minimal manual intervention.
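As a rough illustration of steps 1–3, a sync script along these lines could be run from cron. This is only a sketch: the repository URL, package name, and DBFS target path are placeholders, and it assumes git, the `build` package, and the Databricks CLI are installed, with the CLI already configured against your workspace (for example via `databricks configure --token`).

```python
#!/usr/bin/env python3
"""Clone the private repo, build a wheel, and upload it to DBFS.

Sketch only: REPO_URL and DBFS_DIR are placeholders to replace with your own,
and the Databricks CLI must already be configured for your workspace.
"""
import glob
import subprocess
import tempfile

REPO_URL = "git@github.com:your-org/your-package.git"  # placeholder
DBFS_DIR = "dbfs:/FileStore/wheels/your-package/"       # placeholder


def run(*cmd, cwd=None):
    subprocess.run(cmd, cwd=cwd, check=True)


with tempfile.TemporaryDirectory() as workdir:
    # 1. Clone the latest source from the private repository.
    run("git", "clone", "--depth", "1", REPO_URL, workdir)

    # 2. Build a wheel from the cloned source.
    run("python", "-m", "build", "--wheel", workdir)

    # 3. Upload the freshly built wheel to DBFS, overwriting any old copy.
    wheel = glob.glob(f"{workdir}/dist/*.whl")[0]
    run("databricks", "fs", "cp", "--overwrite", wheel, DBFS_DIR)

# 4. In a notebook, install it with something like:
#    %pip install /dbfs/FileStore/wheels/your-package/<wheel-file>.whl
```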

Curious, then: what would be the best way to iterate on and test functions from a Python file? Imagine you have a Python module with a few functions that you need to import and use within a Databricks notebook that has a pipeline running. As you run the notebook and inspect results, you want to go back to those functions in the external module, edit them, and re-run certain cells. A scheduled update wouldn't work well here, so what is the best practice for using an external module like this, where development implies a back-and-forth edit cycle between the notebook and the functions in the module?

Note that this is only for development; in production, or when the notebook runs as a scheduled job, the module and its functions can be considered frozen.

Anonymous
Not applicable

@Eshwaran Venkat: Here are some more suggestions.

One approach to iterating on and testing functions from a Python file in Databricks is to use a development workflow that includes version control and automated testing.

  1. Import the necessary functions from the module into the notebook.
  2. Write code in the notebook that calls those functions and produces results that you can inspect and evaluate.
  3. Modify the functions in the module as needed, and save the changes.
  4. Run the cells in the notebook that use the modified functions to test them and verify that they behave as expected.
  5. If necessary, repeat steps 3 and 4 until you're satisfied with the behavior of the functions.

By using this iterative process, you can quickly modify and test functions in the external module without disrupting the pipeline running in your notebook. Once you're confident in the behavior of the functions, you can freeze the module and functions for production use.
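One way to make step 4 of this loop practical (a sketch, assuming the module lives in a Databricks Repo or another path visible to the notebook) is to put the module's folder on `sys.path` and reload it after each edit, either with `importlib.reload` or IPython's autoreload extension:

```python
# In a notebook cell -- a sketch assuming the module is checked out in a
# Databricks Repo at the path below (adjust to your own repo/module layout).
import sys
sys.path.append("/Workspace/Repos/your-user/your-package/src")  # placeholder path

import importlib
import my_module  # placeholder module name

# ... edit my_module.py in the Repo, then re-run this cell to pick up changes:
importlib.reload(my_module)
result = my_module.some_function()  # placeholder function

# Alternatively, IPython's autoreload extension re-imports edited modules
# automatically before each cell runs:
#   %load_ext autoreload
#   %autoreload 2
```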

Additionally, you may want to consider using version control, such as Git, to keep track of changes to the external module and to collaborate with others who may be modifying the functions. This can help ensure that changes are tracked and that everyone is working with the same code.
