Databricks Community

darthdickhead · ‎03-11-2023

I'm trying to set up a Workspace Library that is used internally within our organization. This is a Python package whose source lives in a private GitHub repository and is not available on PyPI or anywhere on the public internet.

I managed to clone the private GitHub repository by adding a GitHub developer token to my user settings, but when prompted to add a library, it appears I can upload it to S3 or DBFS as a `Wheel` or `Egg` file (with Egg files due to be deprecated soon). The Python package in question is updated regularly: a git pull + pip install is required at least once a day, and sometimes several times within 24 hours.

Was wondering if the only way to use this package within Databricks was to keep uploading newly generated Wheel files into DBFS or S3? Is there some way to quickly synchronize the repositories and install them?

Anonymous · ‎03-13-2023

@Eshwaran Venkat

You can use the Databricks CLI to automate the process of cloning the private GitHub repository and building/uploading the Python package to DBFS as a wheel file. You can schedule this process to run periodically, such as once a day, using a cron job or a similar scheduling mechanism.

1. Install and configure the Databricks CLI on your local machine or a separate server.
2. Create a Python script that clones the private GitHub repository, builds the Python package, and uploads it to DBFS as a wheel file. You can use the git command and the setuptools package to perform these tasks.
3. Add the script to a cron job or a similar scheduling mechanism to run it periodically, such as once a day.
4. In your Databricks notebooks, install the Python package from the uploaded wheel file in DBFS.

This approach allows you to synchronize the private GitHub repository and install the Python package in Databricks with minimal manual intervention.
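As a rough sketch of the script described in step 2, something like the following could work. The repository URL, working directory, and DBFS target folder are placeholders I made up, and it assumes the machine already has git credentials for the private repository and a configured Databricks CLI. It builds the wheel with `pip wheel` and uploads it with `databricks fs cp`; treat it as a starting point, not a tested deployment script.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical values: replace with your own repository and locations.
REPO_URL = "git@github.com:your-org/your-package.git"
WORK_DIR = Path("/tmp/your-package")
DBFS_DIR = "dbfs:/FileStore/wheels"


def sync_commands(repo_url, work_dir):
    """Return the git command(s) to clone the repo, or pull if already cloned."""
    if (work_dir / ".git").exists():
        return [["git", "-C", str(work_dir), "pull", "--ff-only"]]
    return [["git", "clone", repo_url, str(work_dir)]]


def build_command(work_dir):
    """Return the command that builds a wheel into <work_dir>/dist."""
    return [
        sys.executable, "-m", "pip", "wheel",
        "--no-deps", "--wheel-dir", str(work_dir / "dist"), str(work_dir),
    ]


def upload_command(wheel_path, dbfs_dir):
    """Return the Databricks CLI command that copies the wheel to DBFS."""
    return [
        "databricks", "fs", "cp", "--overwrite",
        str(wheel_path), f"{dbfs_dir}/{wheel_path.name}",
    ]


def main():
    """Clone or pull, build the wheel, and upload the newest wheel to DBFS."""
    for cmd in sync_commands(REPO_URL, WORK_DIR):
        subprocess.run(cmd, check=True)
    subprocess.run(build_command(WORK_DIR), check=True)
    # Pick the most recently built wheel in dist/.
    wheels = sorted((WORK_DIR / "dist").glob("*.whl"), key=lambda p: p.stat().st_mtime)
    subprocess.run(upload_command(wheels[-1], DBFS_DIR), check=True)
```

Calling `main()` from a file that cron runs would cover step 3. For step 4, a notebook cell along the lines of `%pip install /dbfs/FileStore/wheels/<wheel-file-name>.whl` should pick up the uploaded wheel, assuming the same placeholder folder as above.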

darthdickhead · ‎03-30-2023

Curious then, what would be the best way to iterate on and test functions from a Python file? Imagine you have a Python module with a few functions that you need to import and use within a Databricks notebook that has a pipeline running. As you're running the notebook and getting results, you want to go back to these functions in the external module, edit them, and retry running certain cells. A scheduled update wouldn't work too well for that. I'm wondering what the best practice is for using an external module like this, given the back-and-forth editing between the notebook and the functions in the module.

darthdickhead · ‎03-30-2023

Note that this is only for development. In production, or when the notebook is running as a scheduled job, the module and its functions can be considered frozen.

Anonymous · ‎04-01-2023

@Eshwaran Venkat: Here are some more suggestions.

One approach to iterating and testing functions from a Python file in Databricks is to use a development workflow that includes version control and automated testing.

1. Import the necessary functions from the module into the notebook.
2. Write code in the notebook that calls those functions and produces results that you can inspect and evaluate.
3. Modify the functions in the module as needed, and save the changes.
4. Run the cells in the notebook that use the modified functions to test them and verify that they behave as expected.
5. If necessary, repeat steps 3 and 4 until you're satisfied with the behavior of the functions.

By using this iterative process, you can quickly modify and test functions in the external module without disrupting the pipeline running in your notebook. Once you're confident in the behavior of the functions, you can freeze the module and functions for production use.
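One practical detail for the edit-and-rerun loop: Python caches imported modules, so re-running a cell after saving the file will usually still call the old function unless the module is reloaded. Here is a small self-contained sketch using the standard library's `importlib.reload`. The module name `pipeline_utils` and the function `transform` are invented for the demo, and the file is written to a temporary directory only so the example runs anywhere.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc caching so the demo always recompiles from the edited source.
sys.dont_write_bytecode = True

# Create a throwaway module on disk (hypothetical name: pipeline_utils).
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "pipeline_utils.py"
module_path.write_text("def transform(x):\n    return x + 1\n")

# Make the directory importable and import the module.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import pipeline_utils

first_result = pipeline_utils.transform(1)

# Simulate editing the function in the module file and saving it.
module_path.write_text("def transform(x):\n    return x * 100\n")

# Reload so the notebook session picks up the edited code.
importlib.reload(pipeline_utils)
second_result = pipeline_utils.transform(1)
```

Note that names bound with `from pipeline_utils import transform` are not refreshed by `reload`; calling through the module (`pipeline_utils.transform`) or re-running the import after the reload avoids that. In IPython-style notebooks, the `%load_ext autoreload` and `%autoreload 2` magics aim to automate the same thing, though I have not verified how they behave on every Databricks runtime.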

Additionally, you may want to consider using version control, such as Git, to keep track of changes to the external module and to collaborate with others who may be modifying the functions. This can help ensure that changes are tracked and that everyone is working with the same code.

Best way to install and manage a private Python package that has a continuously updating Wheel
