Best way to install and manage a private Python package that has a continuously updating Wheel

darthdickhead
New Contributor III

I'm trying to set up a Workspace Library that is used internally within our organization. It's a Python package whose source lives in a private GitHub repository, so it isn't available on PyPI or the wider internet.

I managed to clone the private GitHub repository by adding a GitHub developer token to my user settings, but when prompted to add a library, it appears I can only upload it to S3 or DBFS as a `Wheel` or `Egg` file (with Egg files soon to be deprecated). The package is updated regularly: a `git pull` + `pip install` is required at least once a day, and sometimes multiple times in 24 hours.

Is the only way to use this package within Databricks to keep uploading newly generated Wheel files to DBFS or S3? Is there some way to quickly synchronize the repository and install from it?

5 REPLIES

Anonymous
Not applicable

@Eshwaran Venkat

You can use the Databricks CLI to automate the process of cloning the private GitHub repository and building/uploading the Python package to DBFS as a wheel file. You can schedule this process to run periodically, such as once a day, using a cron job or a similar scheduling mechanism.

  1. Install and configure the Databricks CLI on your local machine or a separate server.
  2. Create a Python script that clones the private GitHub repository, builds the Python package, and uploads it to DBFS as a wheel file. You can use the `git` command and the `setuptools` package to perform these tasks (see the sketch below).
  3. Add the script to a cron job or a similar scheduling mechanism to run it periodically, such as once a day.
  4. In your Databricks notebooks, install the Python package from the uploaded wheel file in DBFS.

This approach allows you to synchronize the private GitHub repository and install the Python package in Databricks with minimal manual intervention.
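A minimal sketch of such a script, assuming the legacy Databricks CLI (`databricks fs cp`) is installed and configured and the package builds with setuptools; the repository URL and paths below are placeholders for your own values:

```python
# Sync script sketch: clone the private repo, build a wheel, push it to DBFS.
import glob
import os
import subprocess

REPO_URL = "https://<token>@github.com/your-org/your-package.git"  # hypothetical
CLONE_DIR = "/tmp/your-package"
DBFS_TARGET = "dbfs:/FileStore/wheels/"  # hypothetical target folder

# 1. Clone the private repository (a shallow clone is enough for a build).
subprocess.run(["git", "clone", "--depth", "1", REPO_URL, CLONE_DIR], check=True)

# 2. Build the wheel with setuptools; it lands in CLONE_DIR/dist/.
subprocess.run(["python", "setup.py", "bdist_wheel"], cwd=CLONE_DIR, check=True)

# 3. Upload the freshest wheel to DBFS via the Databricks CLI.
wheel = max(glob.glob(f"{CLONE_DIR}/dist/*.whl"), key=os.path.getmtime)
subprocess.run(["databricks", "fs", "cp", "--overwrite", wheel, DBFS_TARGET], check=True)
```

Step 4 in the notebook then becomes a one-liner such as `%pip install /dbfs/FileStore/wheels/<wheel-name>.whl`, since DBFS is mounted at `/dbfs/` on the driver.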

Curious, then: what would be the best way to iterate on and test functions from a Python file? Imagine you have a Python module with a few functions that you need to import and use within a Databricks notebook that has a pipeline running. As you run the notebook and inspect results, you want to go back to those functions in the external module, edit them, and re-run certain cells. A scheduled update wouldn't work well here, so what is the best practice for using an external module like this, which implies a back-and-forth edit cycle between the notebook and the module's functions?

Note that this is only for development; in production, or when the notebook runs as a scheduled job, the module and its functions can be considered frozen.

Anonymous
Not applicable

@Eshwaran Venkat: Providing some more suggestions.

One approach to iterating on and testing functions from a Python file in Databricks is to use a development workflow built around a quick edit-and-test loop, backed by version control.

  1. Import the necessary functions from the module into the notebook.
  2. Write code in the notebook that calls those functions and produces results that you can inspect and evaluate.
  3. Modify the functions in the module as needed, and save the changes.
  4. Run the cells in the notebook that use the modified functions to test them and verify that they behave as expected. Note that Python caches imported modules, so you'll need to reload the module for your edits to take effect (see the sketch after this list).
  5. If necessary, repeat steps 3 and 4 until you're satisfied with the behavior of the functions.

By using this iterative process, you can quickly modify and test functions in the external module without disrupting the pipeline running in your notebook. Once you're confident in the behavior of the functions, you can freeze the module and functions for production use.
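A minimal sketch of that edit-and-reload loop, assuming the module (the hypothetical `my_module` below, e.g. checked out into a Databricks Repo) is on the import path:

```python
import importlib
import sys

# If the module lives in a Databricks Repo, put its folder on the import path
# first; the path below is a placeholder for your own repo location.
sys.path.append("/Workspace/Repos/<user>/<repo>")

import my_module  # hypothetical module under active development

print(my_module.clean_text("  hello  "))  # step 2: call it and inspect the result

# ...edit clean_text() in my_module.py and save (step 3), then re-run this cell:
importlib.reload(my_module)               # step 4: pick up the edited code
print(my_module.clean_text("  hello  "))
```

On recent Databricks Runtime versions, the IPython autoreload extension (`%load_ext autoreload` followed by `%autoreload 2`) can make the reload step automatic, though availability depends on your runtime version.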

Additionally, you may want to consider using version control, such as Git, to keep track of changes to the external module and to collaborate with others who may be modifying the functions. This can help ensure that changes are tracked and that everyone is working with the same code.

Kaniz
Community Manager

Hi @Eshwaran Venkat, we haven't heard from you since the last response from @Suteja Kanuri, and I was checking back to see if her suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
