03-11-2023 08:29 PM
I'm trying to set up a Workspace Library that is used internally within our organization. It's a Python package whose source lives in a private GitHub repository and is not available on PyPI or the wider internet.
I managed to clone the private GitHub repository by adding a GitHub developer token to my user settings, but when prompted to add a library, it appears I can only upload it to S3 or DBFS as a `Wheel` or `Egg` file (with `Egg` files soon to be deprecated). The package in question is updated regularly: a git pull + pip install is required at least once a day, and sometimes multiple times in 24 hours.
I was wondering whether the only way to use this package within Databricks is to keep uploading newly generated wheel files to DBFS or S3. Is there a way to quickly synchronize the repository and install the package?
03-13-2023 12:35 AM
@Eshwaran Venkat
You can use the Databricks CLI together with a small script to automate cloning the private GitHub repository, building the Python package as a wheel, and uploading it to DBFS. You can schedule this process to run periodically, such as once a day, using a cron job or a similar scheduling mechanism.
This approach allows you to synchronize the private GitHub repository and install the Python package in Databricks with minimal manual intervention.
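For illustration, a minimal sketch of such a script (the repository URL, token variable, and DBFS target path are placeholders, not values from this thread):

```bash
#!/usr/bin/env bash
# Sketch: clone the private repo, build a wheel, and push it to DBFS.
# GITHUB_TOKEN, the repo URL, and the DBFS path are illustrative placeholders.
set -euo pipefail

WORKDIR=$(mktemp -d)
git clone "https://${GITHUB_TOKEN}@github.com/your-org/your-package.git" "$WORKDIR"
cd "$WORKDIR"

# Build a wheel (requires the 'build' package: pip install build)
python -m build --wheel

# Upload the wheel with the Databricks CLI (assumes the CLI is already configured)
databricks fs cp dist/*.whl dbfs:/FileStore/wheels/ --overwrite
```

A cron entry such as `0 6 * * * /path/to/sync_wheel.sh` would then refresh the wheel in DBFS once a day.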
03-30-2023 12:24 AM
Curious, then: what would be the best way to iterate on and test functions from a Python file? Imagine you have a Python module with a few functions that you need to import and use within a Databricks notebook that has a pipeline running. As you run the notebook and get results, you want to go back to these functions in the external module, edit them, and re-run certain cells. A scheduled update wouldn't work well here, so I'm wondering what the best practice is for using an external module like this, which implies a back-and-forth edit process between the notebook and the functions in the module.
03-30-2023 12:25 AM
Note that this is only for development; in production, when the notebook runs as a scheduled job, the module and its functions can be considered frozen.
04-01-2023 09:32 PM
@Eshwaran Venkat : Here are some more suggestions.
One approach to iterating on and testing functions from a Python file in Databricks is a development workflow built around version control and automated testing: keep the module in a Git repository synced to your workspace (for example, via Databricks Repos), import it into the notebook, edit the functions in the repo, and reload the module to re-run the affected cells.
By using this iterative process, you can quickly modify and test functions in the external module without disrupting the pipeline running in your notebook. Once you're confident in the behavior of the functions, you can freeze the module and functions for production use.
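For the fast edit-and-re-run loop described above, one common pattern is to reload the module in the notebook instead of rebuilding a wheel. A minimal sketch, assuming the module is checked out under Databricks Repos and using placeholder names (`my_module`, the repo path, `some_function`):

```python
# In a notebook cell: make the repo checkout importable.
import sys
sys.path.append("/Workspace/Repos/<user>/<repo>")  # illustrative path

import importlib
import my_module  # placeholder for your package's module

# After editing my_module in the repo, re-run this cell to pick up the
# changes without restarting the cluster or reinstalling anything.
importlib.reload(my_module)

result = my_module.some_function()  # hypothetical function under test
```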
Additionally, keeping the external module under version control such as Git helps you track changes and collaborate with others who may be modifying the functions, ensuring that everyone is working with the same code.
03-18-2023 08:33 AM
Hi @Eshwaran Venkat , we haven't heard from you since the last response from @Suteja Kanuri , and I was checking back to see if her suggestions helped you.
Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.
Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.