Databricks

naveen_marthala · ‎05-01-2022

I have an MLflow server running with `--serve-artifacts` and with Postgres as the `--backend-store-uri`. The machine running the server (a container with base image python:3.9-bullseye) has git installed and available on the PATH.

I am logging from Jupyter notebooks, which also run in containers (with base image python:3.9-slim-bullseye) that don't have git installed.

When I try to auto-log from client like this:

import mlflow
import numpy as np
from sklearn.linear_model import LinearRegression

# enable autologging for scikit-learn
mlflow.sklearn.autolog()
 
# prepare training data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
 
# train a model
model = LinearRegression()
model.fit(X, y)
run_id = mlflow.last_active_run().info.run_id
print("Logged data and model in run {}".format(run_id))

I get a warning that git is not installed, along with some more warnings and errors:

2022/05/01 14:21:41 WARNING mlflow.tracking.context.git_context: Failed to import Git (the Git executable is probably not on your PATH), so Git SHA is not available. Error: Failed to initialize: Bad git executable.
The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()
 
All git commands will error until this is rectified.
 
This initial warning can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - quiet|q|silence|s|none|n|0: for no warning or exception
    - warn|w|warning|1: for a printed warning
    - error|e|raise|r|2: for a raised exception
 
Example:
    export GIT_PYTHON_REFRESH=quiet
 
2022/05/01 14:21:41 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'e914209e05d449e6af817d0d692b1012', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2022/05/01 14:22:45 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during sklearn autologging: API request to http://host.docker.internal:5000/api/2.0/mlflow-artifacts/artifacts/1/e914209e05d449e6af817d0d692b10... failed with exception HTTPConnectionPool(host='host.docker.internal', port=5000): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/1/e914209e05d449e6af817d0d692b1012/artifacts/model/model.pkl (Caused by ResponseError('too many 500 error responses'))
Logged data and model in run e914209e05d449e6af817d0d692b1012

I couldn't figure out why clients need to have git installed. I have been under the assumption that clients only need to be able to send HTTP requests to the server and don't need anything else installed. What am I missing, and how can I avoid that warning, not by hiding it, but by actually fixing what's causing it?
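
For reference, the GitPython message in the log above names two ways out: make a git executable available on the client's PATH, or set `GIT_PYTHON_REFRESH` to a quiet value. A minimal sketch of the second option follows. The helper name `configure_git_python` and its `search_path` parameter are hypothetical, introduced only for illustration, and the sketch assumes the variable must be set before `mlflow` (and therefore GitPython) is imported for the first time in the process.

```python
import os
import shutil


def configure_git_python(environ=os.environ, search_path=None):
    """Hypothetical helper: report whether git is on the PATH and, if it is
    not, ask GitPython to skip its startup warning.

    Returns True when a git executable was found, False otherwise.
    """
    # shutil.which returns None when no matching executable is found.
    if shutil.which("git", path=search_path) is not None:
        return True
    # "quiet" is one of the values listed in the GitPython message above.
    # setdefault keeps any value the user has already chosen.
    environ.setdefault("GIT_PYTHON_REFRESH", "quiet")
    return False


# Call this before the first `import mlflow` in the notebook.
has_git = configure_git_python()
print("git available on PATH:", has_git)
```

This only addresses the git warning. The later failure in the log (the artifact upload ending in "too many 500 error responses") is reported by the server side and does not mention git.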

Hubert-Dudek · ‎05-01-2022

When it is part of the MLflow Project, it requires git.

naveen_marthala · ‎05-01-2022

@Hubert Dudek, I still haven't made anything a project in the MLflow sense. So would I need git irrespective of what I am trying to do?

Kaniz · ‎05-12-2022

Hi @Naveen Marthala, this is indeed an MLflow project, and it necessarily requires git.

Kaniz · ‎05-18-2022

Hi @Naveen Marthala, just a friendly follow-up. Do you still need help, or did the above responses help you find the solution? Please let us know.

why does the client need to have git installed for auto-logging to an mlflow server running in "--serve-artifacts" mode?