solanoam
New Contributor II

This issue is very bizarre and was very cumbersome to deal with, but I think I found a solution I can live with.

For me, using two venvs just complicates the project in a way I am not willing to maintain.
Spark Connect, although promising, is lacking in a lot of areas compared to Databricks Connect. To name a few:

  1. The session needs to be rolled (recreated) every hour because the token expires.
  2. Some cluster lifecycle states are just not recognized by plain Spark Connect. For example, when the cluster is warming up, Databricks Connect will wait for the cluster to be ready, while Spark Connect will raise an exception. One CAN map the different scenarios and build handling logic, but why would you?
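For anyone who does want to go down that road, the "handling logic" could start as a plain retry loop. This is only a rough sketch, not something I run: `connect_with_retry` and its parameters are made-up names, and `connect` stands for whatever zero-argument function builds your Spark Connect session. It also does not tell a warming-up cluster apart from a real failure unless you narrow `retry_on` yourself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    delay_seconds: float = 30.0,
    retry_on: tuple = (Exception,),
) -> T:
    """Call `connect` until it succeeds or `attempts` is exhausted (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            # Assumption: a fixed wait is enough for the cluster to finish starting
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")
```

The same wrapper could be called again whenever the hourly token expiry kills the session, but at that point you are rebuilding what Databricks Connect already does.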

As I need this specifically for unit testing, and I use pytest, I installed the original pyspark into a directory in my project root with:

 

pip install --target pyspark_unpatched pyspark==X.Y.Z

 

I created this conftest.py in the root of the project:

 

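import sys
from pathlib import Path


def override_databricks_spark() -> None:
    # Directory created by `pip install --target pyspark_unpatched ...`
    repo_root = Path(__file__).parent.resolve()
    unpatched_pyspark_dir = str(repo_root / "pyspark_unpatched")
    # Put the unpatched copy first on sys.path so this import resolves to it
    sys.path.insert(0, unpatched_pyspark_dir)
    import pyspark as unpatched_pyspark
    sys.path.remove(unpatched_pyspark_dir)
    # Later `import pyspark` statements now get the unpatched module
    sys.modules["pyspark"] = unpatched_pyspark
    print(f"Overridden pyspark with module from {unpatched_pyspark_dir}")


override_databricks_spark()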

 

Because conftest.py loads first, it lets you make sure that the unpatched version of pyspark loads first, letting you test peacefully.
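One thing to watch out for: if something has already imported pyspark before conftest.py runs (a pytest plugin, for example), a plain `import pyspark` just returns the cached module and the override silently does nothing. Below is a slightly more defensive sketch of the same idea. `import_from_dir` is a name I made up, and it assumes the target is a regular package with an `__init__.py`.

```python
import importlib
import sys
from pathlib import Path
from types import ModuleType


def import_from_dir(name: str, directory: Path) -> ModuleType:
    """Import top-level package `name` from `directory`, ignoring any cached copy."""
    # Drop the package and its submodules from the import cache so the
    # import below really searches sys.path again.
    for cached in list(sys.modules):
        if cached == name or cached.startswith(name + "."):
            del sys.modules[cached]
    importlib.invalidate_caches()

    sys.path.insert(0, str(directory))
    try:
        # import_module registers the result in sys.modules[name] itself
        module = importlib.import_module(name)
    finally:
        sys.path.remove(str(directory))

    # Fail loudly if the import resolved somewhere else.
    loaded_from = Path(module.__file__).resolve()
    if directory.resolve() not in loaded_from.parents:
        raise ImportError(f"{name} was loaded from {loaded_from}, not {directory}")
    return module
```

In conftest.py this would be called as `import_from_dir("pyspark", Path(__file__).parent / "pyspark_unpatched")`, so a wrong install location shows up as an ImportError instead of confusing test failures later.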

The main drawback is that you must use the `pyspark` import statement; importing through Databricks will break the tests (not that it matters, at least I think it doesn't). And, obviously, you have to download pyspark, but considering the alternatives, I am willing to bite that bullet.

This was my reference from Stack Overflow