03-08-2025 01:46 PM - edited 03-08-2025 01:48 PM
This issue is very bizarre and was very cumbersome to deal with, but I think I found a solution I can live with.
For me, using two venvs just complicates the project in a way I am not willing to maintain.
Spark Connect, although promising, falls short of Databricks Connect in a lot of areas. To name a few:
- the session needs to be rolled every hour because the token expires,
- some cluster lifecycle states are just not handled by regular Spark Connect. For example, when the cluster is warming up, Databricks Connect waits for the cluster to be ready, while Spark Connect raises an exception. One CAN map the different scenarios and build handling logic, but why would one?
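For a sense of what that handling logic would involve, here is a rough sketch. It is not tied to any real Spark Connect API: `ClusterNotReadyError`, `connect_with_retry` and `RollingSession` are made-up names, and `connect` stands for whatever callable builds your session. It only illustrates the two chores listed above, retrying while the cluster warms up and rebuilding the session before the token goes stale.

```python
import time
from typing import Any, Callable, Optional


class ClusterNotReadyError(Exception):
    """Hypothetical stand-in for whatever the client raises while the cluster warms up."""


def connect_with_retry(connect: Callable[[], Any], attempts: int = 5, delay_s: float = 1.0) -> Any:
    """Call `connect` until it succeeds or `attempts` tries have been used."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ClusterNotReadyError:
            if attempt == attempts:
                raise  # out of retries, surface the original error
            time.sleep(delay_s)


class RollingSession:
    """Hand out a session, rebuilding it once it is older than `max_age_s`."""

    def __init__(
        self,
        connect: Callable[[], Any],
        max_age_s: float = 3000.0,  # assumed value: a bit under an hour
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._max_age_s = max_age_s
        self._clock = clock
        self._session: Optional[Any] = None
        self._created_at = 0.0

    def get(self) -> Any:
        now = self._clock()
        if self._session is None or now - self._created_at >= self._max_age_s:
            self._session = connect_with_retry(self._connect)
            self._created_at = now
        return self._session
```

Even this small version needs its own tests and tuning, which is the maintenance cost being weighed here.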
As I need this specifically for unit testing, and I use pytest, I downloaded the original pyspark to a directory in the root folder of my project with:
pip install --target pyspark_unpatched pyspark==X.Y.Z
I created this conftest.py in the root of the project:
import sys
from pathlib import Path

def override_databricks_spark() -> None:
    repo_root = Path(__file__).parent.resolve()
    unpatched_pyspark_dir = str(repo_root / "pyspark_unpatched")
    # Put the unpatched copy first on sys.path so the import below resolves to it.
    sys.path.insert(0, unpatched_pyspark_dir)
    import pyspark as unpatched_pyspark
    sys.path.remove(unpatched_pyspark_dir)
    # Pin the module so later `import pyspark` statements reuse it.
    sys.modules["pyspark"] = unpatched_pyspark
    print(f"Overridden pyspark with module from {unpatched_pyspark_dir}")

override_databricks_spark()
Because pytest loads conftest.py before it imports the test modules, this makes sure the unpatched version of pyspark is imported first, letting you test in peace.
The main drawback is that you must use the `pyspark` import statement; importing from Databricks will break the tests (not that it matters, at least I think it doesn't). And obviously you have to download pyspark, but considering the alternatives, I am willing to bite that bullet.
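If you want to see the mechanism without pyspark involved, the sketch below reproduces it with two throwaway packages. Everything in it is made up for illustration: `fakespark` stands in for pyspark, and `import_from_dir` and `make_fake_package` are hypothetical helpers. Unlike the conftest.py version, it also drops cached entries from `sys.modules` first, which should only matter if something has already imported the package before the override runs.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def import_from_dir(name: str, directory: Path) -> ModuleType:
    """Import top-level package `name` from `directory`, ignoring any cached copy."""
    # Drop cached entries so the import system searches sys.path again.
    for cached in [m for m in sys.modules if m == name or m.startswith(name + ".")]:
        del sys.modules[cached]
    importlib.invalidate_caches()
    sys.path.insert(0, str(directory))
    try:
        module = importlib.import_module(name)
    finally:
        sys.path.remove(str(directory))
    return module


def make_fake_package(root: Path, name: str, origin: str) -> None:
    """Write a tiny stand-in package (for this demo only)."""
    pkg = root / name
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(f"ORIGIN = {origin!r}\n")


with tempfile.TemporaryDirectory() as tmp:
    patched_dir = Path(tmp) / "site_packages"
    unpatched_dir = Path(tmp) / "pyspark_unpatched"
    make_fake_package(patched_dir, "fakespark", "patched")
    make_fake_package(unpatched_dir, "fakespark", "unpatched")

    # Simulate the environment's default copy being imported first.
    sys.path.insert(0, str(patched_dir))
    import fakespark
    first_origin = fakespark.ORIGIN

    # Swap in the copy from the other directory.
    swapped = import_from_dir("fakespark", unpatched_dir)
    second_origin = swapped.ORIGIN
    sys.path.remove(str(patched_dir))

    # A later plain import is served from sys.modules, so it gets the swapped module.
    import fakespark as again
    same_module = again is swapped
```

The last step is the part the tests depend on: once the swapped module sits in `sys.modules`, every later `import` of that name in the same process reuses it.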
This was my reference from Stack Overflow