Hello guys,
I'm trying to migrate a Python project from pandas to the pandas API on Spark, on Azure Databricks, using MLflow in a conda environment.
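For context, the change is essentially swapping the pandas import for the Spark one. A minimal sketch of what I mean (the file path is just illustrative):

import pyspark.pandas as ps   # was: import pandas as pd

df = ps.read_csv("/tmp/example.csv")   # was: pd.read_csv(...)
print(df.head())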
The problem is that I'm getting the following error:
Traceback (most recent call last):
  File "/databricks/mlflow/projects/x/data_validation.py", line 13, in <module>
    import pyspark.pandas as ps
ModuleNotFoundError: No module named 'pyspark.pandas'
Isn't the package supposed to be part of Spark already? We're using clusters on Databricks Runtime 10.4 LTS, which I understand ships with Apache Spark 3.2.1, and from what I've seen the pandas API on Spark has been included in PySpark since 3.2.
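For reference, this is the sanity check I'd run in a notebook attached to the same cluster to confirm the version (a quick sketch, nothing project-specific):

import pyspark
print(pyspark.__version__)      # should print 3.2.1 on runtime 10.4 LTS

import pyspark.pandas as ps     # the same import that fails in the project run
print(ps)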
I also tried installing it through the config file I use to create the conda environment, but that didn't work either.
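In case it helps with diagnosing this, I can add a quick check at the top of data_validation.py to see which interpreter, and which PySpark if any, the MLflow run actually picks up; a diagnostic sketch, not part of the real script:

import sys
print(sys.executable)   # which Python the MLflow project run is using

try:
    import pyspark
    print(pyspark.__version__, pyspark.__file__)   # which PySpark this env sees
except ImportError as exc:
    print("pyspark is not importable in this environment:", exc)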