Pandas finds parquet file, Spark does not
12-29-2023 12:02 PM
I am having an issue with Databricks (Community Edition) where I can use pandas to read a parquet file into a dataframe, but when I use Spark it reports that the file doesn't exist. I have tried reformatting the file path for Spark, but I can't seem to find a format that it will accept.
Any ideas?
Pandas:
import pandas as pd

parquet_file_path = "/dbfs/green_tripdata_2022-02.parquet"
df = pd.read_parquet(parquet_file_path, engine='pyarrow')
display(df)
Result: the dataframe displays as expected.
Spark:
parquet_file_path = "/dbfs/green_tripdata_2022-02.parquet"
df = spark.read.parquet(parquet_file_path)
df.show()
AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/dbfs/green_tripdata_2022-02.parquet.
- Labels: Spark
12-29-2023 03:02 PM
Can you check these three options? I don't remember which one will work and can't test it right now, but I am sure one or two of them will 🙂
parquet_file_path = "/green_tripdata_2022-02.parquet"
parquet_file_path = "green_tripdata_2022-02.parquet"
parquet_file_path = "dbfs:/green_tripdata_2022-02.parquet"
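For context on why the options above differ (not stated in the thread, but consistent with the error message): Spark resolves bare paths against the `dbfs:/` root, so a pandas-style path like `/dbfs/...` gets turned into `dbfs:/dbfs/...`, which doesn't exist. Local-file APIs such as pandas instead go through the `/dbfs/` FUSE mount. A minimal sketch of the translation (the helper name is hypothetical):

```python
def to_spark_path(path: str) -> str:
    """Convert a /dbfs/ FUSE-mount path into the dbfs:/ URI Spark expects.

    Paths that already use a scheme (or no /dbfs/ prefix) are returned as-is.
    """
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return path

print(to_spark_path("/dbfs/green_tripdata_2022-02.parquet"))
# dbfs:/green_tripdata_2022-02.parquet
```

With this mapping, the same file can be read from both APIs: pandas via the `/dbfs/...` mount path, and `spark.read.parquet` via the converted `dbfs:/...` URI.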
01-03-2024 02:30 PM
Are you getting any error messages? What happens when you run `ls /dbfs/`? Are you able to list all the parquet files?
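The listing check above can be done from a notebook cell; on a Databricks cluster you could call `os.listdir("/dbfs/")` on the FUSE mount and filter for parquet files. Since `/dbfs/` only exists on a cluster, the sketch below runs the same filter on a stand-in listing (the file names are illustrative):

```python
import os

def list_parquet(entries):
    """Return only the parquet files from a directory listing."""
    return [name for name in entries if name.endswith(".parquet")]

# On a Databricks cluster you would use: list_parquet(os.listdir("/dbfs/"))
# Stand-in listing for illustration:
sample = ["green_tripdata_2022-02.parquet", "notes.txt", "tmp"]
print(list_parquet(sample))
# ['green_tripdata_2022-02.parquet']
```

If the file shows up here but `spark.read.parquet` still fails, the problem is the path format handed to Spark rather than a missing file.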