Pandas finds parquet file, Spark does not
12-29-2023 12:02 PM
I am having an issue with Databricks (Community Edition) where I can use pandas to read a parquet file into a dataframe, but when I use Spark it reports that the file doesn't exist. I have tried reformatting the file path for Spark, but I can't seem to find a format that it will accept.
Any ideas?
Pandas:
import pandas as pd

parquet_file_path = "/dbfs/green_tripdata_2022-02.parquet"
df = pd.read_parquet(parquet_file_path, engine='pyarrow')
display(df)
Result: the dataframe displays as expected.
Spark:
parquet_file_path = "/dbfs/green_tripdata_2022-02.parquet"
df = spark.read.parquet(parquet_file_path)
df.show()
AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/dbfs/green_tripdata_2022-02.parquet.
- Labels: Spark
12-29-2023 03:02 PM
Can you check these three options? I don't remember which one will work and can't test it right now, but I am sure one or two of them will 🙂
parquet_file_path = "/green_tripdata_2022-02.parquet"
parquet_file_path = "green_tripdata_2022-02.parquet"
parquet_file_path = "dbfs:/green_tripdata_2022-02.parquet"
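For context on why the options above differ (not stated in the thread, but consistent with the error message): Spark resolves bare paths against the `dbfs:/` root, so a pandas-style path like `/dbfs/...` gets turned into `dbfs:/dbfs/...`, which doesn't exist. Local-file APIs such as pandas instead go through the `/dbfs/` FUSE mount. A minimal sketch of the translation (the helper name is hypothetical):

```python
def to_spark_path(path: str) -> str:
    """Convert a /dbfs/ FUSE-mount path into the dbfs:/ URI Spark expects.

    Paths that already use a scheme (or no /dbfs/ prefix) are returned as-is.
    """
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return path

print(to_spark_path("/dbfs/green_tripdata_2022-02.parquet"))
# dbfs:/green_tripdata_2022-02.parquet
```

With this mapping, the same file can be read from both APIs: pandas via the `/dbfs/...` mount path, and `spark.read.parquet` via the converted `dbfs:/...` URI.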
01-03-2024 02:30 PM
Are you getting any error messages? What happens when you run `ls /dbfs/`? Are you able to list all the parquet files?
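The listing check above can be done from a notebook cell; on a Databricks cluster you could call `os.listdir("/dbfs/")` on the FUSE mount and filter for parquet files. Since `/dbfs/` only exists on a cluster, the sketch below runs the same filter on a stand-in listing (the file names are illustrative):

```python
import os

def list_parquet(entries):
    """Return only the parquet files from a directory listing."""
    return [name for name in entries if name.endswith(".parquet")]

# On a Databricks cluster you would use: list_parquet(os.listdir("/dbfs/"))
# Stand-in listing for illustration:
sample = ["green_tripdata_2022-02.parquet", "notes.txt", "tmp"]
print(list_parquet(sample))
# ['green_tripdata_2022-02.parquet']
```

If the file shows up here but `spark.read.parquet` still fails, the problem is the path format handed to Spark rather than a missing file.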