08-18-2021 09:11 AM
Hello all,
As described in the title, here's my problem:
1. I'm using databricks-connect to send jobs to a Databricks cluster.
2. The "local" environment is an AWS EC2 instance.
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_csv(). The reason is that it's too big for spark.read.csv() followed by .toPandas() (it crashes every time).
4. When I run
pd.read_csv("/dbfs/FileStore/some_file")
I get a FileNotFoundError, because the path resolves against the local filesystem of the EC2 instance rather than DBFS. Is there a way to do what I want (e.g. change where pandas looks for files with some option)?
Thanks a lot in advance!
11-24-2021 07:49 AM
Hi @Kaniz Fatma,

No, I still haven't found a solution, and I still can't read from DBFS (not with pandas.read_csv).

I meant to say that the setup tests pass, so the issue is not in the setup.
11-24-2021 07:54 AM
Hi @Arturo Amador, could you please test once more after changing your DBR version to one below 7?
11-24-2021 07:58 AM
Hi @Kaniz Fatma,

I will try that and report back!
11-24-2021 07:59 AM
Thanks!
11-24-2021 08:25 AM
Hi @Kaniz Fatma,
I can confirm that, after downgrading to DBR 6.4 and passing all the tests in
databricks-connect test
I am still getting the FileNotFoundError when trying to use
pd.read_csv('/dbfs/mnt/datalake_gen2_data/some.csv')
11-25-2021 12:18 AM
Hi,

After some research, I have found out that the pandas API reads only local files. This means that even if a read_csv command works in the Databricks notebook environment, it will not work through databricks-connect (in the notebook, pandas reads from the driver's local filesystem, where DBFS is mounted).

A workaround is to use the PySpark spark.read.format('csv') API to read the remote files and append .toPandas() at the end so that we get a pandas DataFrame:
df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()
11-25-2021 10:09 AM
Hi Arturooa,
It seems we have reached a similar conclusion. Just a quick question: what do you mean by 'local files'? I've uploaded my files into DBFS; are they not local files after that?
Thanks
11-26-2021 03:40 AM
Hi @Yuanyue Liu,
The Spark engine is connected to the (remote) workers on Databricks; this is why you can read data from DBFS by using:
spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/')
The same happens with dbutils. For example, you can list files in DBFS with:
dbutils.fs.ls(files_path)
Pandas does not connect to the remote filesystem (DBFS). That is why you have to first read the remote data with Spark and then convert it to an in-memory (pandas) DataFrame.
I am using pandas-profiling; after I generate an HTML report, it is written to the local driver (since pandas-profiling does not connect to the remote filesystem either), and I then use dbutils to upload it to my mnt drive in DBFS (which is backed by a Data Lake Gen2).
I hope this helps!
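To illustrate the point about pandas resolving paths locally, here is a minimal sketch (the DBFS-style path is hypothetical): on a client machine where /dbfs is not mounted, pandas never reaches out to the remote filesystem and simply raises FileNotFoundError.

```python
import pandas as pd

# On a client machine (e.g. the EC2 box) /dbfs is not mounted, so pandas
# treats this as an ordinary local path and fails to find it.
try:
    pd.read_csv("/dbfs/FileStore/hypothetical.csv")  # hypothetical path
except FileNotFoundError:
    print("pandas resolved the path against the local filesystem")
```

Inside a Databricks notebook the same call can succeed, but only because DBFS is FUSE-mounted at /dbfs on the driver, so it really is a local path there.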
12-06-2021 09:10 AM
@Arturo Amador - Would you be happy to mark your answer as best if the issue has been resolved by what you found? That will help others find your answer more quickly in the future.
12-15-2021 02:14 AM
Hi @Piper Wilson,
it is actually @hamzatazib96 that needs to mark the answer as best :)
12-15-2021 10:47 AM
WHOOPS! Thank you, @Arturo Amador!
@hamzatazib96 - If any of the answers solved the issue, would you be happy to mark it as best?
12-15-2021 11:42 AM
Done! Thanks all for the answers and help!
The best way I found around this was to simply copy the file using the databricks CLI executable from DBFS to the EC2 machine, and then on to an S3 bucket. The flow was:
DBFS -> EC2 local -> S3 bucket
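That flow might look something like the sketch below, assuming the Databricks CLI and AWS CLI are both installed and configured on the EC2 instance (the file name and bucket name are placeholders, not from the thread):

```shell
# Step 1: DBFS -> EC2 local disk (legacy Databricks CLI 'fs cp' command)
databricks fs cp dbfs:/FileStore/some_file ./some_file

# Step 2: EC2 local disk -> S3 bucket (hypothetical bucket name)
aws s3 cp ./some_file s3://my-example-bucket/some_file
```

After that, pandas on the EC2 instance (or anywhere with S3 access) can read the file as a genuinely local or S3 path.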
07-19-2023 01:13 PM
Please help, I am still getting the same issue after reading all your comments.
I am using databricks-connect (version 13.1) in PyCharm and trying to load files that are on DBFS storage:
spark = DatabricksSession.builder.remote(
    host=host, token=token, cluster_id=c_id).getOrCreate()
path = "dbfs:/mnt/storage/file.csv"
df = spark.read.format("csv").option("header", "true").load(path)
This gives me the error:
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "INVALID_STATE: Unsupported 12.2.x-scala2.12 0611-073104-1kjepouv on Databricks Runtime Version. (requestId=8c278ab3-348a-4fa1-9797-6d58d571eeff)"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"INVALID_STATE: Unsupported 12.2.x-scala2.12 0611-073104-1kjepouv on Databricks Runtime Version. (requestId=8c278ab3-348a-4fa1-9797-6d58d571eeff)", grpc_status:9, created_time:"2023-07-19T19:52:47.881727713+00:00"}"
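The FAILED_PRECONDITION message above mentions the cluster's runtime, 12.2.x-scala2.12, which does not match the databricks-connect 13.1 client; Databricks Connect requires the cluster's Databricks Runtime to be at least as new as the client's major.minor version. A hedged sketch of how to check and reconcile the two (the exact cluster change is done in the Databricks UI, not on the command line):

```shell
# Check which client version is installed locally:
pip show databricks-connect

# The client here is 13.1, but the error shows the cluster runs DBR 12.2.x.
# One fix is to attach to (or upgrade to) a cluster running DBR 13.1 or
# later via the Databricks compute UI, so the versions are compatible.
```

The DatabricksSession API used in the snippet above only exists in Databricks Connect 13.x and later, so matching by upgrading the cluster (rather than downgrading the client) keeps that code working.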