Re: Read file from dbfs with pd.read_csv() using d...

Anonymous · ‎11-26-2021

Hi @Yuanyue Liu ,

The Spark engine is connected to the (remote) workers on Databricks. That is why you can read data from DBFS with:

spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/')

The same happens with dbutils. For example, you can list files in DBFS with:

dbutils.fs.ls(files_path)

Pandas does not connect directly to the remote filesystem (dbfs). That is the reason why you have to first read the remote data with spark and then transform to an in-memory dataframe (pandas).
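
A minimal sketch of that two-step pattern, assuming a `spark` session is already available and the data fits in the driver's memory. The helper name `read_dbfs_csv_as_pandas` is hypothetical, and the path in the example call is the same placeholder used above:

```python
def read_dbfs_csv_as_pandas(spark, path):
    """Read a CSV from DBFS with Spark, then collect it into pandas.

    `spark` is assumed to be an existing SparkSession and `path` a DBFS path.
    """
    spark_df = (
        spark.read.format("csv")
        .options(header="true")
        .load(path)
    )
    # toPandas() pulls every row to the driver, so this only suits data
    # that fits in driver memory.
    return spark_df.toPandas()


# Example call (needs a live Spark session, placeholder path):
# pdf = read_dbfs_csv_as_pandas(spark, "path/in/the/remote/dbfs/filesystem/")
```

For large files it is probably safer to filter or aggregate in Spark first and only convert the reduced result.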

I am using pandas profiling and after I make an HTML report, which is written to the local driver (since pandas_profiling does not connect to the remote filesystem either), I use dbutils to upload data to my mnt drive in dbfs (that comes from a datalake gen2).
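
As a rough sketch of that upload step: the helper name `upload_report` and both example paths are hypothetical, and `dbutils` is assumed to be the object Databricks provides in a notebook. The `file:` prefix marks the source as a path on the driver's local disk.

```python
def upload_report(dbutils, local_path, dbfs_path):
    """Copy a file from the driver's local disk to DBFS.

    `local_path` is assumed to be an absolute path on the driver.
    """
    # The "file:" scheme tells dbutils.fs.cp the source is local to the driver.
    dbutils.fs.cp(f"file:{local_path}", dbfs_path)
    return dbfs_path


# Example call (hypothetical paths, needs a Databricks notebook):
# upload_report(dbutils, "/tmp/report.html", "dbfs:/mnt/reports/report.html")
```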

I hope this helps!

Anonymous · ‎12-06-2021

@Arturo Amador - Would you be happy to mark your answer as best if the issue has been resolved by what you found? That will help others find your answer more quickly in the future.

Anonymous · ‎12-15-2021

Hi, @Piper Wilson ,

it is actually @hamzatazib96 that needs to mark the answer as best 🙂

Anonymous · ‎12-15-2021

WHOOPS! Thank you, @Arturo Amador!

@hamzatazib96 - If any of the answers solved the issue, would you be happy to mark it as best?

hamzatazib96 · ‎12-15-2021

Done! Thanks all for the answers and help!

The best way I found around this was to simply do an SCP transfer using the databricks exe from DBFS to an S3 bucket. The flow was:

DBFS -> EC2 Local -> S3 bucket

farazanwar · ‎04-29-2023

I am getting the same error. I have mounted an Azure Data Lake and can see the files, but writing the CSV file gives an error.

The strange thing is that it works at other times.

so16 · ‎07-19-2023

Please, I need your help. I still have the same issue after reading all your comments.
I am using Databricks Connect (version 13.1) in PyCharm and trying to load files that are on DBFS storage.

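from databricks.connect import DatabricksSession

# host, token and c_id hold the workspace URL, access token and cluster ID
spark = DatabricksSession.builder.remote(
    host=host, token=token, cluster_id=c_id).getOrCreate()
path = "dbfs:/mnt/storage/file.csv"
df = spark.read.format("csv").option("header", "true").load(path)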

This gives me the error:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "INVALID_STATE: Unsupported 12.2.x-scala2.12 0611-073104-1kjepouv on Databricks Runtime Version. (requestId=8c278ab3-348a-4fa1-9797-6d58d571eeff)"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"INVALID_STATE: Unsupported 12.2.x-scala2.12 0611-073104-1kjepouv on Databricks Runtime Version. (requestId=8c278ab3-348a-4fa1-9797-6d58d571eeff)", grpc_status:9, created_time:"2023-07-19T19:52:47.881727713+00:00"}"