Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do I read the contents of a hidden file in a Spark job?

Lincoln_Bergeso
New Contributor II

I'm trying to read a file from a Google Cloud Storage bucket. The filename starts with a period, so Spark assumes the file is hidden and won't let me read it.

My code is similar to this:

from pyspark.sql import SparkSession
 
spark = SparkSession.builder.getOrCreate()
df = spark.read.format("text").load("gs://<bucket>/.myfile", wholetext=True)
df.show()

The resulting DataFrame is empty (as in, it has no rows).

When I run this on my laptop, I get the following warning:

22/02/15 16:40:58 WARN DataSource: All paths were ignored:
  gs://<bucket>/.myfile

I've noticed that this applies to files starting with an underscore as well.

How can I get around this?

1 ACCEPTED SOLUTION

Accepted Solutions

Dan_Z
Honored Contributor

I don't think there is an easy way to do this. You would also break very basic functionality (like being able to read Delta tables) if you were able to get around these constraints. I suggest you run a rename job first and then read the renamed file.
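A rename (or copy) step along these lines might look like the sketch below. It is only an illustration under stated assumptions: `visible_path` and `copy_then_read` are hypothetical helper names, the `visible-` prefix is arbitrary, and the copy goes through the Hadoop FileSystem API via PySpark's internal `_jvm` and `_jsc` handles, which are not public API and may change between versions. It also assumes the hidden-file filter applies only when Spark lists input paths, not to direct FileSystem calls.

```python
import posixpath


def visible_path(path: str, prefix: str = "visible-") -> str:
    """Return a sibling path whose file name no longer starts with '.' or '_'."""
    parent, name = posixpath.split(path)
    # Strip leading dots/underscores, then add a prefix so the name is never empty.
    return posixpath.join(parent, prefix + name.lstrip("._"))


def copy_then_read(spark, src: str):
    """Copy a hidden file to a visible name, then read the copy as one text row.

    Assumption: `spark` is an active SparkSession whose Hadoop configuration
    can already access the file system behind `src` (for example, GCS).
    """
    dst = visible_path(src)
    jvm = spark.sparkContext._jvm                           # internal py4j gateway
    conf = spark.sparkContext._jsc.hadoopConfiguration()    # internal handle
    src_path = jvm.org.apache.hadoop.fs.Path(src)
    dst_path = jvm.org.apache.hadoop.fs.Path(dst)
    fs = src_path.getFileSystem(conf)
    # deleteSource=False keeps the original; pass True to get rename-like behavior.
    jvm.org.apache.hadoop.fs.FileUtil.copy(fs, src_path, fs, dst_path, False, conf)
    return spark.read.format("text").load(dst, wholetext=True)
```

With an existing session this would be called as `copy_then_read(spark, "gs://<bucket>/.myfile")`. Copying instead of renaming leaves the original object untouched, at the cost of a second copy in the bucket.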

10 REPLIES

-werners-
Esteemed Contributor III

Spark uses the Hadoop Input API to read files, which ignores every file that starts with an underscore or a period.

I did not find a solution for this as the hiddenFileFilter is always active.
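To make the rule concrete, here is a small pure-Python model of the behavior described above. It is not Hadoop's code: the function names are made up for illustration, and the way a user-supplied filter is combined with the hidden-file filter reflects my understanding of Hadoop's FileInputFormat, where both filters must accept a path.

```python
from typing import Callable, Optional


def hidden_file_filter(name: str) -> bool:
    """Return True if the file name is accepted, i.e. it is not 'hidden'."""
    return not (name.startswith("_") or name.startswith("."))


def accepts(name: str, user_filter: Optional[Callable[[str], bool]] = None) -> bool:
    """Model of combined filtering: the hidden-file check always runs first,
    so a permissive user filter cannot re-admit a hidden file."""
    if not hidden_file_filter(name):
        return False
    return user_filter is None or user_filter(name)


for name in [".myfile", "_SUCCESS", "part-00000"]:
    # Even with a user filter that accepts everything, hidden names are rejected.
    print(name, accepts(name, user_filter=lambda n: True))
```

In this model a custom filter can only narrow the set of accepted files, which matches the observation that the hidden-file filter stays active.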

Is there any way to work around this?

Hi @Lincoln Bergeson, Spark uses Hadoop APIs to read data from HDFS. Hadoop input formats have a path filter that filters out files whose names start with "_" or ".". Try setting a custom filter with FileInputFormat.setInputPathFilter in your configuration and then use newAPIHadoopFile to create the RDD.

Anonymous
Not applicable

Hi there, @Lincoln Bergeson​! My name is Piper, and I'm a moderator for Databricks. Thank you for your question and welcome to the community. We'll give your peers a chance to respond and then we'll circle back if we need to.

Thanks in advance for your patience. 🙂

Looking forward to the answers. From my research, this looks like something that needs a special configuration or workaround, which I'm hoping Databricks can provide.

Atanu
Esteemed Contributor

@Lincoln Bergeson GCS object names are very liberal. Only \r and \n are invalid; everything else is valid, including the NUL character. I am still not sure if this can help you. We would really need to hack this on the Spark side!

Hi @Lincoln Bergeson​ ,

Just a friendly follow-up. Did any of the previous responses help you resolve your issue? Please let us know if you still need help.

Hi @Jose Gonzalez​ , none of these answers helped me, unfortunately. I'm still hoping to find a good solution to this issue.

Kaniz_Fatma
Community Manager

Hi @Lincoln Bergeson​ , Did @Dan Zafar​ 's response help you solve your problem?
