Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

What is `read_files`?

ChristianRRL
Valued Contributor III

Bit of a silly question, but wondering if someone can help me better understand what is `read_files`?

There are at least three ways to pull raw JSON data into a Spark DataFrame:

  • df = spark.read...
  • df = spark.readStream... (i.e. AutoLoader)
  • select * from read_files(...)

I'm curious: is read_files a Databricks SQL-specific function, or is it native to Spark? In particular, I'm curious about the `schemaHints` functionality that both Auto Loader and read_files support, but spark.read seemingly does not (as far as I can tell).

1 ACCEPTED SOLUTION

Accepted Solutions

Isi
Honored Contributor III

Hello @ChristianRRL ,

No, read_files is not a native Spark function. It's a Databricks SQL function that allows you to read files easily using SQL syntax.

The main advantage is that it adds several Databricks-specific capabilities on top of Spark's basic file reader, such as schema inference, schema hints, rescued data handling, and partition discovery.

For example:

SELECT * FROM read_files(
  's3://my-bucket/path/',
  format => 'json',
  schemaHints => 'user_id STRING, event_time TIMESTAMP'
);

is roughly equivalent to:

spark.read.format("json").load("s3://my-bucket/path/")

but with extra Databricks logic for schema management and ingestion governance.


Regarding schemaHints, it works the same way as in Auto Loader: it lets you override or enforce specific column types while leaving the rest of the schema inferred automatically. Docs
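For reference, the Auto Loader counterpart is the `cloudFiles.schemaHints` option. A minimal sketch, assuming a Databricks cluster; the bucket path and column names are hypothetical examples:

```python
# Auto Loader sketch: schemaHints pins the types of the named columns
# while the rest of the schema is still inferred automatically.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")
    .option("cloudFiles.schemaHints", "user_id STRING, event_time TIMESTAMP")
    .load("s3://my-bucket/path/")
)
```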

While spark.read in open-source Spark only lets you either fully define a schema or infer it entirely (Docs), Databricks added schemaHints to this built-in function in its Databricks Runtime, so you can override or enforce specific column types while letting the rest of the schema be inferred automatically.
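For contrast, here is a plain spark.read sketch where the schema has to be supplied in full up front (column names reuse the earlier example and are assumptions):

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Open-source Spark: either infer the whole schema or define all of it.
# There is no per-column hint mechanism comparable to schemaHints.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])
df = spark.read.format("json").schema(schema).load("s3://my-bucket/path/")
```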

Hope this helps 🙂

Isi

View solution in original post

2 REPLIES 2


BS_THE_ANALYST
Esteemed Contributor II

Also, @ChristianRRL, with a slight adjustment to the syntax, it does indeed behave like Auto Loader:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/patterns?language=SQL 

BS_THE_ANALYST_0-1759901883674.png
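The pattern in the linked docs adds the STREAM keyword so read_files runs as an incremental, Auto Loader-style ingestion. A sketch along those lines; the table name and path are assumptions:

```sql
-- STREAM makes read_files incremental: new files are picked up as they
-- arrive, like Auto Loader. Names below are illustrative only.
CREATE OR REFRESH STREAMING TABLE raw_events AS
SELECT * FROM STREAM read_files(
  's3://my-bucket/path/',
  format => 'json',
  schemaHints => 'user_id STRING, event_time TIMESTAMP'
);
```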

I'd also advise looking at the different options that Autoloader has when working with cloud storage i.e. Directory Listing Mode and File notification mode (recommended): https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes 
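As a sketch of the file notification mode mentioned above (the `cloudFiles.useNotifications` option comes from the linked docs; paths are hypothetical):

```python
# File notification mode: Auto Loader subscribes to cloud storage events
# instead of listing directories, which scales better for large buckets.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")
    .load("s3://my-bucket/path/")
)
```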

All the best,
BS
