
read_files vs. cloud_files

Ivaylo
New Contributor II

I was wondering what the difference is between `read_files` and `cloud_files`.

I can't find an explicit explanation or comparison in the Databricks documentation.

Best Regards 

Ivaylo

Accepted Solutions

nayan_wylde
Honored Contributor II

## Key Differences Between `read_files` and `cloud_files`

### **`read_files` Function**

`read_files` is a table-valued function that reads files under a provided location and returns the data in tabular form. It supports the JSON, CSV, XML, TEXT, BINARYFILE, PARQUET, AVRO, and ORC file formats, can detect the file format automatically, and infers a unified schema across all files.

**Key characteristics:**

- **Batch Processing**: Primarily designed for batch reading of files
- **SQL Function**: Can be used directly in SQL queries
- **Schema Inference**: Automatically infers unified schema across files
- **Format Detection**: Auto-detects file formats
- **Streaming Support**: Can also be used in a streaming table query to ingest files into Delta Lake, in which case it leverages Auto Loader
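
To make the batch and streaming uses of `read_files` concrete, here is a minimal sketch that only builds the two SQL statements as strings. The path and the table name `orders_bronze` are hypothetical, and the streaming statement assumes an environment that supports streaming tables (for example a pipeline or SQL warehouse).

```python
def batch_read_sql(path: str, file_format: str) -> str:
    """Build a one-off batch query over the files under `path`."""
    return f"SELECT * FROM read_files('{path}', format => '{file_format}')"


def streaming_table_sql(table: str, path: str, file_format: str) -> str:
    """Build a streaming table definition that ingests new files incrementally."""
    return (
        f"CREATE OR REFRESH STREAMING TABLE {table} AS\n"
        f"SELECT * FROM STREAM read_files('{path}', format => '{file_format}')"
    )


# Hypothetical location and table name, for illustration only.
LANDING_PATH = "/Volumes/main/default/landing/orders"

batch_query = batch_read_sql(LANDING_PATH, "json")
stream_query = streaming_table_sql("orders_bronze", LANDING_PATH, "json")

print(batch_query)
print(stream_query)

# In a Databricks notebook, where a `spark` session is predefined,
# the batch query could be run with: df = spark.sql(batch_query)
```

The only difference between the two statements is the `STREAM` keyword and the streaming table wrapper; the `read_files` call itself is the same.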

### **`cloud_files` (Auto Loader)**

Auto Loader offers better performance, scalability, and cost efficiency when discovering new files. It can run in file notification mode, where a cloud-native integration notifies it about new files, or in an optimized file listing mode that uses native cloud APIs to list files and directories.

**Key characteristics:**

- **Streaming-First**: Designed primarily for streaming/incremental data ingestion
- **Performance Optimized**: Better performance for discovering new files
- **Cloud-Native Integration**: Can use cloud storage notifications (file notification mode) for near-real-time file discovery
- **Auto Loader Technology**: Full-featured incremental data ingestion solution
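
In Python, Auto Loader is typically reached through the `cloudFiles` stream source. The sketch below shows one way that could look; the helper names `auto_loader_options` and `start_ingest`, and every path and table name, are assumptions for illustration. `start_ingest` needs a Databricks `spark` session, so it is only defined here, not called.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Collect Auto Loader options for the chosen file discovery mode."""
    opts = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }
    if use_notifications:
        # File notification mode; when omitted, directory listing is used.
        opts["cloudFiles.useNotifications"] = "true"
    return opts


def start_ingest(spark, source_path, target_table, checkpoint_path, options):
    """Start an incremental ingest of new files into a Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .toTable(target_table)
    )


# Hypothetical schema location, for illustration only.
listing_opts = auto_loader_options("json", "/tmp/schemas/orders")
notify_opts = auto_loader_options("json", "/tmp/schemas/orders", use_notifications=True)
print(listing_opts)
print(notify_opts)
```

File notification mode usually also requires cloud-side permissions and setup, which this sketch does not cover.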

## Detailed Comparison

### **Performance & Scalability**

- **`read_files`**: Good for batch operations, but less optimized for continuous file discovery
- **`cloud_files`**: Better performance, scalability and cost efficiency when discovering new files

### **File Discovery Methods**

- **`read_files`**: In a batch query, lists the files under the given location each time it runs; in a streaming table query, it leverages Auto Loader
- **`cloud_files`**: Uses file notification mode (cloud-native integration) or optimized file listing mode with native cloud APIs
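
The practical effect of batch versus incremental discovery can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not how Auto Loader is implemented; the class `IncrementalReader` and the file names are made up for the example.

```python
def batch_read(listing):
    """A batch read sees every file under the path each time it runs."""
    return sorted(listing)


class IncrementalReader:
    """Remembers which files it has already ingested, like a checkpoint."""

    def __init__(self):
        self._seen = set()

    def poll(self, listing):
        """Return only the files that have not been ingested before."""
        new_files = sorted(set(listing) - self._seen)
        self._seen.update(new_files)
        return new_files


# Hypothetical directory contents on two consecutive days.
day1 = ["a.json", "b.json"]
day2 = day1 + ["c.json"]

reader = IncrementalReader()
first = reader.poll(day1)
second = reader.poll(day2)

print(batch_read(day2))  # the batch read returns all three files again
print(first, second)     # the incremental reader returns each file once
```

In this model a repeated batch read reprocesses `a.json` and `b.json` on the second day, while the incremental reader picks up only `c.json`.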
