Hi @NathanC0926
Ingesting Excel files with streaming tables requires a combination of Databricks Autoloader
(for file discovery and exactly-once processing) and a custom UDF for Excel parsing.
Here's the native approach:
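A minimal sketch of the Autoloader + UDF approach, assuming an .xlsx landing path and a Unity Catalog target table; every path, table, and function name below is an illustrative placeholder, not Databricks-provided API beyond the documented `cloudFiles` options:

```python
# Sketch of Autoloader + openpyxl parsing for Excel ingestion on Databricks.
# Paths, table names, and helper names are illustrative placeholders.
import io

def parse_excel_bytes(content: bytes):
    """Parse the first sheet of an .xlsx payload into a list of row tuples."""
    from openpyxl import load_workbook  # bundled with Databricks runtimes
    wb = load_workbook(io.BytesIO(content), read_only=True, data_only=True)
    rows = list(wb.worksheets[0].iter_rows(values_only=True))
    wb.close()  # read-only workbooks hold a file handle until closed
    return rows

def start_excel_stream(spark, source_path, checkpoint_path, target_table):
    """Discover Excel files with Autoloader and parse each micro-batch."""
    def process_batch(batch_df, batch_id):
        import pandas as pd
        parsed = []
        # collect() is fine for small workbooks; large files should be
        # pre-converted to Parquet instead (see performance notes below)
        for rec in batch_df.select("path", "content").collect():
            for row in parse_excel_bytes(rec["content"]):
                parsed.append((rec["path"],) + row)
        if parsed:
            (spark.createDataFrame(pd.DataFrame(parsed))
                  .write.mode("append").saveAsTable(target_table))

    return (
        spark.readStream.format("cloudFiles")       # Autoloader source
        .option("cloudFiles.format", "binaryFile")  # Excel has no native reader
        .option("pathGlobFilter", "*.xlsx")         # only pick up Excel files
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)  # exactly-once tracking
        .foreachBatch(process_batch)
        .start()
    )
```

On a real workspace you would call `start_excel_stream(spark, "/Volumes/<catalog>/<schema>/landing", "/Volumes/<catalog>/<schema>/_checkpoints/excel", "<catalog>.<schema>.excel_bronze")` with your own locations.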
Key Features of This Solution
1. Exactly-Once Processing
-- Autoloader automatically handles deduplication
-- Uses checkpointing to ensure files are processed exactly once
-- Tracks processed files in the checkpoint location so none are re-ingested
2. Auto-Discovery
-- Autoloader continuously monitors the specified path
-- Automatically picks up new Excel files as they arrive
-- Supports glob patterns for file filtering
3. Native Integration
-- Uses Databricks' native Autoloader functionality
-- Integrates seamlessly with Unity Catalog
-- Supports Delta Live Tables (DLT) pattern
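The Autoloader options behind features 1 and 2 might look like the following; the paths are placeholders and the exact values depend on your workspace layout:

```python
# Illustrative Autoloader options for exactly-once tracking and file filtering.
# All paths are placeholders.
autoloader_options = {
    "cloudFiles.format": "binaryFile",
    # Autoloader records discovered files under this location so each
    # file is processed exactly once across restarts
    "cloudFiles.schemaLocation": "/Volumes/main/raw/_schemas/excel_ingest",
    # Cap files per micro-batch to keep workbook parsing memory-bounded
    "cloudFiles.maxFilesPerTrigger": "100",
    # Glob pattern so only Excel files are picked up from the landing path
    "pathGlobFilter": "*.xlsx",
}
```

These would be applied with `.options(**autoloader_options)` on the `cloudFiles` reader.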
Alternative: Using Delta Live Tables
For a more declarative approach, the same ingestion can be expressed as a Delta Live Tables pipeline. It offers:
-- Built-in data quality monitoring
-- Automatic pipeline orchestration
-- Better integration with UC governance features
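A sketch of what the DLT version could look like, wrapped in a function because the `dlt` module and the global `spark` session exist only inside a Delta Live Tables pipeline; the table name, expectation, and path are assumptions:

```python
# Sketch of a Delta Live Tables (DLT) version; runs only inside a DLT pipeline.
# Table name, expectation, and path below are illustrative placeholders.
def define_excel_dlt_table(spark):
    import dlt  # available only in Delta Live Tables pipelines

    @dlt.table(name="bronze_excel_raw",
               comment="Raw Excel files discovered by Autoloader")
    @dlt.expect("nonempty_file", "length(content) > 0")  # data quality check
    def bronze_excel_raw():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "binaryFile")
            .option("pathGlobFilter", "*.xlsx")
            .load("/Volumes/main/raw/excel_landing")  # placeholder path
        )
```

In a DLT notebook the function body would sit at top level; the `@dlt.expect` decorator is what provides the built-in quality monitoring mentioned above.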
Performance Considerations
-- For Small Files: The UDF approach works well
-- For Large Files: Consider pre-processing Excel files to Parquet
-- Memory Management: Use read_only=True in openpyxl for large files
-- Concurrency: Autoloader handles parallelization automatically
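The memory-management point can be illustrated with a small standalone function: `read_only=True` makes openpyxl stream rows lazily instead of materializing the whole workbook in memory (the function name is mine):

```python
# Memory-bounded row count using openpyxl's read_only mode: rows are
# streamed lazily rather than loading the entire workbook into memory.
import io
from openpyxl import load_workbook

def count_rows(xlsx_bytes: bytes) -> int:
    wb = load_workbook(io.BytesIO(xlsx_bytes), read_only=True)
    n = sum(1 for _ in wb.worksheets[0].iter_rows(values_only=True))
    wb.close()  # read-only workbooks keep the source open until closed
    return n
```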
This solution provides a native way to handle Excel files in a streaming fashion while ensuring exactly-once processing
and auto-discovery of new files in Unity Catalog.
LR