Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Querying Unity Managed Tables from Redshift

stevenayers-bge
Contributor

I built a script about 6 months ago to make our Delta tables accessible in Redshift for another team, but it's a bit nasty...

  1. Generate a Delta Lake symlink manifest each time the Databricks Delta table is updated.
  2. Recreate the Redshift external table (in case the schema has changed or the table was recreated with a different location) by dynamically generating the DDL pointing to '<pathToDeltaTable>/_symlink_format_manifest/'. [code 1]
  3. If the Delta table is partitioned, generate a list of the Delta partitions and register each partition in Redshift: ALTER TABLE <tblname> ADD PARTITION IF NOT EXISTS (<partitioncol>='<colvalue>') LOCATION '<pathToDeltaTable>/_symlink_format_manifest/<partitioncol>=<colvalue>'
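For reference, steps 2 and 3 above can be sketched as a couple of small helpers that build the DDL strings. All names (schema, table, columns, S3 path, partition values) are illustrative, and the SerDe/input-format clauses follow the common Hive symlink-manifest pattern for Parquet-backed Delta tables, so verify them against your environment:

```python
# Sketch of the dynamic DDL generation described in steps 2-3.
# Schema/table/path names are hypothetical examples.

def external_table_ddl(schema: str, table: str, columns: dict, delta_path: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement pointing at the symlink manifest dir."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{delta_path}/_symlink_format_manifest/'"
    )

def add_partition_ddl(schema: str, table: str, part_col: str,
                      part_val: str, delta_path: str) -> str:
    """Build the ALTER TABLE ... ADD PARTITION statement for one Delta partition."""
    return (
        f"ALTER TABLE {schema}.{table} ADD PARTITION IF NOT EXISTS "
        f"({part_col}='{part_val}') "
        f"LOCATION '{delta_path}/_symlink_format_manifest/{part_col}={part_val}'"
    )
```

Each generated statement would then be executed against Redshift after every table update, which is exactly the moving part the question is trying to eliminate.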

 

It's fine and it works, but is there a better way of doing this? Like an integration between Unity Catalog and Redshift Spectrum that wouldn't require dynamically generating DDL and adding partitions like this?

2 REPLIES

aashish122
New Contributor III

Still searching for a solution to the same pain point... maybe there will be something in the Marketplace to integrate Unity Catalog and Redshift.

mark_ott
Databricks Employee

There is indeed a better and more integrated way to make Delta Lake tables accessible in Redshift without manually generating manifests and dynamically creating external tables or partitions. Some important points and options:

  • Databricks Delta Lake supports an automatic manifest generation mode by enabling the table property delta.compatibility.symlinkFormatManifest.enabled. This auto-updates the manifest files on each table update, so you don't have to manually generate them.
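Concretely, that property is set once on the table, after which Databricks rewrites the manifest on every write. A minimal sketch, assuming a hypothetical table name and a Databricks notebook context where the string is passed to spark.sql():

```python
# One-time setup: enable automatic symlink-manifest regeneration on a Delta table.
# Table name is illustrative; run the statements via spark.sql() in Databricks.

enable_auto_manifest = """
ALTER TABLE my_catalog.my_schema.my_delta_table
SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
"""

# The first manifest still has to be generated once, e.g.:
initial_manifest = "GENERATE symlink_format_manifest FOR TABLE my_catalog.my_schema.my_delta_table"

# spark.sql(enable_auto_manifest)   # uncomment inside a Databricks notebook
# spark.sql(initial_manifest)
```

After this, step 1 of the original script becomes unnecessary: every write to the table refreshes `_symlink_format_manifest/` automatically.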

  • Redshift Spectrum can query external tables on S3 and supports partitioning, but for Delta Lake tables it requires manifest files that act as a snapshot of the data. Using the symlink format manifest allows consistent snapshot querying.

  • Your current approach of recreating external tables and dynamically adding partitions can be simplified by relying on automatic manifest generation and defining the external table only once, pointing at the manifest directory (e.g., <pathToDeltaTable>/_symlink_format_manifest). Partitions can then be handled automatically by Redshift Spectrum if the Glue Catalog or another metadata store is used properly.

  • Unity Catalog itself does not natively integrate with Redshift Spectrum to automatically sync tables or partitions without manual DDL operations, but you can use the AWS Glue Catalog as a metadata store, which integrates well with Redshift Spectrum to automate partition detection and table schema management.

  • AWS Glue Crawlers can automate the detection of partitions and schema updates for S3 data, including Delta Lake manifests, thus reducing or eliminating the need to manually alter tables in Redshift Spectrum.
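The crawler-based setup above could be scripted with boto3 roughly as follows. The crawler name, IAM role ARN, database, and S3 path are all assumptions, and the actual AWS calls are commented out so the sketch stays runnable offline:

```python
# Sketch: register a Glue crawler over the Delta symlink-manifest directory so
# Spectrum picks up partitions/schema changes without manual ALTER TABLE calls.
# All identifiers below (crawler name, role ARN, database, S3 path) are hypothetical.

crawler_config = {
    "Name": "delta-manifest-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed IAM role
    "DatabaseName": "spectrum_db",
    "Targets": {
        "S3Targets": [{"Path": "s3://my-bucket/events/_symlink_format_manifest/"}]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up new columns/partitions
        "DeleteBehavior": "LOG",
    },
}

def run_crawler(config: dict) -> None:
    """Create (or re-run) the crawler; requires AWS credentials and boto3."""
    import boto3  # deferred import so the sketch loads without AWS installed
    glue = boto3.client("glue")
    glue.create_crawler(**config)          # use update_crawler if it already exists
    glue.start_crawler(Name=config["Name"])

# run_crawler(crawler_config)  # uncomment with valid AWS credentials/permissions
```

The crawler can then run on a schedule or be triggered after each Databricks write, replacing the per-partition ALTER TABLE loop.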

Summary:

The best practice is to enable automatic manifest updates on the Delta table, register the Delta table location with Redshift Spectrum external tables once (pointing to the _symlink_format_manifest), and use AWS Glue Catalog with crawlers to automate partition and schema management. This avoids the need to dynamically generate DDL and add partitions manually. Full native integration between Unity Catalog and Redshift Spectrum for this purpose is not available yet.

If desired, you can also explore other approaches such as partition projection as a metadata-free alternative, but a metadata-driven approach with the Glue Catalog is more robust.
