Delta Table on AWS Glue Catalog

New Contributor II

I have set up a Databricks cluster to work with the AWS Glue Catalog by setting spark.databricks.hive.metastore.glueCatalog.enabled to true. However, when I create a Delta table in the Glue Catalog, the schema shown in the AWS Glue Catalog is incorrect: the table has only one column, named `col`, of type `array<string>`.



Does anyone have any insights?



Community Manager

Hi @Tam, let's look at how Delta tables work with the AWS Glue Data Catalog.


Delta Lake and AWS Glue:

  • Delta Lake is an open source project that facilitates modern data lake architectures, often built on Amazon S3 or other cloud storage solutions.
  • With Delta Lake, you gain features like ACID transactions, time travel queries, and change data capture (CDC) for your data lake.
  • AWS Glue integrates seamlessly with Delta Lake, allowing you to work with Delta tables using the AWS Glue Data Catalog.

AWS Glue Delta Crawler:

  • AWS Glue includes a Delta crawler, which simplifies dataset discovery.
  • The Delta crawler scans the Delta Lake transaction logs in Amazon S3, extracts the schema, creates manifest files, and automatically populates the AWS Glue Data Catalog.
  • The AWS Glue Data Catalog table created this way uses the SymlinkTextInputFormat input format and points at the manifest files.
  • With this manifest-based approach, the manifest files had to be regenerated periodically to include newer transactions, which added I/O overhead and lengthened processing times.
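
To make "extracts the schema from the transaction log" concrete, here is a minimal sketch in plain Python. It assumes the standard Delta log layout, where each line of a commit file under `_delta_log/` is one JSON action and the `metaData` action carries the schema as a JSON string in `schemaString`. The sample commit, table id, and column names are made up for illustration; this is not how the Glue crawler itself is implemented.

```python
import json

# A made-up, minimal first commit (_delta_log/00000000000000000000.json).
# Each line of a Delta commit file is a single JSON action.
SAMPLE_COMMIT = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {
        "id": "hypothetical-table-id",
        "format": {"provider": "parquet", "options": {}},
        # The schema is itself JSON, serialized into a string field.
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [
                {"name": "id", "type": "long", "nullable": False, "metadata": {}},
                {"name": "name", "type": "string", "nullable": True, "metadata": {}},
            ],
        }),
        "partitionColumns": [],
        "configuration": {},
    }}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 100,
                        "modificationTime": 0, "dataChange": True,
                        "partitionValues": {}}}),
])


def extract_schema(commit_text):
    """Return [(column_name, type)] from the last metaData action in a commit.

    Nested types (struct, array, map) appear as dicts rather than strings;
    this sketch passes them through unchanged.
    """
    columns = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "metaData" in action:
            schema = json.loads(action["metaData"]["schemaString"])
            columns = [(f["name"], f["type"]) for f in schema["fields"]]
    return columns


print(extract_schema(SAMPLE_COMMIT))
```

Running the same function against a real commit file from your table's `_delta_log/` folder is a quick way to compare the schema recorded by Delta Lake with what the Glue Catalog shows.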

Native Delta Lake Tables:

  • Good news! The Glue crawler now supports creating AWS Glue Data Catalog tables directly for native Delta Lake tables.
  • No more manual intervention or manifest file generation.
  • Schema evolution happens automatically, making newly ingested data quickly available for analysis with your preferred analytics and machine learning tools.
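
As a rough illustration of the native option, the sketch below builds the request for a crawler with a Delta target. It assumes the `DeltaTargets` fields of the AWS Glue `CreateCrawler` API (`DeltaTables`, `WriteManifest`, `CreateNativeDeltaTable`); please check them against the current AWS documentation. The crawler name, role ARN, database, and S3 path are hypothetical placeholders.

```python
def delta_crawler_request(name, role_arn, database, table_paths):
    """Build keyword arguments for a Glue create_crawler call (sketch only)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "DeltaTargets": [{
                "DeltaTables": list(table_paths),
                # Skip manifest generation and register a native Delta table.
                "WriteManifest": False,
                "CreateNativeDeltaTable": True,
            }]
        },
    }


# All values below are placeholders, not real resources.
request = delta_crawler_request(
    name="delta-native-crawler",
    role_arn="arn:aws:iam::123456789012:role/HypotheticalGlueRole",
    database="analytics",
    table_paths=["s3://example-bucket/delta/events/"],
)
print(request["Targets"]["DeltaTargets"][0])

# To submit the request (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

The request is built separately from the API call so it can be reviewed or logged before anything is created in your account.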

Querying Delta Lake Tables:

  • You can query native Delta Lake tables directly using:
    • Amazon Athena: It supports the Delta Lake native connector.
    • AWS Glue for Apache Spark: Available in Glue version 3.0 and later.
    • Amazon EMR: Supported in EMR release version 6.9.0 and later.

Best Practices:

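  • Avoid pointing a generic (non-Delta) AWS Glue crawler at the S3 location of Delta Lake files to define the table. Instead, query the table through a Delta-aware path, such as the ones listed above.
  • Delta Lake keeps data files belonging to multiple versions of the table, so a query that reads every crawled file, rather than only the files in the current version, may return incorrect results.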
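
The second point can be shown with a small sketch. It assumes the usual Delta log semantics, where `add` and `remove` actions are replayed in commit order to work out which files belong to the current version. The commits and file names are invented for the example.

```python
# Hypothetical commits: version 0 writes two files, version 1 rewrites one.
COMMITS = [
    [{"add": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}],
    [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-2.parquet"}}],
]


def live_files(commits):
    """Replay add/remove actions in order; return files in the latest version."""
    live = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def all_files_ever_added(commits):
    """Every file a naive scan of the folder could still find on storage."""
    return {
        action["add"]["path"]
        for actions in commits
        for action in actions
        if "add" in action
    }


current = live_files(COMMITS)
stale = all_files_ever_added(COMMITS) - current
print("current version:", sorted(current))
print("stale but still on storage:", sorted(stale))
```

A reader that ignores the log would pick up `part-0.parquet` as well, so rows that were rewritten into `part-2.parquet` would show up in their old form alongside the new one.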


For more details, refer to the official AWS blog post.

