
Delta Table on AWS Glue Catalog

Tam
New Contributor II

I have set up a Databricks cluster to work with the AWS Glue Catalog by setting spark.databricks.hive.metastore.glueCatalog.enabled to true. However, when I create a Delta table in the Glue Catalog, the schema shown in the AWS Glue Catalog is incorrect: the table schema has only one column, named `col`, of type `array<string>`.

Tam_0-1700157256870.png

Tam_1-1700157262740.png

Does anyone have any insights?

 


Kaniz
Community Manager

Hi @Tam, here is an overview of how Delta tables work with the AWS Glue Data Catalog.

 

Delta Lake and AWS Glue:

  • Delta Lake is an open source project that facilitates modern data lake architectures, often built on Amazon S3 or other cloud storage solutions.
  • With Delta Lake, you gain features like ACID transactions, time travel queries, and change data capture (CDC) for your data lake.
  • AWS Glue integrates seamlessly with Delta Lake, allowing you to work with Delta tables using the AWS Glue Data Catalog.

AWS Glue Delta Crawler:

  • AWS Glue includes a Delta crawler, which simplifies dataset discovery.
  • The Delta crawler scans the Delta Lake transaction logs in Amazon S3, extracts the schema, creates manifest files, and automatically populates the AWS Glue Data Catalog.
  • The newly created AWS Glue Data Catalog table has the format SymlinkTextInputFormat.
  • Previously, manifest files needed periodic regeneration to include newer transactions, resulting in I/O overhead and longer processing times.

Native Delta Lake Tables:

  • Good news! The Glue crawler now supports creating AWS Glue Data Catalog tables directly for native Delta Lake tables.
  • No more manual intervention or manifest file generation.
  • Schema evolution happens automatically, making newly ingested data quickly available for analysis with your preferred analytics and machine learning tools.
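As a rough sketch of the two crawler modes described above (manifest-based versus native Delta tables), the Glue `CreateCrawler` API accepts a `DeltaTargets` entry with `WriteManifest` and `CreateNativeDeltaTable` flags. The crawler name, role ARN, database, and S3 path are placeholders, and the two helper functions are hypothetical names used only for illustration; this assumes boto3 and AWS credentials are available when the crawler is actually created.

```python
def build_delta_crawler_request(name, role_arn, database, table_paths, native=True):
    """Build keyword arguments for Glue CreateCrawler with a Delta Lake target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "DeltaTargets": [
                {
                    "DeltaTables": list(table_paths),
                    # native=True: catalog the table as a native Delta table.
                    # native=False: fall back to the manifest-based table.
                    "CreateNativeDeltaTable": native,
                    "WriteManifest": not native,
                }
            ]
        },
    }


def create_delta_crawler(**kwargs):
    """Send the request to AWS Glue (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    return glue.create_crawler(**build_delta_crawler_request(**kwargs))


# Placeholder values; replace with your own role, database, and table path.
request = build_delta_crawler_request(
    name="delta-crawler-example",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    table_paths=["s3://example-bucket/path/to/delta-table/"],
)
```

Building the request separately from sending it makes it easy to inspect the crawler settings before anything is created in your account.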

Querying Delta Lake Tables:

  • You can query native Delta Lake tables directly using:
    • Amazon Athena: It supports the Delta Lake native connector.
    • AWS Glue for Apache Spark: Available in Glue version 3.0 and later.
    • Amazon EMR: Supported in EMR release version 6.9.0 and later.
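For the Athena option above, a minimal sketch using the boto3 `start_query_execution` call might look like the following. The database, table, and results location are placeholders, the helper names are hypothetical, and it assumes the table is already registered in the Glue Data Catalog as a native Delta table.

```python
def build_athena_query(database, table, output_location, limit=10):
    """Build keyword arguments for Athena StartQueryExecution."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(**kwargs):
    """Submit the query and return its execution ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(**build_athena_query(**kwargs))
    return response["QueryExecutionId"]


# Placeholder values; replace with your own database, table, and bucket.
query = build_athena_query(
    database="example_db",
    table="example_delta_table",
    output_location="s3://example-bucket/athena-results/",
)
```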

Best Practices:

  • Avoid using a generic (non-Delta) AWS Glue crawler to define the table in AWS Glue for Delta Lake files. Instead, read the table through an engine that understands the Delta transaction log.
  • Delta Lake keeps data files belonging to multiple versions of the table, so a query over every crawled file may return incorrect results.
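To illustrate why reading every file under the table path can give wrong results, here is a minimal sketch that replays the `add` and `remove` actions from Delta transaction log commits to find the files belonging to the current version. The file names and commit contents are made up for the example, the actions are simplified (real log entries carry more fields), and checkpoints are ignored.

```python
import json


def live_files(commits):
    """Replay Delta commit contents (oldest first) and return the current data files."""
    active = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active


# Hypothetical commits: version 0 writes two files, version 1 rewrites one of them.
commits = [
    '{"add": {"path": "part-0001.parquet"}}\n{"add": {"path": "part-0002.parquet"}}',
    '{"remove": {"path": "part-0001.parquet"}}\n{"add": {"path": "part-0003.parquet"}}',
]

# Every Parquet file physically present under the table path.
all_files = {"part-0001.parquet", "part-0002.parquet", "part-0003.parquet"}

current = live_files(commits)
stale = all_files - current  # files a naive "read everything" scan would wrongly include
```

In this example `part-0001.parquet` is still on storage but no longer part of the table, so a scan over all files would return rows that the current version has already replaced.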

 

For more details, refer to the official AWS blog post.
