Hi @Phani1, To efficiently copy around 3 TB of Parquet data from HDFS to Databricks Delta format and create external tables, you can follow these best practices:
Use the COPY INTO SQL Command:
- The COPY INTO SQL command allows you to load data from a file location into a Delta table. It offers several capabilities:
- Easily configurable file or directory filters from cloud storage (S3, ADLS Gen2, ABFS, GCS, and Unity Catalog volumes).
- Support for multiple source file formats: CSV, JSON, XML, Avro, ORC, Parquet, text, and binary files.
- Exactly-once (idempotent) file processing by default.
- Target table schema inference, mapping, merging, and evolution.
Example of loading data into a schemaless Delta Lake table:
CREATE TABLE IF NOT EXISTS my_table
[COMMENT <table-description>]
[TBLPROPERTIES (<table-properties>)];

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Note that the empty Delta table created above is not usable outside of COPY INTO; after data ingestion, the table becomes queryable.
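Since the goal here is landing 3 TB of Parquet in an external table, here is a minimal sketch of the same pattern for that case. It assumes the Parquet files have already been copied from HDFS to cloud storage (for example with DistCp), and the catalog, schema, table name, and paths below are hypothetical placeholders:

CREATE TABLE IF NOT EXISTS main.ingest.events
LOCATION 'abfss://datalake@mystorage.dfs.core.windows.net/delta/events';  -- hypothetical external table path

COPY INTO main.ingest.events
FROM 'abfss://datalake@mystorage.dfs.core.windows.net/staging/parquet/'   -- hypothetical staging path
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

Because COPY INTO is idempotent, re-running the statement after a failure picks up only the files that were not already loaded.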
Consider Streaming Tables:
- For a more scalable and robust file ingestion experience, consider leveraging streaming tables in Databricks SQL. Streaming tables provide continuous, incremental data ingestion and are well-suited for large-scale data loads; see the sketch below.
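As a rough sketch (the table name and storage path are hypothetical), a streaming table that incrementally ingests the staged Parquet files could look like:

CREATE OR REFRESH STREAMING TABLE main.ingest.events_streaming
AS SELECT *
FROM STREAM read_files(
  'abfss://datalake@mystorage.dfs.core.windows.net/staging/parquet/',  -- hypothetical staging path
  format => 'parquet'
);

Each refresh processes only the files that arrived since the previous run, which keeps re-ingestion cheap as new Parquet lands.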
Set Up Data Access Configuration:
- An account admin should configure data access for ingestion so that users can load data using COPY INTO; see the sketch after this item.
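In a Unity Catalog workspace, this typically means defining an external location over the staging path and granting read access on it. A minimal sketch, assuming a storage credential already exists and using hypothetical names throughout:

CREATE EXTERNAL LOCATION IF NOT EXISTS staging_parquet
URL 'abfss://datalake@mystorage.dfs.core.windows.net/staging/'  -- hypothetical container
WITH (STORAGE CREDENTIAL my_storage_credential);                -- assumed to exist already

GRANT READ FILES ON EXTERNAL LOCATION staging_parquet TO `data_engineers`;  -- hypothetical group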
Schema Evolution:
- If your Parquet data evolves over time (e.g., new columns are added), set mergeSchema to true in both FORMAT_OPTIONS and COPY_OPTIONS, as in the example above, so schema changes are handled during ingestion.
Checkpoint Location:
- If you ingest with a Structured Streaming job instead, specify a checkpoint location to ensure fault tolerance and recovery in case of failures; COPY INTO and streaming tables track their ingestion state for you.
Monitor and Optimize:
- Monitor the ingestion process, track performance metrics, and optimize resource allocation (e.g., cluster size, parallelism) based on your workload; the sketch below shows two useful commands.
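For example, Delta's table history and OPTIMIZE commands (shown here against the hypothetical table from the earlier sketches) let you audit each ingestion commit and compact the many small files a 3 TB load can produce:

-- Review each ingestion commit: operation, timestamp, and rows/files written
DESCRIBE HISTORY main.ingest.events;

-- Compact small files after the bulk load to speed up downstream queries
OPTIMIZE main.ingest.events;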