
Huge data migration from HDFS to Databricks

Phani1
Valued Contributor

Hi Team,

Could you please advise on the best way/best practices to copy around 3 TB of Parquet data from HDFS into Databricks Delta format and create external tables on top of it?

Regards,

Phanindra

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Phani1, To efficiently copy around 3 TB of Parquet data from HDFS to Databricks Delta format and create external tables, you can follow these best practices:

  1. Use the COPY INTO SQL Command:

    • The COPY INTO SQL command allows you to load data from a file location into a Delta table. It offers several capabilities:

      • Easily configurable file or directory filters from cloud storage (S3, ADLS Gen2, ABFS, GCS, and Unity Catalog volumes).
      • Support for multiple source file formats: CSV, JSON, XML, Avro, ORC, Parquet, text, and binary files.
      • Exactly-once (idempotent) file processing by default.
      • Target table schema inference, mapping, merging, and evolution.
    • Example of loading data into a schemaless Delta Lake table:

      -- Create an empty, schemaless Delta table; add an optional COMMENT or
      -- TBLPROPERTIES clause here if needed. The schema is inferred by COPY INTO.
      CREATE TABLE IF NOT EXISTS my_table;

      -- Load the Parquet files and evolve the table schema to match them
      COPY INTO my_table
      FROM '/path/to/files'
      FILEFORMAT = PARQUET
      FORMAT_OPTIONS ('mergeSchema' = 'true')
      COPY_OPTIONS ('mergeSchema' = 'true');
      

      Note that the empty Delta table created above is not usable outside of COPY INTO. After data ingestion, the table becomes queryable.
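    • Since you also want external tables on top of the migrated data, here is a minimal sketch (the table name and ADLS path are illustrative) that registers an external table over Delta files that already exist in a cloud storage location you control:

      -- Assumes the migrated data has already been written as Delta files to this path
      CREATE TABLE IF NOT EXISTS my_external_table
      USING DELTA
      LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/migrated/my_table';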

  2. Consider Streaming Tables:

    • For a more scalable and robust file ingestion experience, consider leveraging streaming tables in Databricks SQL. Streaming tables provide continuous data ingestion and are well-suited for large-scale data loads.
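    • For example, a minimal sketch of a streaming table that incrementally ingests the files into Delta (the table name and path are illustrative, and it assumes the Parquet files have been staged in cloud storage readable by Databricks):

      CREATE OR REFRESH STREAMING TABLE my_streaming_table
      AS SELECT *
      FROM STREAM read_files(
        'abfss://container@storageaccount.dfs.core.windows.net/landing/parquet_data',
        format => 'parquet'
      );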
  3. Set Up Data Access Configuration:

    • An account admin should configure data access for ingestion to allow users to load data using COPY INTO.
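    • For example, assuming Unity Catalog is in use (the location, credential, and group names below are illustrative), an admin could register the landing path as an external location and grant read access to the users who will run COPY INTO:

      CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
        URL 'abfss://container@storageaccount.dfs.core.windows.net/landing'
        WITH (STORAGE CREDENTIAL my_storage_credential);

      GRANT READ FILES ON EXTERNAL LOCATION landing_zone TO `data_engineers`;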
  4. Schema Evolution:

    • If your Parquet data evolves over time (e.g., new columns are added), set mergeSchema to true to handle schema changes during ingestion.
  5. Checkpoint Location:

    • If you use a Structured Streaming job (for example, Auto Loader) rather than COPY INTO, specify a checkpoint location for the stream to ensure fault tolerance and recovery in case of failures. Streaming tables in Databricks SQL manage their checkpoints automatically.
  6. Monitor and Optimize:

    • Monitor the ingestion process, track performance metrics, and optimize resource allocation (e.g., cluster size, parallelism) based on your workload.
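    • For example, after the load completes you can check per-operation ingestion metrics and compact small files (table and column names are illustrative):

      DESCRIBE HISTORY my_table;  -- shows operation metrics such as rows and files written
      OPTIMIZE my_table;          -- compacts small files produced during ingestion
      -- Optionally co-locate data on a frequently filtered column:
      -- OPTIMIZE my_table ZORDER BY (event_date);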