Hello community,
I wrote this piece of code to do a data migration between two catalogs:
from pyspark import StorageLevel

# table_name and partition_col come from the loop over the tables I need to copy

# Read the source table and cache it so the write does not re-read it
print(f"Loading {table_name} from production catalog...")
prod_df_table_name = f"prod_catalog.`00_bronze_layer`.pg_{table_name}"
prod_df_table = spark.read.table(prod_df_table_name).persist(StorageLevel.MEMORY_AND_DISK)

# Write with optimizations
print(f"Saving {table_name} to Bronze Schema")
stg_df_table_name = f"stg_catalog.`00_bronze_layer`.pg_{table_name}"

if partition_col in ["all", "none", "export_run_timestamp", "updated_at", "received_at"]:
    # These values are not useful partition keys, so write the table unpartitioned
    prod_df_table.write \
        .format("delta") \
        .mode("overwrite") \
        .option("mergeSchema", "true") \
        .saveAsTable(stg_df_table_name)
else:
    prod_df_table.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy(partition_col) \
        .option("mergeSchema", "true") \
        .saveAsTable(stg_df_table_name)

prod_df_table.unpersist()
I'm also using a small cluster with 14 GB of memory and 4 cores and no worker nodes (single node). Any suggestions to improve the copy speed :(? Some of the tables I'll be working with are around 10 GB.
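One thing I was wondering about: would a Delta DEEP CLONE be faster than reading and rewriting the DataFrame, since it copies the table files directly? A rough sketch of what I had in mind (assuming DEEP CLONE is supported across my two Unity Catalog catalogs; table_name comes from the same loop as above):

# Possible alternative (sketch, not tested): let Delta copy the table files instead of
# going through a full read/write in Spark
src = f"prod_catalog.`00_bronze_layer`.pg_{table_name}"
dst = f"stg_catalog.`00_bronze_layer`.pg_{table_name}"
spark.sql(f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}")

Would that be a reasonable approach for 10 GB tables on a single-node cluster, or is there something better?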