cloning the data between two catalogs

jeremy98 — Thu, 28 Nov 2024 11:17:55 GMT

Hello community,

I was writing this piece of code to do the data migration between two catalogs:

# Read data and partitioning print(f"Loading {table_name} from production catalog...") prod_df_table_name = f"prod_catalog.`00_bronze_layer`.pg_{table_name}" prod_df_table = spark.read.table(prod_df_table_name).persist(StorageLevel.MEMORY_AND_DISK) # Write with optimizations print(f"Saving {table_name} to Bronze Schema") stg_df_table_name = f"stg_catalog.`00_bronze_layer`.pg_{table_name}" if partition_col in ["all", "none", "export_run_timestamp", "updated_at", "received_at"]: prod_df_table.write \ .format("delta") \ .mode("overwrite") \ .option("mergeSchema", "true") \ .saveAsTable(stg_df_table_name) else: prod_df_table.write \ .format("delta") \ .mode("overwrite") \ .partitionBy(partition_col) \ .option("mergeSchema", "true") \ .saveAsTable(stg_df_table_name) prod_df_table.unpersist()

I also using a little cluster with 14 gb and 4 cores without any worker node, any suggestion to improve the speed of copy :(? I'm going to work with tables that have sometimes 10 gigabytes of data

Re: cloning the data between two catalogs

jeremy98 — Thu, 28 Nov 2024 13:47:15 GMT

FYI,
I did it increasing the size of the cluster using much cores and directly written:

prod_df_table.write \

.format("delta") \

.mode("overwrite") \

.saveAsTable(stg_df_table_name)

topic cloning the data between two catalogs in Data Engineering

cloning the data between two catalogs

Re: cloning the data between two catalogs