Hello community,
I wrote this piece of code to do a data migration between two catalogs:
from pyspark import StorageLevel

# table_name and partition_col come from the loop over the tables I need to copy

# Read the source table and cache it so the write does not re-read it
print(f"Loading {table_name} from production catalog...")
prod_df_table_name = f"prod_catalog.`00_bronze_layer`.pg_{table_name}"
prod_df_table = spark.read.table(prod_df_table_name).persist(StorageLevel.MEMORY_AND_DISK)

# Write with optimizations
print(f"Saving {table_name} to Bronze Schema")
stg_df_table_name = f"stg_catalog.`00_bronze_layer`.pg_{table_name}"

if partition_col in ["all", "none", "export_run_timestamp", "updated_at", "received_at"]:
    # These values are not useful partition keys, so write the table unpartitioned
    prod_df_table.write \
        .format("delta") \
        .mode("overwrite") \
        .option("mergeSchema", "true") \
        .saveAsTable(stg_df_table_name)
else:
    prod_df_table.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy(partition_col) \
        .option("mergeSchema", "true") \
        .saveAsTable(stg_df_table_name)

prod_df_table.unpersist()
I'm also using a small cluster with 14 GB of memory and 4 cores and no worker nodes (single node). Any suggestions to improve the copy speed :(? Some of the tables I'll be working with are around 10 GB.
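One thing I was wondering about: would a Delta DEEP CLONE be faster than reading and rewriting the DataFrame, since it copies the table files directly? A rough sketch of what I had in mind (assuming DEEP CLONE is supported across my two Unity Catalog catalogs; table_name comes from the same loop as above):

# Possible alternative (sketch, not tested): let Delta copy the table files instead of
# going through a full read/write in Spark
src = f"prod_catalog.`00_bronze_layer`.pg_{table_name}"
dst = f"stg_catalog.`00_bronze_layer`.pg_{table_name}"
spark.sql(f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}")

Would that be a reasonable approach for 10 GB tables on a single-node cluster, or is there something better?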