I've been working on extracting DDL at the schema level from the Hive Metastore in GCP-hosted Databricks. I've written a Python script that generates a SQL file in the /tmp directory on DBFS. However, when I run it, I get a "file path not found" error. Strangely, the same code works fine on AWS-hosted Databricks accounts. Could anyone shed light on why this discrepancy exists on GCP?
I also tried retrieving the results through the user interface, but it didn't return the complete DDL; many tables were missing due to UI limitations.
Are there any potential workarounds or suggestions for addressing this issue?
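In case it helps narrow things down, the error seems to come from the file-write step rather than from the DDL queries themselves. Here is a minimal sketch that isolates just that step (same /tmp folder as in the full script below); I haven't confirmed this is the exact failure point on GCP, and the test file name is just a placeholder:

# minimal write test: create a folder on DBFS and write a file through the /dbfs local mount
folder_for_script = "/tmp/"
dbutils.fs.mkdirs(folder_for_script)               # DBFS path, i.e. dbfs:/tmp/
test_path = folder_for_script + "write_test.txt"   # placeholder test file name
with open("/dbfs" + test_path, "w") as f:          # local-file view of DBFS
    f.write("hello from GCP\n")
print(dbutils.fs.head(test_path))                  # read it back via dbutils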
Here is the Python code:
# set your catalog name
catalog = "your_catalog_name"
# comma-separated list of schemas, or a single schema name
schemas = "schema_name".split(",")
spark.catalog.setCurrentCatalog(catalog)

# prepare the output file
folder_for_script = "/tmp/"
# create the folder on DBFS if it does not exist
dbutils.fs.mkdirs(folder_for_script)
file_path = "{}{}_ddl.sql".format(folder_for_script, catalog)

# create and open the file for writing through the /dbfs local-file mount
f = open("/dbfs" + file_path, "w")
f.truncate()

for schema in schemas:
    allTables = spark.catalog.listTables(schema)
    f.write("-- {}".format(schema))
    f.write("\n")
    for t in allTables:
        # skip temporary tables and views
        if not t.isTemporary:
            try:
                ddl = spark.sql("SHOW CREATE TABLE {}.{}".format(schema, t.name))
                f.write(ddl.first()[0] + ";")
                f.write("\n")
            except Exception as error:
                f.write("\n --- START ERROR --- \n /*\n")
                f.write("name: {}.{},\ntableType: {}\n".format(schema, t.name, t.tableType))
                f.write("Unknown exception: {}".format(error))
                f.write("*/\n --- END ERROR --- \n")
f.close()

# console output: read the generated file back from DBFS and print it
script = spark.sparkContext.textFile(file_path)
file_data = script.collect()
for line in file_data:
    print(line)
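One workaround I was wondering about is avoiding the /dbfs local-file mount entirely and writing the script through dbutils.fs.put instead, then reading it back as before. A rough sketch of what I mean (untested on GCP, and it buffers the whole script in memory as a single string):

# build the DDL script in memory, then write it to DBFS with dbutils.fs.put
lines = []
for schema in schemas:
    lines.append("-- {}".format(schema))
    for t in spark.catalog.listTables(schema):
        if not t.isTemporary:
            try:
                ddl = spark.sql("SHOW CREATE TABLE {}.{}".format(schema, t.name))
                lines.append(ddl.first()[0] + ";")
            except Exception as error:
                lines.append("-- ERROR for {}.{}: {}".format(schema, t.name, error))

# overwrite=True replaces any existing file at dbfs:/tmp/<catalog>_ddl.sql
dbutils.fs.put(file_path, "\n".join(lines), True)

Would something like this be a reasonable workaround, or would it hit the same limitation on GCP?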
Thanks