Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

External connection to hive metastore

Pilsner
Contributor III

Hello,

We are currently having an issue writing tables to the Hive metastore. I've tried to outline below what we know and what we have tried so far:

Known situation:

  • We have a Databricks environment, but it does not currently use Unity Catalog. We instead have an Azure Data Lake, which we can access through the hive_metastore "Legacy" section.
  • When I select a schema, I can see that its location begins with "abfss".
  • In the Databricks UI, I can simply navigate to the schema I want to write to and then select Create. This allows me to create a table by uploading a file such as a CSV.
  • Once I have done this, I can see that the table type is managed.
  • In Alteryx, I've been able to set up a connection (using the Simba Spark ODBC driver) that allows us to read data from the hive_metastore.

Issue:
We want to be able to write data out from Alteryx to the hive_metastore / the underlying Azure storage, but have not yet been able to do so.

 

What we have tried:

On our system, we have tried many options, but we believe the closest to success was when we did the following:

In the ODBC DSN, we set up the driver with the username and password (access token), server hostname, HTTP path, etc.

In Alteryx, we then selected this DSN and also provided information found in our Azure environment:
  • Storage account name
  • Container name
  • Temp directory
  • Shared key

 

When we tried to create a table using this method, we could see in the Databricks UI that the table name had been created, but none of the data was there.

 

From the Azure side, we could also see that a folder had been created for this new table in the temp directory.

Looking in the query history, the issue seems to relate to the COPY INTO command.

We are unsure what to do from this point onwards, so if you could help us get this connected, that would be greatly appreciated.

Please note: whilst we are trying to connect via Alteryx (which simply uses an ODBC connection), a Python solution would still be helpful as we can deploy this via Alteryx.

1 REPLY

BigRoux
Databricks Employee

Here are some recommendations and tips/tricks:

1. Understanding the Architecture & Common Issues

In legacy Databricks setups, the Hive Metastore is used to manage tables and their metadata. When writing via ODBC (using the Simba Spark driver) or an engine like Alteryx, the workflow involves:
  • Authenticating and connecting via ODBC or JDBC to Databricks clusters.
  • Initiating table creation or data insert operations, which must land in the correct abfss-backed storage.
  • For managed tables, the Hive Metastore chooses the storage location (often under abfss).
Typical challenges include:
  • Table metadata is created successfully, but the data files are missing.
  • Temporary directories are created as expected, but COPY INTO or data-load steps fail.
  • Permission or credential issues on Azure Data Lake Storage.
  • Misconfiguration in the ODBC DSN or connection properties.
  • Schema mismatches or format errors in the source data.
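
To make the failure point more concrete, here is a minimal sketch of the same two-step pattern (create the table, then COPY INTO from a staged location) driven from Python with the Databricks SQL Connector. The hostname, HTTP path, token, schema, table, and storage paths below are placeholders, not values from this thread:

    # Minimal sketch of the "create table, then COPY INTO staged files" flow that
    # BI tools drive over ODBC. All names, paths, and credentials are placeholders.
    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abcdef1234567890",
        access_token="<personal-access-token>",
    ) as conn:
        with conn.cursor() as cur:
            # 1. Create the target table in the Hive metastore (managed: no LOCATION clause).
            #    Drop the "hive_metastore." prefix if three-level names are not enabled.
            cur.execute("""
                CREATE TABLE IF NOT EXISTS hive_metastore.my_schema.my_table (id INT, name STRING)
            """)
            # 2. Load the staged files. If this step fails, the table exists but stays empty,
            #    which matches the symptom described above; the COPY INTO error message
            #    usually names the credential, schema, or path problem.
            cur.execute("""
                COPY INTO hive_metastore.my_schema.my_table
                FROM 'abfss://mycontainer@mystorageacct.dfs.core.windows.net/alteryx_temp/staged/'
                FILEFORMAT = CSV
                FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
            """)

If the same COPY INTO succeeds here but fails when issued by Alteryx, the difference is usually the identity or storage credential attached to the session.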

2. ODBC (Simba Spark Driver) Configuration and Permissions

Key configuration points:
  • Server Hostname (found in the cluster or warehouse connection details).
  • HTTP Path (from the Databricks cluster or warehouse).
  • Access token for authentication (or, optionally, Azure AD / Managed Identity auth).
  • For storage: storage account name, container, possibly a temp directory, and permissions.
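
For reference, this is what those settings look like in a hypothetical DSN-less connection string for the Simba Spark ODBC driver, exercised here from Python via pyodbc. The hostname, HTTP path, and token are placeholders, and the Driver value must match the name the installed driver registers on your machine:

    # Hypothetical DSN-less connection through the Simba Spark ODBC driver (placeholders throughout).
    import pyodbc

    conn_str = (
        "Driver=Simba Spark ODBC Driver;"                   # as registered by the installed driver
        "Host=adb-1234567890123456.7.azuredatabricks.net;"  # placeholder workspace hostname
        "Port=443;"
        "HTTPPath=/sql/1.0/warehouses/abcdef1234567890;"    # placeholder HTTP path
        "SSL=1;"
        "ThriftTransport=2;"                                # HTTP transport
        "AuthMech=3;"                                       # user/password auth, where the password is a token
        "UID=token;"
        "PWD=<personal-access-token>;"
    )

    conn = pyodbc.connect(conn_str, autocommit=True)
    cur = conn.cursor()
    cur.execute("SELECT current_database()")
    print(cur.fetchone())
    conn.close()

If this reads successfully but writes still fail, the DSN itself is fine and the problem sits in the storage/staging configuration that Alteryx layers on top.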
 
Troubleshooting steps:
  • Ensure the cluster attached to the ODBC session is running and healthy.
  • Double-check that the DSN / connection string matches the Databricks workspace details.
  • If using a temp directory and shared key for storage, validate that the service principal or identity used for the connection has Storage Blob Data Contributor permissions, at least at the container level (ideally at the account level), for both the temp directory and the target table location in abfss.
  • If write operations create tables with missing data:
    • Review the Databricks query history for errors with COPY INTO; the error message often discloses whether the issue is credential passthrough, a format mismatch, a schema mismatch, or the destination path.
    • In Azure, check whether the temp directory and the table data root both exist and are accessible.
    • Check for explicit errors such as COPY_INTO_SCHEMA_MISMATCH, missing storage credentials, or permission denied. For common error messages, refer to Microsoft's documentation ("Error conditions in Azure Databricks"), which describes conditions like COPY_INTO_CREDENTIALS_REQUIRED or HIVE_METASTORE_TABLE_PLACEHOLDER_PATH_NOT_SET.
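
One way to rule the storage side in or out, independently of the ODBC path, is to check from a Databricks notebook that both the temp directory and the schema's abfss root are reachable with the credentials you intend to use. A sketch, assuming a shared key, placeholder account/container/path names, and the notebook-provided spark, dbutils, and display objects:

    # Notebook sketch: verify the temp directory and the schema root are both reachable.
    # Account, container, secret scope, and paths are placeholders.
    storage_account = "mystorageacct"
    spark.conf.set(
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
        dbutils.secrets.get(scope="my-scope", key="storage-shared-key"),
    )

    # The schema's managed-table root is reported by: DESCRIBE SCHEMA EXTENDED my_schema
    schema_root = f"abfss://mycontainer@{storage_account}.dfs.core.windows.net/my_schema.db/"
    temp_dir = f"abfss://mycontainer@{storage_account}.dfs.core.windows.net/alteryx_temp/"

    # Both calls should list contents without an authorization error; a 403 here points
    # at missing "Storage Blob Data Contributor" rights or a firewall rule.
    display(dbutils.fs.ls(temp_dir))
    display(dbutils.fs.ls(schema_root))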
 
Best practices:
  • Using the "abfss://" URI ensures the table lands in Azure Data Lake Storage Gen2 with a hierarchical namespace (required).
  • For ODBC, prefer OAuth2 token-based authentication over direct key sharing when possible, for security and compatibility.
  • Use the latest Simba Spark ODBC driver and its configuration guides, especially if you want to switch from password/key auth to service principal or managed identity auth.
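
For the storage layer, one common way to move off the shared key is an Azure AD service principal. A sketch of the Spark-side ADLS Gen2 OAuth configuration, with placeholder account, tenant, client, and secret-scope names (the service principal still needs the Storage Blob Data Contributor role described in the checklist below):

    # Sketch: ADLS Gen2 access via an Azure AD service principal instead of a shared key.
    # Storage account, tenant ID, client ID, and secret names are placeholders.
    storage_account = "mystorageacct"
    tenant_id = "<azure-tenant-id>"
    client_id = "<service-principal-client-id>"
    client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")

    suffix = f"{storage_account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{suffix}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
    spark.conf.set(
        f"fs.azure.account.oauth2.client.endpoint.{suffix}",
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    )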

3. Permissions Checklist

To write data or create managed tables in the Hive Metastore:
  • The writing identity (user or service principal) needs at least CREATE and MODIFY privileges on the schema, and sufficient rights to write to the abfss storage used for managed tables.
  • In the Azure Portal, assign "Storage Blob Data Contributor" at the container or account level for the identity used.
  • If you use "Storage Blob Data Contributor" at the container (not account) level, also grant "Storage Blob Delegator" at the storage account level.
  • Confirm that no network/firewall rules are blocking Databricks access to the storage account.
Common error symptoms:
  • Data files or folders are created in the temp directory but never promoted: check permission escalation and the temp-to-final move.
  • Error messages naming permission denied, credential required, or a missing path.
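
If the cluster also enforces legacy table access control, the schema-level privileges above can be granted in SQL. A sketch with placeholder schema and principal names; it only applies on table-ACL-enabled clusters and can be skipped otherwise:

    # Sketch: legacy (non-Unity Catalog) schema grants on a table-ACL-enabled cluster.
    # Schema and principal names are placeholders.
    for privilege in ("USAGE", "CREATE", "MODIFY"):
        spark.sql(f"GRANT {privilege} ON SCHEMA my_schema TO `someone@example.com`")

    # Inspect what the principal now holds on the schema.
    spark.sql("SHOW GRANT `someone@example.com` ON SCHEMA my_schema").show(truncate=False)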

4. Python Workaround via PySpark (as a fallback)

If Alteryx/ODBC write paths continue to fail and you are able to run Python code (either from Databricks notebooks or via Alteryx’s Python tools), the following approach is recommended:
  • Use PySpark to read your CSV/data and write it to the Hive Metastore managed table location:

        df = spark.read.csv('dbfs:/FileStore/tmp/yourfile.csv', header=True)
        df.write.saveAsTable('hive_metastore.schema.table_name')

    This writes to the managed location defined for the schema and will trigger the Hive Metastore to update accordingly.
  • For external tables, you can specify a fully qualified abfss path. For managed tables, do not specify a path; let the metastore choose. (A sketch of the external-table variant follows this list.)
  • After writing, confirm with DESCRIBE FORMATTED schema.table_name in a notebook, and validate in the Databricks UI that the table is “managed” and data lands as expected.
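
For completeness, here is a sketch of the external-table variant mentioned in the list above. Container, account, and table names are placeholders, and it assumes storage credentials for the abfss path are already configured on the cluster or session (shared key or service principal, as sketched earlier):

    # Sketch: external table at an explicit abfss location (placeholder names and paths).
    external_path = "abfss://mycontainer@mystorageacct.dfs.core.windows.net/exports/my_table"

    df = spark.read.csv("dbfs:/FileStore/tmp/yourfile.csv", header=True)

    (df.write
       .mode("overwrite")
       .option("path", external_path)  # an explicit path makes the table EXTERNAL
       .saveAsTable("hive_metastore.schema.table_name_ext"))

    # Confirm the table type and location afterwards.
    spark.sql("DESCRIBE FORMATTED hive_metastore.schema.table_name_ext").show(truncate=False)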

5. Next Steps and Recommendations

  1. Audit Permissions
    • Double-check the permissions/roles assigned to the identity that runs the ODBC/Alteryx or Python job. Be sure it can write to both the temp and final abfss locations.
  2. Review ODBC Write Implementations
    • Some BI/ETL tools (including Alteryx) may not fully implement the COPY INTO or data staging mechanisms required by Databricks via ODBC. If the query history in Databricks indicates a failed or incomplete COPY INTO, investigate the format (column alignment, data types, etc.).
  3. Check Table Types
    • Ensure you’re not mixing managed and external table semantics by mistake. In legacy setups, for managed tables, do not override location.
  4. Temp Directory Ownership
    • If temp data lands, but does not move: check if Databricks’ process has permission to rename/move data from temp to final.
  5. Fallback to PySpark
    • If ODBC/Alteryx workflows are not writing successfully after addressing permissions and config, execute the same data ingestion via Databricks notebooks or jobs using PySpark as above. This will use Databricks-native APIs for table creation and is the most supported route for managed table writes.

Summary Table: Troubleshooting and Best Practices

  • Storage Permissions: "Storage Blob Data Contributor" on the storage account/container for the writing identity; network access configured.
  • ODBC Configuration: Use the latest Simba Spark driver with the correct Server Hostname and HTTP Path; prefer OAuth2/Managed Identity.
  • Temp Directory: Temp data moves to the final location only if both the temp and target paths are accessible and permitted.
  • Table Type: Managed table: don't specify LOCATION; let the metastore assign it. External table: specify an explicit abfss path.
  • Error Logs: Check the Databricks query history, especially for COPY INTO failures.
  • Python Fallback: Use df.write.saveAsTable('hive_metastore.schema.table') from PySpark as a robust ingestion method.
  • Metadata Sync: After non-UI writes, run MSCK REPAIR TABLE to sync metadata if needed.
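
As a companion to the Metadata Sync row, a short notebook snippet; table names are placeholders, and MSCK REPAIR TABLE is only relevant for partitioned tables whose partition folders were written to storage outside Spark:

    # Sketch of the metadata checks from the summary table (placeholder names).
    # Confirm the table type (MANAGED vs EXTERNAL) and its abfss location.
    spark.sql("DESCRIBE FORMATTED hive_metastore.my_schema.my_table").show(truncate=False)

    # Register partition folders created outside Spark (partitioned tables only).
    spark.sql("MSCK REPAIR TABLE hive_metastore.my_schema.my_partitioned_table")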

Final Notes

  • The legacy Hive Metastore approach is fully supported for both reads and writes, but ongoing development is focused on Unity Catalog—consider a migration plan in the long term.
  • When troubleshooting, always check whether a permissions mismatch, missing credentials, or a configuration error is present by reading the detailed error messages; these usually pinpoint the root cause.
  • Each step (permission audits, config validation, reviewing query logs/history, trying PySpark alternatives) is critical for isolating where the write workflow breaks down.

Hope this helps, Louis.