05-31-2023 03:29 AM
Hi There,
I have been trying to create an external table on Azure Databricks with the statement below.
df.write \
    .partitionBy("year", "month", "day") \
    .format('org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat') \
    .option("path", sourcepath) \
    .mode("overwrite") \
    .saveAsTable('test.table_test')
sourcepath above is a path on the ADLS storage account.
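For reference, an ADLS Gen2 path takes this general form (a sketch; the container and folder names are placeholders):

sourcepath = "abfss://<container>@<storage account>.dfs.core.windows.net/<folder path>"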
The code works and the table is created pointing to the ADLS storage account container, but I get the following exception:
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.contracts.exceptions.KeyProviderException Failure to initialize configuration)
I also tried setting the configs at the cluster level, as suggested elsewhere. I configured the following cluster-level configs:
fs.azure.account.oauth2.client.endpoint.<STORAGE ACCOUNT>.dfs.core.windows.net https://login.microsoftonline.com/<tenant id>/oauth2/token
fs.azure.account.auth.type.<STORAGE ACCOUNT>.dfs.core.windows.net OAuth
fs.azure.account.oauth2.client.id.<STORAGE ACCOUNT>.dfs.core.windows.net ClientId
fs.azure.account.oauth.provider.type.<STORAGE ACCOUNT>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.secret.<STORAGE ACCOUNT>.dfs.core.windows.net {{secrets/scope/Secret_Key_Name}}
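For session-level testing, roughly the same settings can be applied from a notebook; a minimal sketch, assuming the same service principal and secret scope (all angle-bracket values are placeholders):

storage_account = "<STORAGE ACCOUNT>"
# Read the client secret from the same Databricks secret scope as above.
client_secret = dbutils.secrets.get(scope="scope", key="Secret_Key_Name")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<client id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant id>/oauth2/token")

Note that session-level configs apply to DataFrame reads and writes; metastore operations such as saveAsTable may still rely on the cluster-level configs.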
But it still does not work and throws the same exception.
Can anyone who has encountered and resolved this issue guide me on whether I am missing any configs above, and what could be the reason I am getting the above exception even though the data is written to ADLS and a table is created pointing to the same data?
Thanks
06-03-2023 05:45 PM
You should just set up your external location with the GUI tool in the Data tab (bottom left, where it says External Locations). You'll have to set up a service principal and add a Databricks access connector in your Azure resource group. This is the easiest way, so you don't have to change the Spark config.
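If you prefer SQL over the GUI, the equivalent is roughly this (a sketch; my_external_location and my_storage_credential are hypothetical names, and the credential must already exist, backed by the access connector):

# Hypothetical names; the storage credential must already be registered.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
    URL 'abfss://<container>@<storage account>.dfs.core.windows.net/<folder path>'
    WITH (STORAGE CREDENTIAL my_storage_credential)
""")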
06-05-2023 07:56 AM
Hi,
Thank you for taking time to answer my query.
I tried setting up an external location under the Data tab in the Databricks workspace, and also added a Databricks access connector to the resource group used by the workspace.
But the exception remains the same, even after I specified the storage account key in the cluster-level advanced options.
.save() works perfectly without any issues, but .saveAsTable() gives the exception.
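For reference, the two calls side by side (a sketch reusing the names from the original post); .save() only writes files, while .saveAsTable() also registers the table in the metastore:

# Writes files only; no metastore involvement.
df.write.partitionBy("year", "month", "day").format("parquet") \
    .mode("overwrite").save(sourcepath)

# Also registers test.table_test in the metastore, which is the step
# where the KeyProviderException surfaces.
df.write.partitionBy("year", "month", "day").format("parquet") \
    .option("path", sourcepath).mode("overwrite").saveAsTable("test.table_test")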
Thank you
06-05-2023 01:28 PM
Well, the error is talking about the Hive metastore, so that means you're still not connecting to your external location.
Can you create a table on the external location with this (assuming you have a JSON file in a blob folder location)?
# Set variables (angle-bracket values are placeholders)
catalog_name = "test_catalog"
schema_name = "test_schema"
catalog_container = "<catalog container string>"
storage_account = "<storage account string>"
blob_path = f"abfss://{catalog_container}@{storage_account}.dfs.core.windows.net/<folder path>"

# Create catalog if it doesn't exist
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name} MANAGED LOCATION 'abfss://{catalog_container}@{storage_account}.dfs.core.windows.net/'")

# Create schema if it doesn't exist
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")

# Read files into a dataframe
df = spark.read.format('json').load(blob_path)
df.display()
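If that read works, a possible follow-up is writing the dataframe back as a table under the new schema (a sketch; the table name test_table is hypothetical):

# Hypothetical table name; add .option("path", ...) to make it external.
df.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.test_table")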
06-05-2023 05:09 AM
Setting up the external location in the Data tab is a must; only then can you proceed.
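To confirm the location is registered, a quick check (assumes Unity Catalog; my_external_location is a hypothetical name):

# List the external locations visible to the workspace, then inspect one.
spark.sql("SHOW EXTERNAL LOCATIONS").show(truncate=False)
spark.sql("DESCRIBE EXTERNAL LOCATION my_external_location").show(truncate=False)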
06-05-2023 05:12 AM
The special characters are converted to '?????' because the Hive metastore stores data in a binary format. When you create a view, the metastore converts the data in the view to that binary format, and the conversion strips out any non-ASCII characters.
To preserve the special characters in the view representation, you can use the following workaround:
CREATE VIEW my_view AS
SELECT
*,
encode(string_attribute, 'utf-8') AS encoded_string_attribute
FROM my_table;
SELECT
*
FROM my_view
WHERE
encoded_string_attribute != 'シミュレータに接続されていません';
This creates a new view called my_view that contains the original data from my_table plus a new column, encoded_string_attribute, holding the encoded version of the string_attribute column. The WHERE clause of the follow-up query uses encoded_string_attribute to filter out rows where string_attribute equals 'シミュレータに接続されていません'.
This workaround preserves the special characters in the view representation. However, it also makes the view slightly larger, because the encoded version of the string_attribute column is slightly larger than the original.
If you want to avoid the overhead of storing the encoded version of the string_attribute column, you can use the following alternative workaround:
-- Note: escape() is not a Spark SQL built-in; this assumes a UDF or helper
-- that escapes non-ASCII characters is registered under that name.
CREATE VIEW my_view AS
SELECT
  *,
  escape(string_attribute) AS escaped_string_attribute
FROM my_table;

SELECT
  *
FROM my_view
WHERE
  escaped_string_attribute != 'シミュレータに接続されていません';
This creates a new view called my_view containing the original data from my_table plus a new column, escaped_string_attribute, holding the escaped version of the string_attribute column. The WHERE clause of the follow-up query uses escaped_string_attribute to filter out rows where string_attribute equals 'シミュレータに接続されていません'.
This workaround preserves the special characters in the view representation without making the view any larger. However, it makes the WHERE clause slightly slower, because extra work is needed at query time to escape the special characters.
Which workaround you choose depends on your specific needs. If you need the view to be as fast as possible, use the first workaround; if you would rather avoid the extra storage, use the second.
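To try either workaround end to end, a small test table can be created first (a sketch; the table and column names come from the examples above):

# Create my_table with one Japanese and one ASCII value to test the views.
df = spark.createDataFrame(
    [("シミュレータに接続されていません",), ("connected",)],
    ["string_attribute"],
)
df.write.mode("overwrite").saveAsTable("my_table")

# After creating my_view with either workaround:
spark.sql("SELECT * FROM my_view").show(truncate=False)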
06-07-2023 12:46 AM
Hi @Gaurishankar Sakhare
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!