cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

parameterized ALTER TABLE SET TAGS

JameDavi_51481
Contributor

I would like to use parametrized sql queries to run SET TAGs commands on tables, but cannot figure out how to parameterize the query to prevent SQL injection. Both the `?` and `:key` parameter syntaxes throw a syntax error

Basically, I'd like to do this, but without string formatting so I can protect against SQL injection. Is there a way to do this?

 

tags = ", ".join(list_of_tags)
spark.sql(
    f"ALTER TABLE IDENTIFIER(:table) SET TAGS {tags}",
    args={"table": table_name},
)

Note: the above isn't even valid syntax - once I add the IDENTIFIER clause, ANY FORM of SET TAGS following it will fail.

 

5 REPLIES 5

Advika
Databricks Employee
Databricks Employee

Hello @JameDavi_51481!

PySpark does not fully support parameterised queries for ALTER TABLE SET TAGS (a DDL Command). You can use IDENTIFIER(:table) to safely pass the table name, but for the tags, placeholders like ? or :key doesn’t seem to work. From what I understand, you’d need to manually format the tags before including them in the query. A safer way might be to validate them against a predefined list and then format them properly as a comma separated string. While it’s not fully parameterised, this should help mitigate SQL injection risks.

JameDavi_51481
Contributor

> You can use IDENTIFIER(:table) to safely pass the table name, but for the tags, placeholders like ? or :key doesn’t seem to work.

This does not seem correct in my testing. This fails on a cluster running DBR 16.1:

spark.sql("alter table IDENTIFIER(:table) SET TAGS ('foo')",
args={"table": <table_name>})

  File "/databricks/python/lib/python3.12/site-packages/grpc/_channel.py", line 966, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "
[PARSE_SYNTAX_ERROR] Syntax error at or near 'TAGS'. SQLSTATE: 42601 (line 1, pos 35)
 

Data_Mavericks
New Contributor II

Hello @JameDavi_51481 and @Advika 

1st thing we need to understand what is SQL Injection and why its not possible ?

A security hole known as SQL Injection occurs when a hacker inserts malicious SQL code into a query. Unauthorized access to databases, data alternation, ore even the removal of entire tables may result from this. Lets consider this example: SELECT * FROM users WHERE username = 'admin' or '1'='1'
Now condition 1=1 is always true,  so i can easily bypass authentication and return all user data. 

@Advika  mentioned the reason is exactly correct but you can try following 2 options:
1st approach : Using python you can prevent SQL Injection by sanitizing inputs and safely craft the query. 
example: 

def set_table_tags(spark, table_name, tags_dict):
"""
Safely sets tags on a given table in Databricks.

:param spark: The Spark session
:param table_name: The name of the table (string)
:param tags_dict: A dictionary of tag key-value pairs
"""
# Sanitize table name (remove any backticks that might be used maliciously)
safe_table_name = f"`{table_name.replace('`', '')}`"

# Format tags safely as key = 'value'
tag_assignments = ", ".join([f"'{key}' = '{value}'" for key, value in tags_dict.items()])

# Construct the SQL query
query = f"ALTER TABLE {safe_table_name} SET TAGS {tag_assignments}"

# Execute the query
spark.sql(query)
print(f"Tags successfully set for table {table_name}")

# Example usage:
tags = {"environment": "production", "owner": "data_engineer"}
set_table_tags(spark, "my_catalog.my_schema.my_table", tags)

2nd approch is more simple by creating the widgets like this 

-- Create input widgets
CREATE WIDGET TEXT table_name DEFAULT "my_catalog.my_schema.my_table";
CREATE WIDGET TEXT tag_key DEFAULT "owner";
CREATE WIDGET TEXT tag_value DEFAULT "data_engineer";

-- Construct and execute the query
SET tag_query = CONCAT(
'ALTER TABLE ', getArgument('table_name'),
' SET TAGS ', getArgument('tag_key'), ' = ', "'", getArgument('tag_value'), "'"
);
EXECUTE IMMEDIATE tag_query;

 

harshit

JameDavi_51481
Contributor

Homegrown input text sanitization within Python is notoriously failure prone and typically easily circumvented, and not an approach I would ever recommend.

Data_Mavericks
New Contributor II

@JameDavi_51481  i agree with your point so creating dynamic safe queries in python can i think still if i am not wrong? i think can use EXECUTE IMMEDIATE this can prevent risk of exposing  direct string formatting.

harshit

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now