Delta Live Tables UDFs and Versions

NotARobot
New Contributor III

Trying to do a url_decode on a column. This works great in development, but fails when running via DLT, despite trying multiple approaches:

1. pyspark.sql.functions.url_decode - This is new as of 3.5.0, but isn't supported by whatever version a DLT pipeline runs. I haven't been able to figure out what version of PySpark this actually is. It says 12.2, but I suspect that might actually be the version of something else:
dlt:12.2-delta-pipelines-dlt-release-2024.04-rc0-commit-24b74

2. Attempted to use a simple UDF that wraps urllib.parse.unquote_plus, however this appears to be unsupported with Unity Catalog. Given that the documentation states this should be supported in versions greater than 13.1, I'm again guessing the runtime version is why I get this error (a sketch of both attempts is included after this list):
pyspark.errors.exceptions.AnalysisException: [UC_COMMAND_NOT_SUPPORTED] UDF/UDAF functions are not supported in Unity Catalog

3. Have also tried using cluster policies to set the version, however regardless of what version this attempts to force, the cluster gets the same version as above. Have tried a regex, an explicit version, and auto:latest with no luck.

This leads to three questions:
1. What version of PySpark is DLT running and how can users consistently find this to know what is available for use?
2. How do users force versions if cluster policies don't work?
3. Any other recommendations for doing a URL decode via DLT? This is where the rest of our ETL pipeline runs, and I'd prefer not to fragment tables out into separate workflows to manage.

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @NotARobot

  1. PySpark Version in DLT:

    • DLT doesn’t directly expose the PySpark version it runs, so you have to infer it from the release string. In your case that is dlt:12.2-delta-pipelines-dlt-release-2024.04-rc0-commit-24b74.
    • The 12.2 in that string corresponds to the underlying Databricks Runtime, and each runtime ships a specific Spark/PySpark version. The Delta Live Tables release notes list which runtime and Spark version the CURRENT and PREVIEW channels are using.
    • If still in doubt, you can confirm the version with Databricks support.
  2. Forcing Versions via Cluster Policies:

    • While cluster policies are commonly used to set configurations, they might not directly control the PySpark version.
    • Instead, consider using a custom environment (e.g., Conda environment) where you explicitly specify the compatible PySpark version.
    • Here are the steps:
      • Pick a compatible Delta Lake version (e.g., Delta Lake 1.2) and its corresponding PySpark version (e.g., PySpark 3.2).
      • Create a YAML file (e.g., mr-delta.yml) with the required dependencies, including PySpark and Delta Lake.
      • Use Conda to create an environment based on this YAML file:
        conda env create -f envs/mr-delta.yml
        
      • Activate the environment before running your DLT pipeline.
  3. URL Decoding via DLT:

    • Since the url_decode function from PySpark 3.5.0 isn’t available in your DLT environment, consider alternative approaches:
      • UDF (User-Defined Function):
        • Although UDFs are unsupported in Unity Catalog on this runtime, you can define them outside of table or view function definitions, during graph initialization (see the sketch after this list).
        • Define a Python UDF that wraps urllib.parse.unquote_plus and apply it to your DataFrame.
      • Custom Python Transformation:
        • Write a custom Python transformation that performs URL decoding using standard Python libraries.
        • Apply this transformation within your DLT pipeline.
      • Preprocessing in Source Data:
        • If possible, perform URL decoding at the source data level before ingesting data into DLT.
        • This avoids fragmentation and keeps your ETL pipeline unified.
 


2 REPLIES

NotARobot
New Contributor III

Thanks @Kaniz. For reference, if anybody finds this, the DLT release notes are here: https://docs.databricks.com/en/release-notes/delta-live-tables/index.html
These show which versions are running for the CURRENT and PREVIEW channels. In this case, I was running on the CURRENT channel (Spark 3.3.2), so the PREVIEW channel (Spark 3.5.0) should work for the latest PySpark functions.
