
Unity Catalog: RDD Issue

shwetamagar
New Contributor II

In our existing notebooks, the scripts are reliant on RDDs. However, with the upgrade to Unity Catalog, RDDs will no longer be supported. We need to explore alternative approaches or tools to replace the use of RDDs. Could you suggest the best practices or migration strategies for this transition?

1 ACCEPTED SOLUTION


Walter_C
Databricks Employee

To transition from using RDDs (Resilient Distributed Datasets) to alternative approaches supported by Unity Catalog, you can follow these best practices and migration strategies:

  1. Use the DataFrame API: The DataFrame API is the recommended alternative to RDDs. It provides a higher-level abstraction for data processing and is optimized by Spark's query planner. You can convert your existing RDD-based code to use DataFrames, which are fully supported in Unity Catalog (see the first sketch after this list).

  2. Replace RDD Operations:

    • For operations like sc.parallelize, use spark.createDataFrame with a list of dictionaries or Row objects (second sketch below).
    • For creating empty DataFrames, use spark.createDataFrame with an empty list and a defined schema.
    • For mapPartitions, rewrite the logic using DataFrame transformations or Arrow-backed APIs such as mapInPandas and pandas UDFs (third sketch below).
  3. Avoid Spark Context and SQL Context: Unity Catalog does not support direct access to Spark Context (sc) and SQL Context (sqlContext). Use the spark variable to interact with the SparkSession instance instead.

  4. Use Volumes for File Access: Instead of using DBFS (Databricks File System) mount points, use Unity Catalog Volumes for file storage and access. This keeps your data access governed and secure (see the Volumes sketch below).

  5. Update Cluster Configurations: Ensure that your clusters are running Databricks Runtime 13.3 or higher, and configure them to use shared or single-user access modes as appropriate for your workloads.

  6. Migrate Streaming Jobs: If you have streaming jobs that rely on RDDs or DStreams, refactor them to use the Structured Streaming API, and move checkpoint directories to Volumes (see the streaming sketch below).

  7. Handle UDFs and Libraries: For user-defined functions (UDFs) and custom libraries, ensure they are compatible with the DataFrame API and Unity Catalog. Use cluster policies to manage library installations.

  8. Use the SYNC Command: For migrating tables from the Hive metastore or AWS Glue to Unity Catalog, use the SYNC command to synchronize schema and table metadata (see the SYNC and DEEP CLONE sketch below).

  9. Upgrade Managed and External Tables: Use the upgrade wizard in Data Explorer to upgrade managed and external tables to Unity Catalog. For managed tables, consider using DEEP CLONE for Delta tables to preserve the delta log.

  10. Refactor Jobs and Notebooks: Evaluate and refactor your jobs and notebooks to ensure compatibility with Unity Catalog. This includes updating references to tables, paths, and configurations.
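
The sketches below illustrate some of these steps; all catalog, schema, table, path, and column names are hypothetical placeholders, not part of your environment. First, a minimal example of replacing an RDD-style pipeline with the DataFrame API, driven entirely by the spark session variable:

```python
from pyspark.sql import functions as F

# Old RDD-style code depended on sc, which is not available on
# Unity Catalog shared clusters:
#   rdd = sc.textFile("/mnt/raw/events.csv")
#   totals = (rdd.map(lambda line: line.split(","))
#                .map(lambda p: (p[0], int(p[1])))
#                .reduceByKey(lambda a, b: a + b))

# DataFrame equivalent, driven by the SparkSession (spark) variable:
df = (
    spark.read.csv("/Volumes/main/raw/files/events.csv", header=False)
    .toDF("user_id", "amount")
)
totals = (
    df.withColumn("amount", F.col("amount").cast("int"))
      .groupBy("user_id")
      .agg(F.sum("amount").alias("total_amount"))
)
totals.show()
```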
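
Second, a sketch of the sc.parallelize and empty-DataFrame replacements from step 2:

```python
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Instead of sc.parallelize([...]), build the DataFrame directly:
rows = [Row(id=1, name="alice"), Row(id=2, name="bob")]
df = spark.createDataFrame(rows)

# Instead of sc.emptyRDD() plus a schema, create an empty DataFrame
# from an empty list and an explicit schema:
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
empty_df = spark.createDataFrame([], schema)
```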
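
Third, for mapPartitions-style logic one Arrow-backed option is DataFrame.mapInPandas, which hands your function an iterator of pandas DataFrames per partition; this is one possible rewrite, not the only one:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative per-partition logic that previously lived in
# rdd.mapPartitions(...): each element of `batches` is a pandas
# DataFrame holding one Arrow batch of the partition.
def add_tax(batches):
    for pdf in batches:
        pdf["amount_with_tax"] = pdf["amount"] * 1.1
        yield pdf

out_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("amount_with_tax", DoubleType(), True),
])

df = spark.createDataFrame([("u1", 10.0), ("u2", 25.0)], ["user_id", "amount"])
enriched = df.mapInPandas(add_tax, schema=out_schema)
enriched.show()
```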
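
A sketch of reading files through a Unity Catalog Volume path instead of a DBFS mount (the volume shown is hypothetical):

```python
# DBFS mount style (to be retired):
#   df = spark.read.json("/mnt/landing/orders/")

# Unity Catalog Volume path (catalog.schema.volume -> /Volumes/<catalog>/<schema>/<volume>):
orders_df = spark.read.json("/Volumes/main/landing/raw_files/orders/")

# Plain Python file APIs also resolve Volume paths on UC-enabled compute:
with open("/Volumes/main/landing/raw_files/config/settings.json") as f:
    settings = f.read()
```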
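
A sketch of a Structured Streaming job with its checkpoint in a Volume; the Auto Loader source and target table names are illustrative assumptions:

```python
# RDD/DStream-based streaming rewritten as Structured Streaming with
# Auto Loader; checkpoint and schema locations live in a Volume.
stream_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/landing/checkpoints/orders_schema")
    .load("/Volumes/main/landing/raw_files/orders/")
)

query = (
    stream_df.writeStream
    .option("checkpointLocation", "/Volumes/main/landing/checkpoints/orders_bronze")
    .trigger(availableNow=True)
    .toTable("main.bronze.orders")
)
```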
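
Finally, the SYNC and DEEP CLONE steps are SQL commands that can be run from a notebook via spark.sql; the source and target table names below are hypothetical:

```python
# Synchronize an existing Hive metastore (or Glue) table's metadata
# into Unity Catalog as an external table:
spark.sql("""
  SYNC TABLE main.sales.transactions
  FROM hive_metastore.sales.transactions
""")

# For a managed Delta table, DEEP CLONE copies the data and metadata
# into a new Unity Catalog managed table:
spark.sql("""
  CREATE OR REPLACE TABLE main.sales.customers
  DEEP CLONE hive_metastore.sales.customers
""")
```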

