In our existing notebooks, the scripts are reliant on RDDs. However, with the upgrade to Unity Catalog, RDDs will no longer be supported. We need to explore alternative approaches or tools to replace the use of RDDs. Could you suggest the best practices or migration strategies for this transition?
Accepted Solutions
To transition from using RDDs (Resilient Distributed Datasets) to alternative approaches supported by Unity Catalog, you can follow these best practices and migration strategies:
- Use DataFrame API: The DataFrame API is the recommended alternative to RDDs. It provides a higher-level abstraction for data processing and is optimized for performance. You can convert your existing RDD-based code to use DataFrames, which are supported in Unity Catalog.
- Replace RDD Operations (see the first sketch after this list):
  - For operations like `sc.parallelize`, use `spark.createDataFrame` with a list of dictionaries or `Row` objects.
  - For creating empty DataFrames, use `spark.createDataFrame` with an empty list and a defined schema.
  - For `mapPartitions`, rewrite the logic using DataFrame transformations and PySpark native Arrow UDFs (e.g., pandas UDFs).
- Avoid Spark Context and SQL Context: Unity Catalog does not support direct access to the Spark Context (`sc`) or the SQL Context (`sqlContext`). Use the `spark` variable to interact with the SparkSession instance instead.
- Use Volumes for File Access: Instead of using DBFS (Databricks File System) mount points, use Unity Catalog Volumes for file storage and access. This ensures that your data access is governed and secure (see the Volumes sketch below).
- Update Cluster Configurations: Ensure that your clusters are running Databricks Runtime 13.3 or higher, and configure them to use shared or single-user access modes as appropriate for your workloads.
- Migrate Streaming Jobs: If you have streaming jobs that use RDDs, refactor them to use the Structured Streaming API, and ensure that checkpoint directories are moved to Volumes (see the streaming sketch below).
- Handle UDFs and Libraries: For user-defined functions (UDFs) and custom libraries, ensure they are compatible with the DataFrame API and Unity Catalog. Use cluster policies to manage library installations.
- Use the SYNC Command: For migrating tables from Hive/Glue to Unity Catalog, use the SYNC command to synchronize schema and table metadata (see the SYNC sketch below).
- Upgrade Managed and External Tables: Use the upgrade wizard in Data Explorer to upgrade managed and external tables to Unity Catalog. For managed tables, consider using DEEP CLONE for Delta tables to preserve the delta log (see the clone sketch below).
- Refactor Jobs and Notebooks: Evaluate and refactor your jobs and notebooks to ensure compatibility with Unity Catalog. This includes updating references to tables, paths, and configurations.
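
A minimal sketch of the RDD-to-DataFrame replacements above, assuming a Databricks notebook where `spark` (the SparkSession) is predefined; the column names and the `upper_name` UDF are made up for illustration:

```python
from pyspark.sql import Row
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StructType, StructField, StringType, LongType
import pandas as pd

# `spark` is the SparkSession predefined in Databricks notebooks.

# Instead of sc.parallelize([...]), build a DataFrame directly.
rows = [Row(id=1, name="alice"), Row(id=2, name="bob")]
df = spark.createDataFrame(rows)

# Instead of an empty RDD, create an empty DataFrame with an explicit schema.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
empty_df = spark.createDataFrame([], schema)

# Instead of rdd.mapPartitions(fn), express the per-record logic as an
# Arrow-backed pandas UDF and apply it with DataFrame transformations.
@pandas_udf(StringType())
def upper_name(names: pd.Series) -> pd.Series:
    return names.str.upper()

df = df.withColumn("name_upper", upper_name(col("name")))
```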
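
A sketch of file access through a Unity Catalog Volume path instead of a DBFS mount; the catalog, schema, volume, and file names are placeholders for your own objects:

```python
# Volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/...
input_path = "/Volumes/my_catalog/my_schema/landing/customers.csv"
output_path = "/Volumes/my_catalog/my_schema/landing/customers_parquet"

# Read from and write back to the governed Volume location.
customers = spark.read.option("header", "true").csv(input_path)
customers.write.mode("overwrite").parquet(output_path)
```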
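
A sketch of a Structured Streaming job whose checkpoint lives in a Volume; the table and path names are placeholders:

```python
# Read a Delta table as a stream instead of building RDD-based DStreams.
events = spark.readStream.table("my_catalog.my_schema.raw_events")

# Keep the checkpoint in a Unity Catalog Volume rather than a DBFS mount.
query = (
    events.writeStream
    .option("checkpointLocation",
            "/Volumes/my_catalog/my_schema/checkpoints/raw_events")
    .toTable("my_catalog.my_schema.clean_events")
)
```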
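
A sketch of the SYNC command run from a notebook; the catalog, schema, and table names are placeholders:

```python
# Preview what a schema-level sync from the Hive metastore would do.
spark.sql(
    "SYNC SCHEMA my_catalog.my_schema FROM hive_metastore.default DRY RUN"
).show(truncate=False)

# Synchronize a single external table's metadata into Unity Catalog.
spark.sql(
    "SYNC TABLE my_catalog.my_schema.sales FROM hive_metastore.default.sales"
)
```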
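
A sketch of upgrading a managed Delta table with DEEP CLONE; the source and target table names are placeholders:

```python
# Copy the data and Delta metadata of a managed Hive metastore table
# into a Unity Catalog managed table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.my_schema.orders
    DEEP CLONE hive_metastore.default.orders
""")
```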