Databricks

Khalil · ‎04-19-2023

I want to apply a pivot on a DataFrame in DLT, but I get the following warning:

Notebook:XXXX used `GroupedData.pivot` function that will be deprecated soon. Please fix the notebook.

I get the same warning if I use the collect function.

Is it risky not to fix it?

Kaniz · ‎04-19-2023

Hi @Ibrahima Fall, It's essential to address the deprecation warnings in your code, as deprecated functions might be removed in future updates of the libraries, causing your code to break.

In this case, you're being warned about using the PySpark GroupedData.pivot function in a Delta Live Tables (DLT) pipeline.

One thing to try is passing the list of pivot values to pivot explicitly, so that Spark does not have to compute them itself. Here's an example of how to do it:

Suppose you have a DataFrame df with columns "category", "type", and "value". You want to pivot the DataFrame based on the "type" column and sum the "value" column.

Instead of calling GroupedData.pivot without values like this:

result = df.groupBy("category").pivot("type").sum("value")

You can pass the pivot values as a second argument (this is still GroupedData.pivot, but Spark no longer has to scan the data to find the values):

result = df.groupBy("category").pivot("type", distinct_types).sum("value")

In the above example, distinct_types is a list of the distinct values present in the "type" column. You can obtain this list using the distinct and collect methods:

# Collect the distinct "type" values to the driver as a plain Python list
distinct_types = [row["type"] for row in df.select("type").distinct().collect()]

The collect function is generally safe to use, but it can cause issues if you use it on a large DataFrame. The collect function retrieves all the data from the DataFrame and stores it in the driver's memory. If the DataFrame is too large, it might cause the driver to run out of memory and crash the application.
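If the values of the pivot column are known ahead of time, they can simply be written out in the code, in which case no collect is needed at all. The following is a minimal sketch under that assumption, using the same df with columns "category", "type", and "value". The function name pivot_by_type and the values "A" and "B" are hypothetical, and it has not been confirmed in this thread that this form silences the DLT warning.

```python
def pivot_by_type(df, type_values):
    """Pivot `df` on "type" using a caller-supplied list of pivot values.

    Because the values are given up front, Spark does not need to run an
    extra job to discover them, and no collect() is required.
    """
    return df.groupBy("category").pivot("type", list(type_values)).sum("value")


# Hypothetical values; replace with the ones that occur in your "type" column.
KNOWN_TYPES = ["A", "B"]

# Usage, assuming `df` is a Spark DataFrame:
# result = pivot_by_type(df, KNOWN_TYPES)
```

Values in the list that never occur in the data still produce a column (filled with nulls), and values missing from the list are dropped from the result, so the list needs to be kept in sync with the data.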

If you only need a subset of the DataFrame, consider using the take or head functions instead of collect. These functions allow you to specify the number of rows you want to retrieve, avoiding potential memory issues:

# Retrieve the first 10 rows of the DataFrame
subset = df.take(10)

In summary, it's best to address deprecation warnings to ensure your code continues to work with future updates of the libraries. Additionally, be cautious when using the collect function on large DataFrames to avoid potential memory issues.
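One way to keep collect bounded when gathering pivot values is to cap the number of rows with limit before collecting, and to fail loudly if the column has more distinct values than expected. This is only a sketch: the function name distinct_values and the default cap of 1000 are hypothetical choices, not something prescribed by Spark or DLT.

```python
def distinct_values(df, column, max_values=1000):
    """Return the distinct values of `column`, refusing to collect too many.

    At most `max_values + 1` rows are brought to the driver, so a column
    with unexpectedly high cardinality cannot exhaust driver memory.
    """
    rows = df.select(column).distinct().limit(max_values + 1).collect()
    if len(rows) > max_values:
        raise ValueError(
            f"Column {column!r} has more than {max_values} distinct values"
        )
    # Each collected row has a single field: the distinct value itself.
    return [row[0] for row in rows]


# Usage, assuming `df` is a Spark DataFrame with a "type" column:
# distinct_types = distinct_values(df, "type")
```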

Khalil · ‎04-19-2023

Thank you @Kaniz Fatma for that great answer. This can be a good workaround, but the other issue I am facing is that the collect function will also be deprecated soon in Delta Live Tables.

Kaniz · ‎04-20-2023

Hi @Ibrahima Fall, I understand your concern about the deprecation of the collect function in Delta Live Tables. To address this, you can use alternative methods to achieve the same functionality.

One approach is to use the toPandas() function to convert your DataFrame into a pandas DataFrame. This way, you can bring the data to the driver without calling collect() directly.

For example:

# Assuming you have a Spark DataFrame called 'df'
# (pandas must be installed on the cluster; no explicit import is needed here)
pandas_df = df.toPandas()

# Use the pandas DataFrame to perform operations

However, it is essential to note that using toPandas() could cause memory issues if you work with large DataFrames. The entire DataFrame will be loaded into the driver node's memory.

If you only need a small subset of the data, consider using the take() or head() functions to retrieve a specific number of rows from the DataFrame, avoiding potential memory issues:

# Retrieve the first 10 rows of the DataFrame
subset = df.take(10)

If you need to work with larger DataFrames, you can perform your operations and transformations using Spark and then save the data to a Delta table or another output format like CSV or Parquet. This way, you can avoid the need to collect data locally in the driver node, making your processing more scalable and efficient.
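Keeping the work in Spark and writing the result out instead of collecting it might look like the sketch below. It assumes a regular notebook or job rather than a DLT dataset definition (there, the decorated function returns the DataFrame instead of writing it). The function name save_pivot_result and the table name in the usage comment are hypothetical.

```python
def save_pivot_result(pivoted_df, table_name):
    """Write a Spark DataFrame to a Delta table.

    The write is carried out by the executors, so the rows are not
    gathered into the driver's memory as they would be with collect().
    """
    (
        pivoted_df.write
        .format("delta")
        .mode("overwrite")  # replace the table contents on each run
        .saveAsTable(table_name)
    )


# Usage, assuming `result` is the pivoted Spark DataFrame:
# save_pivot_result(result, "pivoted_values")
```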

Khalil · ‎04-20-2023

Hi @Kaniz Fatma, I get your point: using toPandas can be risky, as you said. But since I only need to retrieve the distinct values from a specific column of the DataFrame, toPandas should be fine in this case.

I will try this approach and let you know.
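For this narrow case, the DataFrame can be reduced to the distinct values of one column before calling toPandas(), so that only the small result reaches the driver. A minimal sketch, assuming pandas is available on the cluster. The function name distinct_values_via_pandas is hypothetical, and it has not been confirmed in this thread that DLT treats toPandas() differently from collect().

```python
def distinct_values_via_pandas(df, column):
    """Return the distinct values of one column as a Python list."""
    # select + distinct run in Spark; only the de-duplicated column
    # is converted to pandas and brought to the driver.
    small_pdf = df.select(column).distinct().toPandas()
    return small_pdf[column].tolist()


# Usage, assuming `df` is a Spark DataFrame with a "type" column:
# distinct_types = distinct_values_via_pandas(df, "type")
```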

Anonymous · ‎04-23-2023

Hi @Ibrahima Fall

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!

Khalil · ‎04-26-2023

Thanks @Kaniz Fatma for your support.

The solution was to do the pivot outside of the DLT views and tables; after that, the warning disappeared.

Pivot a DataFrame in Delta Live Table DLT