Pivot a DataFrame in Delta Live Table DLT

Khalil
New Contributor III

I want to apply a pivot on a DataFrame in DLT, but I'm getting the following warning:

Notebook:XXXX used `GroupedData.pivot` function that will be deprecated soon. Please fix the notebook.

I get the same warning if I use the collect function.

Is it risky not to correct it?

1 ACCEPTED SOLUTION

Accepted Solutions

Khalil
New Contributor III

Thanks @Kaniz Fatma​  for your support.

The solution was to do the pivot outside of views or tables and the warning disappeared.
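A rough sketch of what that structure can look like (hypothetical dataset and column names; this is a non-runnable pipeline fragment, since `dlt` and `spark` only exist inside a Databricks DLT pipeline):

```
import dlt

# Compute the pivot once at pipeline setup time, at module level,
# outside any @dlt.table / @dlt.view function body.
source_df = spark.read.table("source_table")  # hypothetical source
pivoted_df = source_df.groupBy("category").pivot("type").sum("value")

@dlt.table(name="pivoted_values")  # hypothetical table name
def pivoted_values():
    # The decorated function only returns the precomputed result,
    # so DLT's analysis of the table body never sees GroupedData.pivot.
    return pivoted_df
```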



Kaniz
Community Manager

Hi @Ibrahima Fall​, It's essential to address the deprecation warnings in your code, as deprecated functions might be removed in future updates of the libraries, causing your code to break.

In this case, the warning is about using the GroupedData.pivot function in Delta Live Tables.

To address the deprecation warning, you can pass the pivot values explicitly rather than letting Spark compute them implicitly. Here's an example of how to do it:

Suppose you have a DataFrame df with columns "category", "type", and "value". You want to pivot the DataFrame based on the "type" column and sum the "value" column.

Instead of using GroupedData.pivot like this:

result = df.groupBy("category").pivot("type").sum("value")

You can pass the list of distinct values explicitly as the second argument to pivot:

result = df.groupBy("category").pivot("type", distinct_types).sum("value")

In the above example, distinct_types is a list of the distinct values present in the "type" column. You can obtain this list using the distinct and collect methods:

distinct_types = [row["type"] for row in df.select("type").distinct().collect()]
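To make the mechanics concrete, here is a plain-Python sketch of what groupBy/pivot/sum computes over a tiny in-memory dataset (the sample rows and column roles are illustrative, not from the thread):

```python
# Tiny in-memory stand-in for a (category, type, value) DataFrame.
rows = [
    ("fruit", "A", 1),
    ("fruit", "B", 2),
    ("veg",   "A", 3),
    ("fruit", "A", 4),
]

# Step 1: the explicit pivot values, analogous to
# df.select("type").distinct().collect().
distinct_types = sorted({t for _, t, _ in rows})  # ['A', 'B']

# Step 2: group by category and sum value per type, analogous to
# df.groupBy("category").pivot("type", distinct_types).sum("value").
# (Spark would show null rather than 0 for missing cells.)
pivoted = {}
for category, typ, value in rows:
    cell = pivoted.setdefault(category, {t: 0 for t in distinct_types})
    cell[typ] += value

print(distinct_types)  # ['A', 'B']
print(pivoted)         # {'fruit': {'A': 5, 'B': 2}, 'veg': {'A': 3, 'B': 0}}
```

Passing the values explicitly also lets Spark skip an extra pass over the data to discover them.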

The collect function is generally safe to use, but it can cause issues if you use it on a large DataFrame. The collect function retrieves all the data from the DataFrame and stores it in the driver's memory. If the DataFrame is too large, it might cause the driver to run out of memory and crash the application.

If you only need a subset of the DataFrame, consider using the take or head functions instead of collect. These functions let you specify the number of rows to retrieve, avoiding potential memory issues:

# Retrieve the first 10 rows of the DataFrame
subset = df.take(10)
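The memory argument can be illustrated in plain Python: take(n) is like slicing the first n items from a lazy source, while collect() materializes everything on the driver (the names below are illustrative stand-ins, not Spark APIs):

```python
from itertools import islice

# A lazy source standing in for a large distributed DataFrame.
def big_source():
    for i in range(10_000_000):
        yield i

# 'take(10)'-style: only 10 items are ever materialized locally.
subset = list(islice(big_source(), 10))
print(subset)  # the first 10 values, 0 through 9

# 'collect()'-style would be list(big_source()):
# the entire dataset held in memory at once.
```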

In summary, it's best to address deprecation warnings to ensure your code continues to work with future versions of the libraries. Additionally, be cautious when using the collect function on large DataFrames to avoid potential memory issues.

Khalil
New Contributor III

Thank you @Kaniz Fatma​ for that great answer. This can be a good workaround, but the other issue I am facing is that the collect function will also be deprecated soon in Delta Live Tables.

Kaniz
Community Manager

Hi @Ibrahima Fall​, I understand your concern about the deprecation of the collect function in Delta Live Tables. To address this, you can use alternative methods to achieve the same functionality.

One approach is to utilize the toPandas() function to convert your DataFrame into a Pandas DataFrame. This way, you can collect the data locally without using the collect() function.

For example:

import pandas as pd

# Assuming you have a DataFrame called 'df'
pandas_df = df.toPandas()

# Use the Pandas DataFrame to perform operations

However, it is essential to note that using toPandas() could cause memory issues if you work with large DataFrames. The entire DataFrame will be loaded into the driver node's memory.
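Since the end goal in this thread is just the distinct values of one column, the pandas side of that conversion is a one-liner (assuming pandas is installed; the sample data and the "type" column name are illustrative):

```python
import pandas as pd

# Stand-in for the result of df.select("type").distinct().toPandas()
# on a small, already-projected DataFrame.
pandas_df = pd.DataFrame({"type": ["A", "B", "A", "C", "B"]})

# Distinct values of the column, in order of first appearance.
distinct_types = pandas_df["type"].unique().tolist()
print(distinct_types)  # ['A', 'B', 'C']
```

Projecting down to the single column with select("type").distinct() before calling toPandas() keeps the transferred data small, which sidesteps the driver-memory concern above.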

If you only need a small subset of the data, consider using the take() or head() functions to retrieve a specific number of rows from the DataFrame, avoiding potential memory issues:

# Retrieve the first 10 rows of the DataFrame
subset = df.take(10)

If you need to work with larger DataFrames, you can perform your operations and transformations using Spark and then save the data to a Delta table or another output format like CSV or Parquet. This way, you can avoid the need to collect data locally in the driver node, making your processing more scalable and efficient.

Khalil
New Contributor III

Hi @Kaniz Fatma​, I get your point; as you said, toPandas might be risky, but since I just need to retrieve the distinct values from a specific column of the DataFrame, toPandas should work in this case.

I will try this approach and let you know.
