cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Upset Plot in Databricks

PJ11
New Contributor

I am trying to create an Upset Plot using following code, but my output is not as expected. See Image1: Output which I am getting vs Image2: Output expected. Where the total count of overlap is displayed at the top of each bar, Bar size is proportionate with the count. it's in descending order. Please help me with this. Thank you.

The code is 

import pandas as pd

pip install upsetplot

from upsetplot import UpSet

import matplotlib.pyplot as plt

df_spark=spark.table("hive_metastore.sndbx_cx.q3_data")


df_spark.show(5)

df=df_spark.toPandas()

df=df.drop(columns=['member_id'])

df_counts=df_multi.value_counts()

df=df.astype(bool)

from upsetplot import UpSet

df_multiindex=df.groupby(list(df.columns)).size()

fig=plt.figure(figsize=(15,10),dpi=150)

upset=UpSet(df_multiindex,subset_size='count',show_counts=True,min_subset_size=1,max_subset_size=15)
upset.plot()

plt.show()

1 REPLY 1

filipniziol
Contributor III

Hi @PJ11 ,

As per documentation

UpSetPlot internally works with data based on Pandas data structures: a Series when all you care about is counts, or a DataFrame when youโ€™re interested in visualising additional properties of the data, such as with the UpSet.add_catplot method.

UpSetPlot expects the Series or DataFrame to have a MultiIndex as input, with this index being an indicator matrix. Specifically, each category is a level in the pandas.MultiIndex with boolean values.

Make sure to convert your your spark dataframes to pandas dataframes.

df_pandas = df.toPandas()


Please check below end-to-end example, that includes:

1. Initial data preparation (to simulate some data you can read from a table)

# Import necessary libraries
import pandas as pd
from upsetplot import UpSet
import matplotlib.pyplot as plt

from pyspark.sql.functions import explode, array, lit, col
from pyspark.sql.types import BooleanType

# Simulate data for the Spark DataFrame
data = [
    (1, False, False, False, False),
    (2, False, True, False, False),
    (3, False, True, True, False),
    (4, False, False, True, False),
    (5, True, False, False, False),
    (6, True, True, False, False),
    (7, False, False, False, True),
    (8, True, False, True, False),
    (9, True, True, True, False),
    (10, False, True, False, True),
    (11, False, False, True, True),
    (12, False, True, True, True),
    (13, True, False, False, True),
    (14, True, True, False, True),
    (15, True, True, True, True),
    (16, True, False, True, True),
]

counts = [169, 159, 110, 108, 84, 69, 49, 46, 43, 35, 32, 31, 22, 20, 14, 9]

# Create a Spark DataFrame
columns = ['member_id', 'Feature_A', 'Feature_B', 'Feature_C', 'Feature_D']
df_spark = spark.createDataFrame([dict(zip(columns, row)) for row in data])

# Create a list of counts as an array column
df_with_counts = df_spark.withColumn('count', lit(0))  # Placeholder column

for i, count in enumerate(counts):
    df_with_counts = df_with_counts.union(
        df_spark.filter(col('member_id') == data[i][0])
        .withColumn('count', array([lit(1)] * count))
        .withColumn('count', explode(col('count')))
    )

# Remove the placeholder rows
df_with_counts = df_with_counts.filter(col('count') != 0)

# Drop 'member_id' column
df_transformed = df_with_counts.drop('member_id', 'count')

for col_name in ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D']:
    df_transformed = df_transformed.withColumn(col_name, col(col_name).cast(BooleanType()))

display(df_transformed)

2. Conversions to pandas objects and then creation of the upset plot

# Convert the Spark DataFrame to a pandas DataFrame
df_pandas = df_transformed.toPandas()

# Group by all feature columns and count the occurrences
df_counts = df_pandas.groupby(['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D']).size().reset_index(name='count')

# Create a MultiIndex from the feature columns
index = pd.MultiIndex.from_frame(df_counts[['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D']])

# Create a Series with counts, using the MultiIndex
data_counts = pd.Series(df_counts['count'].values, index=index)

# Display the Series (optional)
print("Series with MultiIndex:")
display(data_counts)

upset = UpSet(
    data_counts,
    subset_size='auto',          # Use the counts provided in the Series
    show_counts='%d',            # Display counts on top of bars
    sort_by='cardinality',       # Sort subsets by counts
    sort_categories_by='-input',
    facecolor='blue',            # Set bar color to blue
)

# Generate the plot
fig = plt.figure(figsize=(12, 8))
upset.plot(fig=fig)

# Display the plot
plt.show()

Final output:

filipniziol_0-1729325359814.png

 

 

 

 

 

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group