cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
cancel
Showing results for 
Search instead for 
Did you mean: 

How to find the distinct count of the below listed result from the table?

SivaPK
New Contributor II

Hi,

How to get the distinct count from the below listed image,

keywords = column name

table = appCatalog

keywords (column)

"[""data"",""cis"",""mining"",""financial"",""pso"",""value""]"
"[""bzo"",""employee news"",""news""]"
"[""core.store"",""fbi""]"
"[""data"",""display"",""support""]"
"[""bTravel"",""bTravel Expense"",""non-travel-related expenses"",""travel expenses""]"

For me output should be like, 17 as result. While counting the distinct values it should result 17.

keyword name called  'data' is occurring 2 times in the result as shown above  Row 1 and Row 4.

How to achieve the result?

Thanks in advance.

3 REPLIES 3

Kaniz
Community Manager
Community Manager

Hi @SivaPK , Try this :-


# Assuming you have a DataFrame named 'df' with a column 'keywords'
from pyspark.sql.functions import explode, col

# Explode the array column 'keywords'
exploded_df = df.select(explode(col("keywords")).alias("keyword"))

# Count the distinct keywords
distinct_count = exploded_df.select("keyword").distinct().count()

print(f"Distinct keyword count: {distinct_count}")

 

If you have any further questions, feel free to ask! 🚀

 

SivaPK
New Contributor II

Hi Kaniz,

Thank you for your reply and time.

I have tried the above query in the SQL Editor. I have to display the distinct count of the keywords from the table.

datatype is string. I have uploaded the csv file as a data for the Table. How to use the previous comment code in SQL Editor in Databricks.?

Query:

select keywords, count(1) as Number_Occurences
from (
  -- This subquery transforms the hashtags array into rows
  select id
        ,keywords as Orignal_Keywords_Array
        ,explode(keywords) as keywords
  from ngportal.applicationcatalog
  order by id  
)
group by keywords
order by count(1) desc

 

Error:

[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "explode(keywords)" due to data type mismatch: Parameter 1 requires the ("ARRAY" or "MAP") type, however "keywords" has the type "STRING"

Thank you.

Kaniz
Community Manager
Community Manager

Hi @SivaPK , It seems you’re encountering an error related to the explode function in your SQL query.

 

 Let’s address this step by step.

 

Data Type Mismatch: The error message indicates that the explode function expects an “ARRAY” or “MAP” type as its parameter, but the column keywords in your table has the data type “STRING.”

 

Using explode in Databricks SQL: To use the explode function in Databricks SQL, you need to apply it to an array or map expression. Since your keywords column is of type string, you’ll need to transform it into an array before using explode.

 

Remember to adjust the delimiter (e.g., comma) in the SPLIT function based on how your keywords are separated in the original string. 

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.