cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to find the distinct count of the below listed result from the table?

SivaPK
New Contributor II

Hi,

How to get the distinct count from the below listed image,

keywords = column name

table = appCatalog

keywords (column)

"[""data"",""cis"",""mining"",""financial"",""pso"",""value""]"
"[""bzo"",""employee news"",""news""]"
"[""core.store"",""fbi""]"
"[""data"",""display"",""support""]"
"[""bTravel"",""bTravel Expense"",""non-travel-related expenses"",""travel expenses""]"

For me output should be like, 17 as result. While counting the distinct values it should result 17.

keyword name called  'data' is occurring 2 times in the result as shown above  Row 1 and Row 4.

How to achieve the result?

Thanks in advance.

3 REPLIES 3

Kaniz_Fatma
Community Manager
Community Manager

Hi @SivaPK , Try this :-


# Assuming you have a DataFrame named 'df' with a column 'keywords'
from pyspark.sql.functions import explode, col

# Explode the array column 'keywords'
exploded_df = df.select(explode(col("keywords")).alias("keyword"))

# Count the distinct keywords
distinct_count = exploded_df.select("keyword").distinct().count()

print(f"Distinct keyword count: {distinct_count}")

 

If you have any further questions, feel free to ask! 🚀

 

SivaPK
New Contributor II

Hi Kaniz,

Thank you for your reply and time.

I have tried the above query in the SQL Editor. I have to display the distinct count of the keywords from the table.

datatype is string. I have uploaded the csv file as a data for the Table. How to use the previous comment code in SQL Editor in Databricks.?

Query:

select keywords, count(1) as Number_Occurences
from (
  -- This subquery transforms the hashtags array into rows
  select id
        ,keywords as Orignal_Keywords_Array
        ,explode(keywords) as keywords
  from ngportal.applicationcatalog
  order by id  
)
group by keywords
order by count(1) desc

 

Error:

[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "explode(keywords)" due to data type mismatch: Parameter 1 requires the ("ARRAY" or "MAP") type, however "keywords" has the type "STRING"

Thank you.

Hi @SivaPK , It seems you’re encountering an error related to the explode function in your SQL query.

 

 Let’s address this step by step.

 

Data Type Mismatch: The error message indicates that the explode function expects an “ARRAY” or “MAP” type as its parameter, but the column keywords in your table has the data type “STRING.”

 

Using explode in Databricks SQL: To use the explode function in Databricks SQL, you need to apply it to an array or map expression. Since your keywords column is of type string, you’ll need to transform it into an array before using explode.

 

Remember to adjust the delimiter (e.g., comma) in the SPLIT function based on how your keywords are separated in the original string. 

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!