Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Objective is to make table unique at ID. Table structure is as in attached image.Query used is : selectID,concat_ws(' & ' , collect_list(Distinct Gender)) as Genderfrom tablegroup by IDIt can be possible if we can order values within collect_list and...
Hi @Rishabh Shanker Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...
Hi All, I am trying to understand the internals shuffle hash join. I want to check if my understanding of it is correct. Let’s say I have two tables t1 and t2 joined on column country (8 distinct values). If I set the number of shuffle partitions as ...
@Vinay Emmadi : In Spark, a hash shuffle join is a type of join that is used when joining two data sets on a common key. The data is first partitioned based on the join key, and then each partition is shuffled and sent to a node in the cluster. The ...
Hi, I am new to DB SQL. I have a table where the array column (cities) contains multiple arrays and some have multiple duplicate values. I need to unpack the array values into rows so I can list the distinct values. The following query works for this...
Hi All, I have been having an issue identifying how to do a uniqueness check for the quality check. Below is an example. @dlt.expect("origin_not_dup", "origin is distinct from origin")
def harmonized_data():
df=dlt.read("raw_data")
for col in...
Show all distinct values per column in dataframeProblem Statement:I want to see all the distinct values per column for my entire table, but a SQL query with a collect_set() on every column is not dynamic and too long to write.Use this code to show th...