Data Engineering

Forum Posts

Anonymous
by Not applicable
  • 1125 Views
  • 4 replies
  • 0 kudos

Objective is to make table unique at ID using group by, concat_ws and collect_list, combining distinct values in one row.

Objective is to make the table unique at ID. The table structure is as in the attached image. Query used is: select ID, concat_ws(' & ', collect_list(Distinct Gender)) as Gender from table group by ID. It can be possible if we can order values within collect_list and...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Rishabh Shanker, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

3 More Replies
Harun
by Honored Contributor
  • 4340 Views
  • 2 replies
  • 0 kudos

Issue with Pyspark GroupBy GroupedData

Hi guys, I am working on streaming data movement from bronze to silver. My bronze table has an entity_name column, and based on that column I need to create multiple silver tables. I tried the approach below, but it is failing with error "'...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Harun Raseed Basheer, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ...

1 More Replies
tusworten
by New Contributor II
  • 4066 Views
  • 5 replies
  • 4 kudos

Spark SQL Group by duplicates, collect_list in array of structs and evaluate rows in each group.

I'm a beginner working with Spark SQL in the Java API. I have a dataset with duplicate clients grouped by ENTITY and DOCUMENT_ID like this: .withColumn("ROWNUMBER", row_number().over(Window.partitionBy("ENTITY", "ENTITY_DOC").orderBy("ID"))) I added a ROWN...

Latest Reply
tusworten
New Contributor II
  • 4 kudos

Hi @Kaniz Fatma, her answer didn't solve my problem, but it was useful for learning more about UDFs, which I did not know about.

4 More Replies
SindhuG
by New Contributor
  • 621 Views
  • 1 reply
  • 0 kudos

Hi all, I need to extract rows of dates from a dataframe based on a list of values (e.g. dates) located in a CSV file. Can anyone please help me? I have tried the groupby function but am not able to get the expected result. Thanks in advance.

My dataframe looks like this. df = Date, column2, column3, Machine: 1-jan-2020 A; 2-jan-2020 --- A; 18-jan-2020 A; 11-jan-2020 B; 12-jan-2020 B; 6-feb-2020 C; 7-feb-2020 --- C; 14-feb-2020 C. The date details CSV file looks like this. D = Machine, Selected Date: A 15-jan-2020; C 12-f...

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @SindhuG! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your question first. Otherwise, I will follow up shortly with a response.

LaurentThiebaud
by New Contributor
  • 4865 Views
  • 1 reply
  • 0 kudos

Sort within a groupBy with dataframe

Using a Spark DataFrame, e.g. myDf.filter(col("timestamp").gt(15000)).groupBy("groupingKey").agg(collect_list("aDoubleValue")), I want collect_list to return the result ordered according to "timestamp", i.e. I want the GroupBy results...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Laurent Thiebaud, please use the format below to sort within a groupBy: import org.apache.spark.sql.functions._ and then df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
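
A note on the snippet in this reply: sort_array orders the collected values themselves, while the question asks for the values ordered by "timestamp". The two orderings can differ. Below is a minimal plain-Python sketch of that difference (no Spark involved; the rows and both helper functions are hypothetical and only for illustration). It collects (timestamp, value) pairs per group, sorts the pairs by timestamp, and then drops the timestamp. In Spark, the same idea is commonly expressed by collecting a struct of (timestamp, value) and sorting that array, but treat this as an assumption to verify against your Spark version.

```python
from collections import defaultdict

# Hypothetical rows: (groupingKey, timestamp, aDoubleValue)
rows = [
    ("a", 30000, 1.5),
    ("a", 20000, 9.0),
    ("b", 16000, 4.0),
    ("a", 10000, 7.0),  # removed by the timestamp > 15000 filter
    ("b", 40000, 2.0),
]


def collect_sorted_by_timestamp(rows, min_timestamp=15000):
    """Group values by key, ordered by their timestamp within each group."""
    groups = defaultdict(list)
    for key, ts, value in rows:
        if ts > min_timestamp:
            # Keep the timestamp next to the value so it can drive the sort.
            groups[key].append((ts, value))
    return {
        key: [value for _, value in sorted(pairs, key=lambda p: p[0])]
        for key, pairs in groups.items()
    }


def collect_sorted_by_value(rows, min_timestamp=15000):
    """Group values by key, ordered by the values themselves."""
    groups = defaultdict(list)
    for key, ts, value in rows:
        if ts > min_timestamp:
            groups[key].append(value)
    return {key: sorted(values) for key, values in groups.items()}


by_timestamp = collect_sorted_by_timestamp(rows)
by_value = collect_sorted_by_value(rows)
print(by_timestamp)
print(by_value)
```

For group "a" the two results differ: timestamp order gives 9.0 before 1.5, while value order gives 1.5 before 9.0.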
