The problematic part of your query is this section:
sum(mp4) AS Videos, sum(csv+xlsx) AS Sheets, sum(docx+txt+pdf) AS Documents, sum(zip+html+pptx) AS Others, sum(gif+jgp+png) AS Images
The issue lies in the way you’re trying to sum up the file extensions. In Spark SQL, you cannot directly sum strings like mp4 or csv+xlsx; each of those needs to be turned into a proper Column expression instead.
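To make the failure concrete, here is a minimal sketch for one category (an assumption on my part: that the expressions go through the DataFrame API and that csv and xlsx are numeric count columns, as your excerpt suggests):

import org.apache.spark.sql.functions.{col, sum}

// Passed as a plain string, "csv+xlsx" is treated as the name of a single
// column, which does not exist, so the query fails to resolve:
// inputDF.groupBy("u_id").agg(sum("csv+xlsx").alias("Sheets"))

// Built as a Column expression, the same idea works: col(...) returns Column
// objects, and + on them produces the intended arithmetic.
inputDF.groupBy("u_id").agg(sum(col("csv") + col("xlsx")).alias("Sheets"))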
Here’s how you can modify your aggregation pipeline to correctly calculate the sums for each category:
import org.apache.spark.sql.functions.{col, sum}

// Map each output category to the file-extension columns that feed it
val fieldMappings = Map(
  "Documents" -> Seq("docx", "txt", "pdf"),
  "Sheets" -> Seq("csv", "xlsx"),
  "Images" -> Seq("gif", "jgp", "png"),
  "Videos" -> Seq("mp4"),
  "Others" -> Seq("zip", "html", "pptx")
)

// Build one aggregate expression per category: sum(col1 + col2 + ...) AS category
val aggExprs = fieldMappings.map { case (category, fields) =>
  sum(fields.map(f => col(f)).reduce(_ + _)).alias(category)
}.toSeq

// agg expects a first Column followed by varargs, so split off the head
val aggregatedDF = inputDF.groupBy("u_id").agg(aggExprs.head, aggExprs.tail: _*)
In this modified code:
- We use col(f) to reference each column (file extension) individually, and reduce(_ + _) to add those columns together within each category (for example, csv + xlsx).
- We wrap each combined expression in sum(...) so the per-row totals are aggregated across the group, matching the sum(csv + xlsx) intent of your original query.
- The .toSeq plus agg(aggExprs.head, aggExprs.tail: _*) pattern passes the resulting expressions as separate arguments, because agg expects a first Column followed by a varargs list.
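As a quick sanity check, here is a hypothetical end-to-end run with made-up data and only a subset of the extension columns (it assumes a spark-shell style session where spark.implicits._ is in scope):

import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

// Made-up sample data: one row per upload, with per-extension counts
val demoDF = Seq(
  ("u1", 1, 2, 0),
  ("u1", 0, 1, 3),
  ("u2", 4, 0, 1)
).toDF("u_id", "csv", "xlsx", "mp4")

val demoMappings = Map("Sheets" -> Seq("csv", "xlsx"), "Videos" -> Seq("mp4"))
val demoExprs = demoMappings.map { case (category, fields) =>
  sum(fields.map(f => col(f)).reduce(_ + _)).alias(category)
}.toSeq

demoDF.groupBy("u_id").agg(demoExprs.head, demoExprs.tail: _*).show()
// +----+------+------+
// |u_id|Sheets|Videos|
// +----+------+------+
// |  u1|     4|     3|
// |  u2|     4|     1|
// +----+------+------+
// (row order may vary)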
This should resolve the syntax error you’re encountering. Give it a try, and let me know if you need further assistance! 😊