Parse_Syntax_Error Help

deltax_07
New Contributor

I'm getting this error:

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near ','.(line 1, pos 18)

== SQL ==
sum(mp4) AS Videos, sum(csv+xlsx) AS Sheets, sum(docx+txt+pdf) AS Documents, sum(zip+html+pptx) AS Others, sum(gif+jgp+png) AS Images
------------------^^^

for my code in Scala:

// Define field mappings and aggregations
val fieldMappings = Map(
  "Documents" -> Seq("docx", "txt", "pdf"),
  "Sheets" -> Seq("csv", "xlsx"),
  "Images" -> Seq("gif", "jgp", "png"),
  "Videos" -> Seq("mp4"),
  "Others" -> Seq("zip", "html", "pptx")
)

// Define aggregation pipeline
val aggregatedDF = inputDF.groupBy("u_id").agg(
  expr(
    fieldMappings.map { case (category, fields) =>
      sum(fields.mkString("+")).alias(category)
    }.mkString(", ")
  )
)

Kaniz
Community Manager

Hi @deltax_07, the error you're encountering is a syntax problem in the Spark SQL expression your code generates. Let's break it down and address the issue.

The problematic part of your query is this section:

sum(mp4) AS Videos, sum(csv+xlsx) AS Sheets, sum(docx+txt+pdf) AS Documents, sum(zip+html+pptx) AS Others, sum(gif+jgp+png) AS Images

The issue lies in how this string is built and parsed. sum(fields.mkString("+")) turns each field list into a single string such as "docx+txt+pdf", and wrapping the comma-joined result in expr(...) asks Spark to parse the whole list as one SQL expression. Since expr parses exactly one expression, it fails at the first comma (line 1, pos 18). Instead, you need to build one Column per category and pass them to agg individually.
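You can reproduce the failure in isolation with a minimal sketch (the column names here are just the ones from your query):

import org.apache.spark.sql.functions.expr

// expr parses one SQL expression, so a single aggregation is fine:
val single = expr("sum(mp4) AS Videos")

// A comma-separated list is not a single expression, so this throws
// PARSE_SYNTAX_ERROR at the first comma, just like your stack trace:
val multiple = expr("sum(mp4) AS Videos, sum(csv+xlsx) AS Sheets")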

Here’s how you can modify your aggregation pipeline to correctly calculate the sums for each category:

import org.apache.spark.sql.functions.col

// Define field mappings and aggregations
val fieldMappings = Map(
  "Documents" -> Seq("docx", "txt", "pdf"),
  "Sheets" -> Seq("csv", "xlsx"),
  "Images" -> Seq("gif", "jgp", "png"),
  "Videos" -> Seq("mp4"),
  "Others" -> Seq("zip", "html", "pptx")
)

// Build one Column per category: add the per-extension columns
// together and alias the result with the category name
val aggExprs = fieldMappings.map { case (category, fields) =>
  fields.map(col).reduce(_ + _).alias(category)
}.toSeq

// agg expects a first Column followed by varargs, so split head and tail
val aggregatedDF = inputDF.groupBy("u_id").agg(aggExprs.head, aggExprs.tail: _*)

In this modified code:

  • We use col(f) to reference each column (file extension) individually.
  • We use reduce(_ + _) to add up the columns within each category.
  • We collect the aliased expressions into a Seq and pass them as aggExprs.head plus aggExprs.tail: _*, because agg takes a first Column followed by varargs rather than a single Seq.
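For a quick end-to-end check, here's a sketch with made-up counts (it assumes an active SparkSession named spark, and that inputDF is defined before running the aggregation above):

import spark.implicits._  // assumes an active SparkSession named spark

// Hypothetical sample data: one count column per file extension
val inputDF = Seq(
  ("u1", 2, 1, 0, 3, 1, 0, 0, 1, 4, 0, 1, 0),
  ("u2", 0, 0, 5, 1, 0, 2, 1, 0, 0, 3, 0, 2)
).toDF("u_id", "docx", "txt", "pdf", "csv", "xlsx",
  "gif", "jgp", "png", "mp4", "zip", "html", "pptx")

// After running the groupBy/agg above, this prints one row per u_id
// with the five category totals. The output column order follows the
// Map's iteration order, which is not guaranteed for a Map of this size.
aggregatedDF.show()

If you need a deterministic column order, use a scala.collection.immutable.ListMap (which preserves insertion order) or a Seq of (category, fields) pairs instead of a plain Map.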

This should resolve the syntax error you’re encountering. Give it a try, and let me know if you need further assistance! 😊

 