<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Optimal approach when using external script/executable for processing data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17722#M11685</link>
    <description>&lt;P&gt;I need to process a number of files where I manipulate the file text utilising an external executable that operates on stdin/stdout.&lt;/P&gt;&lt;P&gt;I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;exe_path = "/usr/local/bin/external-exe"
files_rdd = spark.sparkContext.parallelize(files_list)
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Is this approach using rdd.pipe the way that external code should be used in general? Do I need to use rdd.pipe, or should I use a DataFrame with transform? Looking for advice on approaches.&lt;/P&gt;</description>
    <pubDate>Tue, 14 Jun 2022 13:04:25 GMT</pubDate>
    <dc:creator>mick042</dc:creator>
    <dc:date>2022-06-14T13:04:25Z</dc:date>
    <item>
      <title>Optimal approach when using external script/executable for processing data</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17722#M11685</link>
      <description>&lt;P&gt;I need to process a number of files where I manipulate the file text utilising an external executable that operates on stdin/stdout.&lt;/P&gt;&lt;P&gt;I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;exe_path = "/usr/local/bin/external-exe"
files_rdd = spark.sparkContext.parallelize(files_list)
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Is this approach using rdd.pipe the way that external code should be used in general? Do I need to use rdd.pipe, or should I use a DataFrame with transform? Looking for advice on approaches.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Jun 2022 13:04:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17722#M11685</guid>
      <dc:creator>mick042</dc:creator>
      <dc:date>2022-06-14T13:04:25Z</dc:date>
    </item>
    <item>
      <title>Re: Optimal approach when using external script/executable for processing data</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17723#M11686</link>
      <description>&lt;P&gt;Hi @Michael Lennon&amp;nbsp;&amp;ndash; can you please elaborate on the use case: what does the external app at exe_path do?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 15:21:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17723#M11686</guid>
      <dc:creator>User16753725469</dc:creator>
      <dc:date>2022-09-09T15:21:26Z</dc:date>
    </item>
  </channel>
</rss>

