I need to process a number of files, manipulating their text with an external executable that operates on stdin/stdout.
I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:
exe_path = "/usr/local/bin/external-exe"
# files_list is a list of input file paths
files_rdd = spark.sparkContext.parallelize(files_list)
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()
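For reference, this is a minimal self-contained version of what I am attempting, using /bin/cat as a stand-in for the real external executable (the file paths are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-example").getOrCreate()

# Stand-in for the real external executable
exe_path = "/bin/cat"

# Hypothetical list of input file paths
files_list = ["/data/file1.txt", "/data/file2.txt"]
files_rdd = spark.sparkContext.parallelize(files_list)

# Each RDD element is written to the command's stdin as one line;
# each line the command writes to stdout becomes an element of the result
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
print(pipe_tokenised_rdd.collect())

As I understand it, pipe sends the RDD elements themselves (here the path strings) to the command's stdin, so the external tool would have to open the files itself unless I load the file contents into the RDD first.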
Is rdd.pipe the way external code should generally be called from Spark? Do I need to use rdd.pipe, or should I use a DataFrame-based approach instead? Looking for advice on which approach to take.
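In case it helps clarify what I mean by the DataFrame alternative, this is roughly what I have in mind (only a sketch of my understanding, using mapInPandas and subprocess rather than DataFrame.transform; the column names and sample rows are made up):

import subprocess
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-alternative").getOrCreate()

exe_path = "/usr/local/bin/external-exe"

def run_external(batches):
    # For each batch, feed the "text" column through the executable's stdin
    # and turn its stdout lines into output rows
    for pdf in batches:
        result = subprocess.run(
            [exe_path],
            input="\n".join(pdf["text"].tolist()),
            capture_output=True,
            text=True,
            check=True,
        )
        yield pd.DataFrame({"tokenised": result.stdout.splitlines()})

df = spark.createDataFrame([("some text",), ("more text",)], ["text"])
df.mapInPandas(run_external, schema="tokenised string").show()

Is one of these patterns preferred over the other, or is there a better way to call an external stdin/stdout tool from Spark?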