I need to process a number of files, manipulating their text with an external executable that operates on stdin/stdout.
I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:
exe_path = "/usr/local/bin/external-exe"
# files_list is a list of input file paths
files_rdd = spark.sparkContext.parallelize(files_list)
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()
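For reference, this is a minimal self-contained version of what I am attempting, using /bin/cat as a stand-in for the real external executable (the file paths are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-example").getOrCreate()

# Stand-in for the real external executable
exe_path = "/bin/cat"

# Hypothetical list of input file paths
files_list = ["/data/file1.txt", "/data/file2.txt"]
files_rdd = spark.sparkContext.parallelize(files_list)

# Each RDD element is written to the command's stdin as one line;
# each line the command writes to stdout becomes an element of the result
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
print(pipe_tokenised_rdd.collect())

As I understand it, pipe sends the RDD elements themselves (here the path strings) to the command's stdin, so the external tool would have to open the files itself unless I load the file contents into the RDD first.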
Is rdd.pipe the way external code should generally be called from Spark? Do I need to use rdd.pipe, or should I use a DataFrame-based approach instead? Looking for advice on which approach to take.
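In case it helps clarify what I mean by the DataFrame alternative, this is roughly what I have in mind (only a sketch of my understanding, using mapInPandas and subprocess rather than DataFrame.transform; the column names and sample rows are made up):

import subprocess
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-alternative").getOrCreate()

exe_path = "/usr/local/bin/external-exe"

def run_external(batches):
    # For each batch, feed the "text" column through the executable's stdin
    # and turn its stdout lines into output rows
    for pdf in batches:
        result = subprocess.run(
            [exe_path],
            input="\n".join(pdf["text"].tolist()),
            capture_output=True,
            text=True,
            check=True,
        )
        yield pd.DataFrame({"tokenised": result.stdout.splitlines()})

df = spark.createDataFrame([("some text",), ("more text",)], ["text"])
df.mapInPandas(run_external, schema="tokenised string").show()

Is one of these patterns preferred over the other, or is there a better way to call an external stdin/stdout tool from Spark?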