Optimal approach when using external script/executable for processing data

mick042
New Contributor III

I need to process a number of files, manipulating the file text with an external executable that operates on stdin/stdout.

I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:

exe_path = "/usr/local/bin/external-exe"
files_rdd = spark.sparkContext.parallelize(files_list)

# pipe() feeds each RDD element as a line on the executable's stdin
# and returns each line of its stdout as an element of the new RDD
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()

Is rdd.pipe the way external code should be used in general, or should I use a DataFrame with a transform instead? Looking for advice on approaches.
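For concreteness, the other approach I am weighing is calling the executable from mapPartitions, launching one process per partition instead of relying on pipe's one-line-per-element protocol. This is a rough, untested sketch; it assumes the executable reads one record per stdin line and writes one tokenised line per record:

import subprocess

def tokenise_partition(partition):
    # One external process per partition; feed every record to its
    # stdin as a newline-delimited batch (assumed protocol)
    proc = subprocess.Popen(
        ["/usr/local/bin/external-exe"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate("\n".join(partition) + "\n")
    # Assumes one output line per input record; buffers the whole
    # partition's output in memory, fine for a modest file list
    return out.splitlines()

pipe_tokenised_rdd = files_rdd.mapPartitions(tokenise_partition)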

1 REPLY

User16753725469
Contributor II

Hi @Michael Lennon, can you elaborate on the use case: what does the external app at exe_path do?
