<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Optimal approach when using external script/executable for processing data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17722#M11685</link>
    <description>&lt;P&gt;I need to process a number of files where I manipulate the file text utilising an external executable that operates on stdin/stdout.&lt;/P&gt;&lt;P&gt;I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;exe_path = "/usr/local/bin/external-exe"
files_rdd = spark.sparkContext.parallelize(files_list)
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Is this approach using rdd.pipe the way that external code should be used in general? Do I need to use rdd.pipe, or should I use a DataFrame with transform? Looking for advice on approaches.&lt;/P&gt;</description>
    <pubDate>Tue, 14 Jun 2022 13:04:25 GMT</pubDate>
    <dc:creator>mick042</dc:creator>
    <dc:date>2022-06-14T13:04:25Z</dc:date>
    <item>
      <title>Optimal approach when using external script/executable for processing data</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17722#M11685</link>
      <description>&lt;P&gt;I need to process a number of files where I manipulate the file text utilising an external executable that operates on stdin/stdout.&lt;/P&gt;&lt;P&gt;I am quite new to Spark. What I am attempting is to use rdd.pipe, as in the following:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;exe_path = "/usr/local/bin/external-exe"
files_rdd = spark.sparkContext.parallelize(files_list)
pipe_tokenised_rdd = files_rdd.pipe(exe_path)
pipe_tokenised_rdd.collect()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Is this approach using rdd.pipe the way that external code should be used in general? Do I need to use rdd.pipe, or should I use a DataFrame with transform? Looking for advice on approaches.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Jun 2022 13:04:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17722#M11685</guid>
      <dc:creator>mick042</dc:creator>
      <dc:date>2022-06-14T13:04:25Z</dc:date>
    </item>
    <item>
      <title>Re: Optimal approach when using external script/executable for processing data</title>
      <link>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17723#M11686</link>
      <description>&lt;P&gt;Hi @Michael Lennon&amp;nbsp;&amp;ndash; can you please elaborate on the use case: what does the external app at exe_path do?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 15:21:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimal-approach-when-using-external-script-executable-for/m-p/17723#M11686</guid>
      <dc:creator>User16753725469</dc:creator>
      <dc:date>2022-09-09T15:21:26Z</dc:date>
    </item>
  </channel>
</rss>

