<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: What's the best way to run a Databricks notebook from AWS Lambda? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26049#M18174</link>
    <description>&lt;P&gt;Basically you are looking for an event-based trigger, where the event is the arrival of a new file.&lt;/P&gt;&lt;P&gt;It's been a while since I worked on AWS, but doesn't Glue have functionality like this?&lt;/P&gt;&lt;P&gt;On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline containing a Databricks notebook starts.&lt;/P&gt;&lt;P&gt;If you want minimal latency, I suggest you use a pool with active workers, but that comes at a price, of course.&lt;/P&gt;</description>
    <pubDate>Thu, 10 Mar 2022 06:58:15 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-03-10T06:58:15Z</dc:date>
    <item>
      <title>What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26046#M18171</link>
      <description>&lt;P&gt;I have a trigger in Lambda that fires when a new file arrives in S3. I want this file to be processed straight away by a notebook that upserts all the data into a Delta table.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm looking for a solution with minimum latency.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 00:11:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26046#M18171</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-03-10T00:11:12Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26047#M18172</link>
      <description>&lt;P&gt;Hi @Aman Sehgal: I am not sure if you have already explored Databricks Autoloader and its limitations for your use case. If not, you could try using Autoloader and avoid running multiple processes.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 04:25:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26047#M18172</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2022-03-10T04:25:12Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26048#M18173</link>
      <description>&lt;P&gt;I've used Autoloader, and it works like a charm.&lt;/P&gt;&lt;P&gt;But I'm not sure what the trigger processing-time interval should be.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I was thinking of a trigger-once setup: when a file arrives in S3, the Lambda triggers the notebook to process the file.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The thing is that files could arrive at around five per minute, or maybe once every 3 hours; the frequency is not fixed.&lt;/P&gt;&lt;P&gt;But whenever a file arrives, it should be processed with minimum latency.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 06:18:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26048#M18173</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-03-10T06:18:48Z</dc:date>
    </item>
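    <!-- A minimal sketch of the trigger-once Autoloader pattern discussed above. The bucket, schema/checkpoint locations, and table name are made-up placeholders, not values from the thread. -->
```python
# Hypothetical sketch of the Autoloader setup discussed above: a trigger-once
# stream that processes whatever files have arrived, then stops.
# All paths and names below are illustrative placeholders.

CLOUDFILES_OPTIONS = {
    "cloudFiles.format": "json",                                    # format of incoming files
    "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/events",  # where schema info is tracked
    "cloudFiles.useNotifications": "true",                          # file-notification mode, no full scans
}

def start_autoloader(spark):
    """Start a trigger-once Autoloader stream; `spark` is an active SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**CLOUDFILES_OPTIONS)
        .load("s3://my-bucket/landing/events/")
        .writeStream
        .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
        .trigger(once=True)  # process all available files, then shut down
        .table("events_bronze")
    )
```
    <!-- With trigger-once, each Lambda-initiated run only pays for the files that actually arrived, which matches the irregular 5-per-minute to once-every-3-hours pattern described above. -->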
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26049#M18174</link>
      <description>&lt;P&gt;Basically you are looking for an event-based trigger, where the event is the arrival of a new file.&lt;/P&gt;&lt;P&gt;It's been a while since I worked on AWS, but doesn't Glue have functionality like this?&lt;/P&gt;&lt;P&gt;On Azure I do exactly the same thing: when a file arrives in a certain location, a data pipeline containing a Databricks notebook starts.&lt;/P&gt;&lt;P&gt;If you want minimal latency, I suggest you use a pool with active workers, but that comes at a price, of course.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 06:58:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26049#M18174</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-10T06:58:15Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to run a Databricks notebook from AWS Lambda?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26050#M18175</link>
      <description>&lt;P&gt;There are two possible solutions:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank"&gt;Autoloader/cloudFiles&lt;/A&gt;, preferably with a "&lt;B&gt;File notification&lt;/B&gt;" queue to avoid unnecessary directory scans,&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;OR&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;from Lambda, sending a POST request to &lt;A href="https://docs.databricks.com/dev-tools/api/latest/jobs.html" alt="https://docs.databricks.com/dev-tools/api/latest/jobs.html" target="_blank"&gt;/api/2.1/jobs/run-now&lt;/A&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In both solutions it is important to have a private link and access via a role (to skip manual authentication).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In the first, S3 additionally has to be mounted in Databricks. The first can also take advantage of Spark parallelism, as multiple virtual machines read and write at the same time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In the second, if there is no delay in the S3 trigger, the AWS Lambda will run quickly, but executing the notebook through the API will be slower, as it can take dozens of seconds to start the job.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I would go for Autoloader if there are many files per hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If it is one file per hour or less, I would go with a job triggered through the REST API.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 13:45:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-run-a-databricks-notebook-from-aws-lambda/m-p/26050#M18175</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-10T13:45:01Z</dc:date>
    </item>
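    <!-- A minimal sketch of the second option above: a Lambda handler that POSTs to /api/2.1/jobs/run-now when a file lands in S3. The environment variable names and the `source_path` notebook parameter are assumptions for illustration, not part of the Jobs API itself. -->
```python
# Hypothetical Lambda handler for the "POST to /api/2.1/jobs/run-now" option.
# DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_JOB_ID are assumed env vars;
# "source_path" is a made-up notebook parameter the target notebook would read.
import json
import os
import urllib.request

def build_run_now_payload(job_id, s3_path):
    """Build the run-now request body, passing the new file's path to the notebook."""
    return {
        "job_id": job_id,
        "notebook_params": {"source_path": s3_path},  # read via dbutils.widgets in the notebook
    }

def lambda_handler(event, context):
    # S3 put-event records carry the bucket name and object key of the new file.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    payload = build_run_now_payload(int(os.environ["DATABRICKS_JOB_ID"]), s3_path)
    req = urllib.request.Request(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # response body includes the run_id of the triggered run
```
    <!-- As noted above, the run-now call returns quickly but the job itself can take tens of seconds to start unless the job uses a warm instance pool. -->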
  </channel>
</rss>

