<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Running unit tests and hyperopt causes a broadcast variable exception in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/running-unit-tests-and-hyperopt-causes-a-broadcast-variable/m-p/59826#M31510</link>
    <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for the links provided.&lt;/P&gt;&lt;P&gt;To run our production workloads we use Python jobs, not notebooks. Most of the time, we develop a new job locally against mocked/test datasets, and that is when we run into the issue I described. We run integration tests against an Azure Databricks cluster much less often.&lt;/P&gt;&lt;P&gt;So my question really is: is there a way to install the Databricks fork of hyperopt - version&amp;nbsp;0.2.7&lt;STRONG&gt;+db1&lt;/STRONG&gt; - locally?&lt;/P&gt;</description>
    <pubDate>Fri, 09 Feb 2024 19:18:14 GMT</pubDate>
    <dc:creator>Boyan</dc:creator>
    <dc:date>2024-02-09T19:18:14Z</dc:date>
    <item>
      <title>Running unit tests and hyperopt causes a broadcast variable exception</title>
      <link>https://community.databricks.com/t5/data-engineering/running-unit-tests-and-hyperopt-causes-a-broadcast-variable/m-p/57430#M30777</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We are using hyperopt to train a model with a relatively large training dataset.&lt;/P&gt;&lt;P&gt;We experienced some performance issues and, following the suggestions in &lt;A href="https://docs.databricks.com/en/_extras/notebooks/source/hyperopt-spark-data.html" target="_self"&gt;this notebook&lt;/A&gt;, we broadcast the dataset.&lt;/P&gt;&lt;P&gt;To verify that broadcasting the dataset resolved the performance issue, we ran an experiment using&amp;nbsp;&lt;A href="https://docs.databricks.com/en/release-notes/runtime/11.3lts-ml.html" target="_self"&gt;Databricks Runtime for Machine Learning&lt;/A&gt; and a notebook. We did see a significant performance boost.&lt;/P&gt;&lt;P&gt;To deploy our code, we package it as a .whl file and use &lt;A href="https://docs.databricks.com/en/workflows/jobs/jobs-2.0-api.html#jobssparkpythontask" target="_self"&gt;Python jobs&lt;/A&gt; to deploy it to an &lt;A href="https://azure.microsoft.com/en-us/products/databricks" target="_self"&gt;Azure Databricks Service&lt;/A&gt;. Provided we run the job using&amp;nbsp;&lt;A href="https://docs.databricks.com/en/release-notes/runtime/11.3lts-ml.html" target="_self"&gt;Databricks Runtime for Machine Learning&lt;/A&gt;, we do not have any issues.&lt;/P&gt;&lt;P&gt;However, when we run unit tests for our jobs locally or via our CI/CD pipelines, we get the following error:&amp;nbsp;"Broadcast variable '5' not loaded!"&lt;/P&gt;&lt;P&gt;This appears to be a known bug in the hyperopt library. A &lt;A href="https://github.com/hyperopt/hyperopt/pull/856" target="_self"&gt;fix&lt;/A&gt; has been merged to master, but it has not been released.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/en/release-notes/runtime/11.3lts-ml.html" target="_self"&gt;Databricks Runtime for Machine Learning&lt;/A&gt;&amp;nbsp;ships with a Databricks fork of hyperopt - version&amp;nbsp;0.2.7&lt;STRONG&gt;+db1&lt;/STRONG&gt; - which also includes a fix.&lt;/P&gt;&lt;P&gt;Given that this fork is only available on Databricks Runtimes for Machine Learning, what is the recommended approach to running unit tests on CI/CD infrastructure or local development machines?&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 10:07:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-unit-tests-and-hyperopt-causes-a-broadcast-variable/m-p/57430#M30777</guid>
      <dc:creator>Boyan</dc:creator>
      <dc:date>2024-01-16T10:07:28Z</dc:date>
    </item>
    <item>
      <title>Re: Running unit tests and hyperopt causes a broadcast variable exception</title>
      <link>https://community.databricks.com/t5/data-engineering/running-unit-tests-and-hyperopt-causes-a-broadcast-variable/m-p/59826#M31510</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for the links provided.&lt;/P&gt;&lt;P&gt;To run our production workloads we use Python jobs, not notebooks. Most of the time, we develop a new job locally against mocked/test datasets, and that is when we run into the issue I described. We run integration tests against an Azure Databricks cluster much less often.&lt;/P&gt;&lt;P&gt;So my question really is: is there a way to install the Databricks fork of hyperopt - version&amp;nbsp;0.2.7&lt;STRONG&gt;+db1&lt;/STRONG&gt; - locally?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Feb 2024 19:18:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-unit-tests-and-hyperopt-causes-a-broadcast-variable/m-p/59826#M31510</guid>
      <dc:creator>Boyan</dc:creator>
      <dc:date>2024-02-09T19:18:14Z</dc:date>
    </item>
  </channel>
</rss>

