<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Custom sentence transformer for indexing in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/custom-sentence-transformer-for-indexing/m-p/138151#M1346</link>
    <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;To use a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;custom MLflow pyfunc model&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for sentence-transformers&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;with preprocessing&lt;/STRONG&gt;, you need to comply with the expected interface of&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;mlflow.pyfunc.PythonModel&lt;/CODE&gt;, especially the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;method. The method signature, data handling, and serialization are key points. Below is a direct answer with practical explanation and guidelines.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Required Methods for mlflow.pyfunc.PythonModel&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;only method you must implement is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict(self, context, model_input)&lt;/CODE&gt;&lt;/STRONG&gt;.&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;CODE&gt;context&lt;/CODE&gt;: MLflow-provided info (artifacts, configs, etc.).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;CODE&gt;model_input&lt;/CODE&gt;: The input passed during inference (usually a pandas DataFrame, a NumPy array, or native Python types).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Guidelines and Typical Pattern&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Load everything needed in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;load_context&lt;/CODE&gt;, which runs once when the model is loaded by MLflow.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Accept both batch (DataFrame/array) and single-input cases in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The output of&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;should be directly serializable (ideally array-like or DataFrame).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Example Template&lt;/H2&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; mlflow&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;pyfunc
&lt;SPAN class="token token"&gt;from&lt;/SPAN&gt; sentence_transformers &lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; SentenceTransformer
&lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class="token token"&gt;as&lt;/SPAN&gt; pd

&lt;SPAN class="token token"&gt;class&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;CustomSentenceTransformerModel&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;mlflow&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;pyfunc&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;PythonModel&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;load_context&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; context&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;model &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; SentenceTransformer&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;'all-MiniLM-L6-v2'&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;preprocess&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; row&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# Custom preprocessing - join columns, etc.&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;return&lt;/SPAN&gt; &lt;SPAN class="token token string-interpolation"&gt;f"&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;row&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;'field1'&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;}&lt;/SPAN&gt; &lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;row&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;'field2'&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;}&lt;/SPAN&gt; &lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;row&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;'field3'&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;}&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation"&gt;"&lt;/SPAN&gt;

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;predict&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; context&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; model_input&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# Accept a DataFrame, a list, or a single value&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# If DataFrame, apply preprocessing&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;model_input&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; pd&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;DataFrame&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            texts &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; model_input&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;apply&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;preprocess&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; axis&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;1&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;tolist&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;elif&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;model_input&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;list&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            texts &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;preprocess&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;x&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;x&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;dict&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;else&lt;/SPAN&gt; x &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; x &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; model_input&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;else&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            texts &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;str&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;model_input&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;

        &lt;SPAN class="token token"&gt;return&lt;/SPAN&gt; self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;encode&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;texts&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Key Points for Indexing Tables&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When the model is served for inference,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;the input must be a DataFrame, array, or compatible structure&lt;/STRONG&gt;; MLflow Model Serving expects this.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If you want to process tables,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;accept a DataFrame in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;/STRONG&gt;, preprocess each row, and then encode.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;All optional preprocessing must run from within&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;, either inline or through helper methods that it calls.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
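&lt;P&gt;The row-by-row preprocessing described above can be sketched without any MLflow or pandas dependency. This is a minimal illustration, not a definitive implementation: the helper names row_to_text and rows_to_texts are hypothetical, and the field names are taken from the example in this thread. A DataFrame could be converted to the same list-of-dicts shape with to_dict(orient="records") before calling the helper.&lt;/P&gt;

```python
# Column names taken from the thread's example; adjust to your own table.
FIELDS = ("field1", "field2", "field3")


def row_to_text(row):
    """Join the configured fields of one dict-like row into a single string."""
    return " ".join(str(row[name]) for name in FIELDS)


def rows_to_texts(model_input):
    """Normalize a str, a dict, or a list of either into a list of strings."""
    if isinstance(model_input, str):
        return [model_input]
    if isinstance(model_input, dict):
        return [row_to_text(model_input)]
    # Anything else is treated as an iterable of rows or raw texts.
    return [
        row_to_text(item) if isinstance(item, dict) else str(item)
        for item in model_input
    ]


records = [{"field1": "hello", "field2": 3, "field3": 4.2}, "already text"]
print(rows_to_texts(records))
```

&lt;P&gt;The resulting list of strings is what would then be handed to the encoder inside predict.&lt;/P&gt;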
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Troubleshooting the "Index creation failed" Error&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The error likely means&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;does not consume the input structure as expected, or the output is not serializable.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Ensure you&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;return standard, serializable types (lists, NumPy arrays, or pandas DataFrames)&lt;/STRONG&gt;; avoid returning custom objects or types that cannot be serialized easily.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Check that your model serving environment has all dependencies (&lt;CODE&gt;sentence-transformers&lt;/CODE&gt;, etc.).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
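&lt;P&gt;One way to check the serialization point above before deploying is to convert the embeddings to plain lists and round-trip them through JSON. This is a minimal sketch, assuming the output is a 2-D array-like of numbers; to_plain_lists and check_json_round_trip are hypothetical helper names, and the tuples below stand in for real model output.&lt;/P&gt;

```python
import json


def to_plain_lists(vectors):
    """Convert array-like embeddings to nested Python lists of floats."""
    if hasattr(vectors, "tolist"):  # NumPy arrays expose tolist()
        vectors = vectors.tolist()
    return [[float(value) for value in vector] for vector in vectors]


def check_json_round_trip(vectors):
    """Return True if the embeddings survive JSON encoding unchanged."""
    plain = to_plain_lists(vectors)
    return json.loads(json.dumps(plain)) == plain


# Stand-in for model output; real embeddings would come from the encoder.
fake_embeddings = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]
print(to_plain_lists(fake_embeddings))
print(check_json_round_trip(fake_embeddings))
```

&lt;P&gt;If the round trip fails or raises, the output type is a likely candidate for the serving failure.&lt;/P&gt;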
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Final Recommendations&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Implement only&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;load_context&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;, where&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;handles any preprocessing.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Return vector outputs in formats compatible with downstream tooling (usually NumPy arrays or lists).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Test your model locally first:&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class="token token"&gt;as&lt;/SPAN&gt; pd
&lt;SPAN class="token token"&gt;# Instantiate and load the model manually; MLflow only calls load_context when it loads the model itself&lt;/SPAN&gt;
model &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; CustomSentenceTransformerModel&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;load_context&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token boolean"&gt;None&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
data &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; pd&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;DataFrame&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"field1"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"hello"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"field2"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;3&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"field3"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;4.2&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;}&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;predict&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token boolean"&gt;None&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; data&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Fri, 07 Nov 2025 16:49:23 GMT</pubDate>
    <dc:creator>mark_ott</dc:creator>
    <dc:date>2025-11-07T16:49:23Z</dc:date>
    <item>
      <title>Custom sentence transformer for indexing</title>
      <link>https://community.databricks.com/t5/generative-ai/custom-sentence-transformer-for-indexing/m-p/112506#M791</link>
      <description>&lt;P&gt;Hi!&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would like to use my own sentence transformer to create a vector index.&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is not a problem with the MLflow sentence-transformers flavour; it works fine with:&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;mlflow.sentence_transformers.log_model(
    model,
    artifact_path="model",
    signature=signature,
    input_example=sentences,
    registered_model_name=registered_model_name)
  
  model_uri = f"runs:/{run.info.run_id}/model"
  registered_model = mlflow.register_model(
        model_uri=model_uri,
        name=registered_model_name
    )&lt;/LI-CODE&gt;&lt;P&gt;What I want to use is the pyfunc flavour, because I want to add an optional preprocessing step as additional functionality that is glued to the model.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Unfortunately, I can't find any documentation or reference on which methods a custom mlflow.pyfunc.PythonModel should implement.&amp;nbsp;&lt;BR /&gt;I tried something like this:&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import mlflow.pyfunc
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
class MyDataModel(BaseModel):
    field1: str
    field2: int
    field3: float

def process_object(obj: MyDataModel) -&amp;gt; str:
    return f"{obj.field1} {obj.field2} {obj.field3}"

class CustomSentenceTransformerModel(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        # Load the Sentence Transformer model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def process_object(self, obj:MyDataModel):
        # Define your custom processing here
        return f"Processed object with value: {obj}"

    def predict(self, context, model_input):
        # This method is required for MLflow's pyfunc models
        return self.model.encode(model_input)
    
    def encode(self, input):
        return self.model.encode(input)&lt;/LI-CODE&gt;&lt;P&gt;Yet it is not possible to use it for indexing tables.&amp;nbsp;&lt;BR /&gt;I know that I can just run a notebook that will create a new column with vector embeddings, but that's not the point here.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I just get this error:&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Index creation failed
Failed to call Model Serving endpoint: embedding_pyfunc.&lt;/LI-CODE&gt;&lt;P&gt;Without any justification, logs, or anything!&lt;/P&gt;</description>
      <pubDate>Thu, 13 Mar 2025 16:39:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/custom-sentence-transformer-for-indexing/m-p/112506#M791</guid>
      <dc:creator>Ulfzerk</dc:creator>
      <dc:date>2025-03-13T16:39:46Z</dc:date>
    </item>
    <item>
      <title>Re: Custom sentence transformer for indexing</title>
      <link>https://community.databricks.com/t5/generative-ai/custom-sentence-transformer-for-indexing/m-p/138151#M1346</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;To use a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;custom MLflow pyfunc model&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for sentence-transformers&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;with preprocessing&lt;/STRONG&gt;, you need to comply with the expected interface of&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;mlflow.pyfunc.PythonModel&lt;/CODE&gt;, especially the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;method. The method signature, data handling, and serialization are key points. Below is a direct answer with practical explanation and guidelines.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Required Methods for mlflow.pyfunc.PythonModel&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;only method you must implement is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict(self, context, model_input)&lt;/CODE&gt;&lt;/STRONG&gt;.&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;CODE&gt;context&lt;/CODE&gt;: MLflow-provided info (artifacts, configs, etc.).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;CODE&gt;model_input&lt;/CODE&gt;: The input passed during inference (usually a pandas DataFrame, a NumPy array, or native Python types).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Guidelines and Typical Pattern&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Load everything needed in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;load_context&lt;/CODE&gt;, which runs once when the model is loaded by MLflow.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Accept both batch (DataFrame/array) and single-input cases in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The output of&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;should be directly serializable (ideally array-like or DataFrame).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Example Template&lt;/H2&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; mlflow&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;pyfunc
&lt;SPAN class="token token"&gt;from&lt;/SPAN&gt; sentence_transformers &lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; SentenceTransformer
&lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class="token token"&gt;as&lt;/SPAN&gt; pd

&lt;SPAN class="token token"&gt;class&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;CustomSentenceTransformerModel&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;mlflow&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;pyfunc&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;PythonModel&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;load_context&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; context&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;model &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; SentenceTransformer&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;'all-MiniLM-L6-v2'&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;preprocess&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; row&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# Custom preprocessing - join columns, etc.&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;return&lt;/SPAN&gt; &lt;SPAN class="token token string-interpolation"&gt;f"&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;row&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;'field1'&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;}&lt;/SPAN&gt; &lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;row&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;'field2'&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;}&lt;/SPAN&gt; &lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;row&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation"&gt;'field3'&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation interpolation punctuation"&gt;}&lt;/SPAN&gt;&lt;SPAN class="token token string-interpolation"&gt;"&lt;/SPAN&gt;

    &lt;SPAN class="token token"&gt;def&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;predict&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; context&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; model_input&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# Accept a DataFrame, a list, or a single value&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;# If DataFrame, apply preprocessing&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;model_input&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; pd&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;DataFrame&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            texts &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; model_input&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;apply&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;preprocess&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; axis&lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;1&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;tolist&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;elif&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;model_input&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;list&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            texts &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;preprocess&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;x&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;isinstance&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;x&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;dict&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;else&lt;/SPAN&gt; x &lt;SPAN class="token token"&gt;for&lt;/SPAN&gt; x &lt;SPAN class="token token"&gt;in&lt;/SPAN&gt; model_input&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;
        &lt;SPAN class="token token"&gt;else&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
            texts &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;str&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;model_input&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;

        &lt;SPAN class="token token"&gt;return&lt;/SPAN&gt; self&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;encode&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;texts&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Key Points for Indexing Tables&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;At serving or inference time,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;the input must be a DataFrame, array, or compatible structure&lt;/STRONG&gt;; this is what MLflow Model Serving expects.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If you want to process tables,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;accept a DataFrame in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;/STRONG&gt;, preprocess each row, and then encode.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
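&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Any optional preprocessing must run inside&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;, either directly or through a helper it calls, because&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is the method invoked for each request.&lt;/P&gt;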
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Troubleshooting the "Index creation failed" Error&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The error likely means&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;does not consume the input structure as expected, or the output is not serializable.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Ensure you&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;return standard, serializable types (Python lists, NumPy arrays, pandas DataFrames)&lt;/STRONG&gt;; avoid returning custom objects or other types that cannot be serialized easily.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Check that your model serving environment has all dependencies (&lt;CODE&gt;sentence-transformers&lt;/CODE&gt;, etc.).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
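&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;To check the serialization point above before deploying, one option is to confirm that the value returned by&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;survives a JSON round trip. The following is a minimal sketch that uses only the standard library;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;to_serializable&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;fake_embeddings&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;are hypothetical names introduced for illustration, the nested tuples only stand in for an encoder's output, and the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predictions&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;key is an illustrative wrapper rather than a required format.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
import json


def to_serializable(embeddings):
    """Convert an array-like of vectors into nested lists of floats.

    Objects exposing .tolist() (for example NumPy arrays) are converted
    first; plain nested sequences are handled as they are.
    """
    if hasattr(embeddings, "tolist"):
        embeddings = embeddings.tolist()
    return [[float(value) for value in vector] for vector in embeddings]


# Stand-in for an encoder's output: two 3-dimensional vectors (assumed data).
fake_embeddings = ((0.1, 0.2, 0.3), (0.4, 0.5, 0.6))

# Serialize and parse back, as a serving layer and its client would.
payload = json.dumps({"predictions": to_serializable(fake_embeddings)})
restored = json.loads(payload)["predictions"]

# Report the number of vectors and the dimension of the first one.
print(len(restored), len(restored[0]))
```
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;json.dumps&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;raises a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;TypeError&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;when the helper is skipped, the output still contains a type the JSON encoder does not understand.&lt;/P&gt;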
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Final Recommendations&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Implement only&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;load_context&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;, where&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;predict&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;handles any preprocessing.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Return vector outputs in formats compatible with downstream tooling (usually NumPy arrays or lists).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Test your model locally first:&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class="token token"&gt;as&lt;/SPAN&gt; pd

&lt;SPAN class="token token"&gt;# model: an instance of your PythonModel subclass with the encoder already loaded&lt;/SPAN&gt;
data &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; pd&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;DataFrame&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;[&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"field1"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"hello"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"field2"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;3&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;"field3"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt; &lt;SPAN class="token token"&gt;4.2&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;}&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;# predict does not use context in this example, so None is passed&lt;/SPAN&gt;
model&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;predict&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token boolean"&gt;None&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; data&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 07 Nov 2025 16:49:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/custom-sentence-transformer-for-indexing/m-p/138151#M1346</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-07T16:49:23Z</dc:date>
    </item>
  </channel>
</rss>

