<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DB vector search tutorial with GTE large - BPE tokenizer correct? in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122444#M4126</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I was trying to implement a vector search use case based on the Databricks example notebook with GTE large:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/notebooks/source/generative-ai/vector-search-foundation-embedding-model-gte-example.html" target="_blank"&gt;https://docs.databricks.com/aws/en/notebooks/source/generative-ai/vector-search-foundation-embedding-model-gte-example.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;For chunking, the notebook uses the BPE encoding cl100k_base, the same one used in the equivalent example notebook built around an OpenAI model.&lt;/P&gt;&lt;P&gt;Is this correct? I couldn't find any info on the tokenizer encoding in the original GTE large paper or anywhere on the web. Does GTE large really use BPE with exactly the same encoding as the newer OpenAI models, or is this an error in the tutorial notebook? Should one rather use AutoTokenizer from Hugging Face?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Sat, 21 Jun 2025 22:34:57 GMT</pubDate>
    <dc:creator>Kronos</dc:creator>
    <dc:date>2025-06-21T22:34:57Z</dc:date>
    <item>
      <title>DB vector search tutorial with GTE large - BPE tokenizer correct?</title>
      <link>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122444#M4126</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I was trying to implement a vector search use case based on the Databricks example notebook with GTE large:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/notebooks/source/generative-ai/vector-search-foundation-embedding-model-gte-example.html" target="_blank"&gt;https://docs.databricks.com/aws/en/notebooks/source/generative-ai/vector-search-foundation-embedding-model-gte-example.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;For chunking, the notebook uses the BPE encoding cl100k_base, the same one used in the equivalent example notebook built around an OpenAI model.&lt;/P&gt;&lt;P&gt;Is this correct? I couldn't find any info on the tokenizer encoding in the original GTE large paper or anywhere on the web. Does GTE large really use BPE with exactly the same encoding as the newer OpenAI models, or is this an error in the tutorial notebook? Should one rather use AutoTokenizer from Hugging Face?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;
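&lt;P&gt;P.S. To make the alternative concrete, here is a minimal sketch of what I mean by using the model's own tokenizer for chunk-size budgeting (the Alibaba-NLP/gte-large-en-v1.5 repo on the Hugging Face Hub and the sample text are assumptions for illustration):&lt;/P&gt;&lt;PRE&gt;# Sketch: count tokens with the tokenizer bundled with the GTE checkpoint,
# instead of assuming tiktoken's cl100k_base BPE encoding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-large-en-v1.5")

# Shows which tokenizer class the checkpoint actually ships with.
print(type(tokenizer).__name__)

def count_tokens(text):
    # add_special_tokens=False counts only content tokens, which is what
    # matters when budgeting chunk sizes against the model's context limit.
    return len(tokenizer.encode(text, add_special_tokens=False))

print(count_tokens("Databricks Vector Search with GTE large"))&lt;/PRE&gt;</description>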
      <pubDate>Sat, 21 Jun 2025 22:34:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122444#M4126</guid>
      <dc:creator>Kronos</dc:creator>
      <dc:date>2025-06-21T22:34:57Z</dc:date>
    </item>
    <item>
      <title>Re: DB vector search tutorial with GTE large - BPE tokenizer correct?</title>
      <link>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122805#M4135</link>
      <description>&lt;P&gt;I created a vector index using databricks-gte-large-en and it worked fine.&lt;/P&gt;
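&lt;P&gt;Roughly what my setup looked like, as a sketch (the endpoint, catalog, schema, and table names below are placeholders; it uses the databricks-vectorsearch SDK to create a Delta Sync index with managed embeddings):&lt;/P&gt;&lt;PRE&gt;# Sketch: Delta Sync index whose embeddings are computed server-side
# by the databricks-gte-large-en foundation model endpoint.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

index = client.create_delta_sync_index(
    endpoint_name="my_vs_endpoint",                 # existing Vector Search endpoint
    index_name="main.default.docs_index",           # placeholder Unity Catalog path
    source_table_name="main.default.chunked_docs",  # Delta table holding the chunks
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",                 # column the service embeds
    embedding_model_endpoint_name="databricks-gte-large-en",
)&lt;/PRE&gt;</description>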
      <pubDate>Wed, 25 Jun 2025 12:14:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122805#M4135</guid>
      <dc:creator>MariuszK</dc:creator>
      <dc:date>2025-06-25T12:14:53Z</dc:date>
    </item>
    <item>
      <title>Re: DB vector search tutorial with GTE large - BPE tokenizer correct?</title>
      <link>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122949#M4137</link>
      <description>&lt;P&gt;I just found the definition, and it is indeed WordPiece tokenization.&lt;/P&gt;&lt;P&gt;So I think the tutorial is wrong.&lt;/P&gt;&lt;P&gt;&lt;A href="https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5/blob/main/tokenizer.json" target="_blank"&gt;https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5/blob/main/tokenizer.json&lt;/A&gt;&lt;/P&gt;
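&lt;P&gt;A quick sketch to see the mismatch in practice (assuming tiktoken and transformers are installed; the sample sentence is arbitrary): the same text gets different token counts under cl100k_base and the WordPiece tokenizer from the repo above, so chunk budgets computed with the wrong tokenizer will be off.&lt;/P&gt;&lt;PRE&gt;# Sketch: compare token counts from cl100k_base (used in the tutorial)
# with the WordPiece tokenizer the GTE large checkpoint ships with.
import tiktoken
from transformers import AutoTokenizer

bpe = tiktoken.get_encoding("cl100k_base")
wordpiece = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-large-en-v1.5")

sample = "Vector search retrieves semantically similar chunks of text."

print(len(bpe.encode(sample)))                                  # BPE count
print(len(wordpiece.encode(sample, add_special_tokens=False)))  # WordPiece count&lt;/PRE&gt;</description>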
      <pubDate>Thu, 26 Jun 2025 12:13:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/db-vector-search-tutorial-with-gte-large-bpe-tokenizer-correct/m-p/122949#M4137</guid>
      <dc:creator>Kronos</dc:creator>
      <dc:date>2025-06-26T12:13:44Z</dc:date>
    </item>
  </channel>
</rss>

