<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Genie PDF export corrupts non-ASCII characters (Polish diacritics ł, ż, ś, ź, ę, ą) in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/genie-pdf-export-corrupts-non-ascii-characters-polish-diacritics/m-p/156604#M1803</link>
    <description>&lt;P&gt;When exporting a Genie conversation response to PDF, &lt;STRONG&gt;Polish diacritical characters&lt;/STRONG&gt; are systematically replaced with unrelated ASCII characters or dropped, making the document unreadable for Polish-speaking users.&lt;/P&gt;&lt;H3&gt;Character substitution pattern&lt;/H3&gt;&lt;DIV class=""&gt;&lt;DIV&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TH&gt;Expected&lt;/TH&gt;&lt;TH&gt;Rendered as&lt;/TH&gt;&lt;TH&gt;Example in PDF&lt;/TH&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;ł&lt;/TD&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;gBównych instead of głównych&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;ż&lt;/TD&gt;&lt;TD&gt;|&lt;/TD&gt;&lt;TD&gt;ró|nica instead of różnica&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;ś&lt;/TD&gt;&lt;TD&gt;[&lt;/TD&gt;&lt;TD&gt;bezpo[rednie instead of bezpośrednie&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Ź&lt;/TD&gt;&lt;TD&gt;y&lt;/TD&gt;&lt;TD&gt;yródBa instead of Źródła&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;ę&lt;/TD&gt;&lt;TD&gt;(dropped)&lt;/TD&gt;&lt;TD&gt;midzy instead of między&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;ą&lt;/TD&gt;&lt;TD&gt;(dropped)&lt;/TD&gt;&lt;TD&gt;rosn instead of rosną&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;This affects Polish diacritics across the entire document (ó is the one exception in the examples above): titles, paragraphs, and table 
cells.&lt;/P&gt;&lt;H3&gt;PDF metadata &amp;amp; font analysis&lt;/H3&gt;&lt;P&gt;I analyzed the generated PDF with pypdf. Key findings:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P&gt;Producer: PDFium&lt;BR /&gt;Creator: PDFium&lt;/P&gt;&lt;P&gt;Fonts used:&lt;BR /&gt;/F1: BaseFont=/Helvetica, Subtype=/Type1, Encoding=/WinAnsiEncoding, ToUnicode=False, Embedded=False&lt;BR /&gt;/F2: BaseFont=/PMWIBM+SourceHanSansJP-Bold, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False&lt;BR /&gt;/F5: BaseFont=/KUZLZL+SourceHanSansJP-Normal, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;H3&gt;Root cause analysis&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;SourceHanSansJP&lt;/STRONG&gt; is a &lt;STRONG&gt;Japanese&lt;/STRONG&gt; CJK font (JP = Japanese). It uses Identity-H encoding with a ToUnicode CMap.&lt;/LI&gt;&lt;LI&gt;The font is &lt;STRONG&gt;not embedded&lt;/STRONG&gt; in the PDF, only referenced.&lt;/LI&gt;&lt;LI&gt;The ToUnicode CMap appears to &lt;STRONG&gt;incorrectly map Polish diacritical glyphs&lt;/STRONG&gt; (Latin Extended-A: U+0142 ł, U+017C ż, U+015B ś, etc.) 
to wrong code points, producing the garbled output.&lt;/LI&gt;&lt;LI&gt;The /Helvetica Type1 font with WinAnsiEncoding covers ó but none of the other Polish diacritics, so it could not render this text on its own; in any case, the text is routed through the CJK font instead.&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;Expected behavior&lt;/H3&gt;&lt;P&gt;Polish diacritical characters (ą, ć, ę, ł, ń, ó, ś, ź, ż) should render correctly in exported PDFs. Apart from ó (U+00F3, Latin-1 Supplement), these are standard Latin Extended-A characters (Unicode range U+0100–U+017F), supported by virtually all modern fonts.&lt;/P&gt;&lt;H3&gt;Steps to reproduce&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;Open a Genie space&lt;/LI&gt;&lt;LI&gt;Ask a question in Polish (or get a response containing Polish text)&lt;/LI&gt;&lt;LI&gt;Export the conversation/response to PDF&lt;/LI&gt;&lt;LI&gt;Open the PDF: the diacritical characters are corrupted&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;Environment&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;Databricks on Azure&lt;/LI&gt;&lt;LI&gt;Genie (AI/BI) PDF export&lt;/LI&gt;&lt;LI&gt;PDF generated by PDFium engine&lt;/LI&gt;&lt;LI&gt;Language: Polish (likely affects other Central/Eastern European languages using Latin Extended: Czech, Hungarian, Romanian, etc.)&lt;/LI&gt;&lt;/UL&gt;</description>
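    <content:encoded><![CDATA[
One pattern in the substitution table may help narrow down the root cause: each replacement character equals the low byte of the original character's Unicode code point (ł is U+0142 and 0x42 is "B"; ż is U+017C and 0x7C is "|"; Ź is U+0179 and 0x79 is "y"), and the two dropped letters map to invisible control characters (ę is U+0119, ą is U+0105). The sketch below simulates that truncation and compares it with the reported examples. It is only a hypothesis about the failure mode, not a confirmed description of the Genie export pipeline, and the function name truncate_to_low_byte is illustrative.

```python
import unicodedata


def truncate_to_low_byte(text: str) -> str:
    """Simulate the suspected bug (an assumption, not confirmed behavior):
    keep only the low 8 bits of every code point above U+00FF and drop the
    result when it is a non-printable control character."""
    out = []
    # Normalize to precomposed form so each letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        cp = ord(ch)
        if cp > 0xFF:
            cp &= 0xFF
            if cp < 0x20:  # control character: nothing visible is rendered
                continue
        out.append(chr(cp))
    return "".join(out)


# Original word -> corrupted form reported in the substitution table.
EXAMPLES = {
    "głównych": "gBównych",
    "różnica": "ró|nica",
    "bezpośrednie": "bezpo[rednie",
    "Źródła": "yródBa",
    "między": "midzy",
    "rosną": "rosn",
}

for original, observed in EXAMPLES.items():
    simulated = truncate_to_low_byte(original)
    matches = simulated == unicodedata.normalize("NFC", observed)
    print(f"{original!r} -> {simulated!r} (matches report: {matches})")
```

If the simulation matches every reported example, that would suggest 16-bit character codes are being narrowed to 8 bits somewhere before or during PDF generation, which also explains why ó (U+00F3, already within one byte) survives.
]]></content:encoded>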
    <pubDate>Mon, 11 May 2026 17:14:46 GMT</pubDate>
    <dc:creator>MWojcicki</dc:creator>
    <dc:date>2026-05-11T17:14:46Z</dc:date>
    <item>
      <title>Genie PDF export corrupts non-ASCII characters (Polish diacritics ł, ż, ś, ź, ę, ą)</title>
      <link>https://community.databricks.com/t5/generative-ai/genie-pdf-export-corrupts-non-ascii-characters-polish-diacritics/m-p/156604#M1803</link>
      <pubDate>Mon, 11 May 2026 17:14:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/genie-pdf-export-corrupts-non-ascii-characters-polish-diacritics/m-p/156604#M1803</guid>
      <dc:creator>MWojcicki</dc:creator>
      <dc:date>2026-05-11T17:14:46Z</dc:date>
    </item>
  </channel>
</rss>

