
Genie PDF export corrupts non-ASCII characters (Polish diacritics ł, ż, ś, ź, ę, ą)

MWojcicki
New Contributor

When exporting a Genie conversation response to PDF, all Polish diacritical characters are systematically replaced with unrelated ASCII characters or dropped entirely, making the document unreadable for Polish-speaking users.

Character substitution pattern

Expected   Rendered as   Example in PDF
ł          B             gBównych instead of głównych
ż          |             ró|nica instead of różnica
ś          [             bezpo[rednie instead of bezpośrednie
Ź          y             yródBa instead of Źródła
ę          (dropped)     midzy instead of między
ą          (dropped)     rosn instead of rosną
 
 

This affects all Polish characters across the entire document — titles, paragraphs, and table cells.
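
One observation that may help triage: every substitution in the table above matches what you get by keeping only the low 8 bits of the character's Unicode code point (ł is U+0142 and 0x42 is "B"; ę is U+0119 and 0x19 is an invisible control character). The sketch below reproduces the table under that assumption. The function name `garble` is made up for illustration, and the match does not prove where in the export pipeline the truncation would happen.

```python
import unicodedata


def garble(text):
    """Simulate truncating each code point to its low byte.

    Assumption (not confirmed by Databricks): the exporter keeps only the
    low 8 bits of each code point. Non-printable results are dropped,
    mimicking characters that vanish in the PDF.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        low = chr(ord(ch) & 0xFF)  # keep only the low byte
        if low.isprintable():
            out.append(low)
    return "".join(out)


if __name__ == "__main__":
    for word in ["głównych", "różnica", "bezpośrednie", "Źródła", "między", "rosną"]:
        print(f"{word} -> {garble(word)}")
```

Note that ó (U+00F3) survives unchanged because its code point already fits in one byte, which agrees with the examples in the table.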

PDF metadata & font analysis

I analyzed the generated PDF with pypdf. Key findings:

 

Producer: PDFium
Creator: PDFium

Fonts used:
/F1: BaseFont=/Helvetica, Subtype=/Type1, Encoding=/WinAnsiEncoding, ToUnicode=False, Embedded=False
/F2: BaseFont=/PMWIBM+SourceHanSansJP-Bold, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
/F5: BaseFont=/KUZLZL+SourceHanSansJP-Normal, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
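
For anyone who wants to repeat this check, here is a minimal sketch of how such a font listing could be produced with pypdf. It is not the exact script used for the output above; the helper names (`resolve`, `describe_font`, `list_fonts`) are made up for illustration. One detail worth noting: for Type0 fonts the FontDescriptor sits on the descendant CIDFont, so a check that reads only the top-level font dictionary will report "not embedded" even when a font file is present. This sketch looks at the descendant font.

```python
def resolve(obj):
    """Follow a pypdf indirect reference; plain Python objects pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def describe_font(font):
    """Summarize one font dictionary (pypdf DictionaryObject or plain dict)."""
    font = resolve(font)
    holder = font
    # Type0 (composite) fonts keep their descriptor on the descendant CIDFont.
    if "/DescendantFonts" in font:
        descendants = resolve(font["/DescendantFonts"])
        holder = resolve(descendants[0])
    descriptor = {}
    if "/FontDescriptor" in holder:
        descriptor = resolve(holder["/FontDescriptor"])
    # An embedded font program is stored under one of these three keys.
    embedded = any(
        key in descriptor for key in ("/FontFile", "/FontFile2", "/FontFile3")
    )
    return {
        "BaseFont": str(font["/BaseFont"]) if "/BaseFont" in font else None,
        "Subtype": str(font["/Subtype"]) if "/Subtype" in font else None,
        "Encoding": str(font["/Encoding"]) if "/Encoding" in font else None,
        "ToUnicode": "/ToUnicode" in font,
        "Embedded": embedded,
    }


def list_fonts(path):
    """Collect font summaries from every page of the PDF at `path`."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    fonts = {}
    for page in PdfReader(path).pages:
        # Simplification: resources inherited from a parent node are ignored.
        if "/Resources" not in page:
            continue
        resources = resolve(page["/Resources"])
        if "/Font" not in resources:
            continue
        for name, font in resolve(resources["/Font"]).items():
            fonts[str(name)] = describe_font(font)
    return fonts
```

Calling `list_fonts("export.pdf")` (the file name is a placeholder) returns one summary dictionary per font resource name such as /F1 or /F2.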

Root cause analysis

  1. SourceHanSansJP is a Japanese CJK font (JP = Japanese). It uses Identity-H encoding with a ToUnicode CMap.
  2. The font is not embedded in the PDF — only referenced.
  3. The ToUnicode CMap appears to incorrectly map Polish diacritical glyphs (Latin Extended-A range: U+0142 ł, U+017C ż, U+015B ś, etc.) to wrong code points, producing the garbled output.
  4. The /Helvetica Type1 font with WinAnsiEncoding covers only a handful of Latin Extended-A characters (such as Š and Ž) and, of the Polish diacritics, only ó, so it cannot render Polish text on its own; in any case, the text is routed through the CJK font instead.

Expected behavior

Polish diacritical characters (ą, ć, ę, ł, ń, ó, ś, ź, ż) should render correctly in exported PDFs. Apart from ó (U+00F3, Latin-1 Supplement), these are standard Latin Extended-A characters (Unicode range U+0100–U+017F), supported by virtually all modern fonts.
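
To make the character ranges concrete, the sketch below lists each lowercase Polish diacritic with its code point, whether it falls in Latin Extended-A, and whether it can be encoded in Windows-1252. Python's `cp1252` codec is used here as a stand-in for the PDF WinAnsiEncoding; that is an assumption, although the two agree for these letters. The `classify` helper is hypothetical.

```python
import unicodedata

# Lowercase Polish diacritics, normalized so each is a single code point.
POLISH = unicodedata.normalize("NFC", "ąćęłńóśźż")


def classify(ch):
    """Report code point, Unicode name, block membership, and cp1252 coverage."""
    cp = ord(ch)
    try:
        ch.encode("cp1252")  # stand-in for WinAnsiEncoding (assumption)
        in_cp1252 = True
    except UnicodeEncodeError:
        in_cp1252 = False
    return {
        "codepoint": f"U+{cp:04X}",
        "name": unicodedata.name(ch),
        "latin_extended_a": 0x0100 <= cp <= 0x017F,
        "cp1252": in_cp1252,
    }


if __name__ == "__main__":
    for ch in POLISH:
        info = classify(ch)
        print(ch, info["codepoint"], info["latin_extended_a"], info["cp1252"])
```

Only ó is encodable in cp1252, so a single-byte WinAnsi font cannot carry the other eight letters; the export needs a font and encoding that cover Latin Extended-A.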

Steps to reproduce

  1. Open a Genie space
  2. Ask a question in Polish (or get a response containing Polish text)
  3. Export the conversation/response to PDF
  4. Open the PDF — all diacritical characters are corrupted

Environment

  • Databricks on Azure
  • Genie (AI/BI) PDF export
  • PDF generated by PDFium engine
  • Language: Polish (likely affects other Central/Eastern European languages using Latin Extended: Czech, Hungarian, Romanian, etc.)