When a Genie conversation response is exported to PDF, Polish diacritical characters are systematically replaced with incorrect ASCII characters or dropped entirely, making the document unreadable for Polish-speaking users.
Character substitution pattern
| Expected | Rendered as | Example in PDF |
|---|---|---|
| ł | B | gBównych instead of głównych |
| ż | \| | ró\|nica instead of różnica |
| ś | [ | bezpo[rednie instead of bezpośrednie |
| Ź | y | yródBa instead of Źródła |
| ę | (dropped) | midzy instead of między |
| ą | (dropped) | rosn instead of rosną |
The corruption appears throughout the document: titles, paragraphs, and table cells. In the examples above, ó is the only Polish diacritic that survives.
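One pattern worth noting: every substitution in the table matches what you would get by keeping only the low byte of the character's Unicode code point. For example, ł is U+0142 and 0x42 is `B`, while ę is U+0119 and 0x19 is a non-printing control code. The Python sketch below reproduces the examples under that assumption. It describes the symptom only and is not a confirmed account of what the export pipeline does. The function name `truncate_to_low_byte` is hypothetical.

```python
import unicodedata


def truncate_to_low_byte(text: str) -> str:
    """Simulate the observed corruption: keep only the low byte of each code point.

    Assumption (not confirmed against the export pipeline): results that land in a
    control range are treated as dropped, matching the "(dropped)" rows in the table.
    """
    out = []
    # Normalize to NFC so each Polish letter is a single precomposed code point.
    for ch in unicodedata.normalize("NFC", text):
        low = ord(ch) & 0xFF
        if low < 0x20 or 0x7F <= low <= 0x9F:
            continue  # control character: nothing visible is rendered
        out.append(chr(low))
    return "".join(out)


if __name__ == "__main__":
    for word in ["głównych", "różnica", "bezpośrednie", "Źródła", "między", "rosną"]:
        print(f"{word} -> {truncate_to_low_byte(word)}")
```

Under this model ó (U+00F3) already fits in a single byte, which would explain why it is the one diacritic that comes through intact.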
PDF metadata & font analysis
I analyzed the generated PDF with pypdf. Key findings:
Producer: PDFium
Creator: PDFium
Fonts used:
/F1: BaseFont=/Helvetica, Subtype=/Type1, Encoding=/WinAnsiEncoding, ToUnicode=False, Embedded=False
/F2: BaseFont=/PMWIBM+SourceHanSansJP-Bold, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
/F5: BaseFont=/KUZLZL+SourceHanSansJP-Normal, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
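For anyone who wants to repeat the analysis, a font listing like the one above can be produced with a short pypdf script along these lines. This is a sketch, not the exact script used for this report, and the helper names (`describe_font`, `list_fonts`) are hypothetical. One caveat: for Type0 (composite) fonts, the embedded font program is referenced from the descendant CIDFont's font descriptor rather than from the top-level font dictionary. The sketch therefore checks both levels, since a check on the top-level dictionary alone would report such fonts as not embedded.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def _resolve(obj):
    """Follow a pypdf indirect reference if there is one; plain objects pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def describe_font(font):
    """Summarize one font dictionary taken from a page's /Resources /Font entry."""
    font = _resolve(font)
    # Type0 fonts keep their descriptor on the descendant CIDFont, so check both levels.
    candidates = [font]
    for descendant in _resolve(font.get("/DescendantFonts", [])):
        candidates.append(_resolve(descendant))
    embedded = False
    for candidate in candidates:
        descriptor = _resolve(candidate.get("/FontDescriptor", {}))
        if any(key in descriptor for key in FONT_FILE_KEYS):
            embedded = True
    return {
        "base_font": str(_resolve(font.get("/BaseFont", ""))),
        "subtype": str(_resolve(font.get("/Subtype", ""))),
        "encoding": str(_resolve(font.get("/Encoding", ""))),
        "to_unicode": "/ToUnicode" in font,
        "embedded": embedded,
    }


def list_fonts(pdf_path):
    """Collect font summaries for every page of the PDF at pdf_path."""
    from pypdf import PdfReader  # third-party dependency: pip install pypdf

    fonts = {}
    for page in PdfReader(pdf_path).pages:
        # Simplification: only looks at resources set directly on the page object.
        resources = _resolve(page.get("/Resources", {}))
        for name, font in _resolve(resources.get("/Font", {})).items():
            fonts[str(name)] = describe_font(font)
    return fonts


if __name__ == "__main__":
    import sys

    for name, info in list_fonts(sys.argv[1]).items():
        print(name, info)
```

Run it with the path to the exported PDF as the only argument.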
Root cause analysis
- SourceHanSansJP is a Japanese CJK font (JP = Japanese). It uses Identity-H encoding with a ToUnicode CMap.
- The font is not embedded in the PDF — only referenced.
- The ToUnicode CMap appears to incorrectly map Polish diacritical glyphs (Latin Extended-A range: U+0142 ł, U+017C ż, U+015B ś, etc.) to wrong code points, producing the garbled output.
- The /Helvetica Type1 font with WinAnsiEncoding covers only a handful of Latin Extended-A characters (such as š and ž) and, among the Polish diacritics, only ó. The remaining text is routed through the CJK font instead.
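Which Polish letters a WinAnsiEncoding font could represent can be checked with a short Python sketch. It uses Python's `cp1252` codec as a stand-in for PDF's WinAnsiEncoding. That is an assumption: the two are closely related, but they are not guaranteed to agree in every slot. The function name is hypothetical.

```python
import unicodedata

POLISH_DIACRITICS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"


def encodable_in_cp1252(chars: str) -> list[str]:
    """Return the characters from `chars` that Windows-1252 can represent."""
    result = []
    # Normalize to NFC so each letter is tested as one precomposed code point.
    for ch in unicodedata.normalize("NFC", chars):
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            continue  # no slot for this character in Windows-1252
        result.append(ch)
    return result


if __name__ == "__main__":
    print(encodable_in_cp1252(POLISH_DIACRITICS))
```

Only ó and Ó pass the check, so a WinAnsiEncoding font alone cannot carry Polish text, and any fix has to involve a font and encoding that cover Latin Extended-A.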
Expected behavior
Polish diacritical characters (ą, ć, ę, ł, ń, ó, ś, ź, ż) should render correctly in exported PDFs. Apart from ó (U+00F3, Latin-1 Supplement), these are standard Latin Extended-A characters (Unicode range U+0100–U+017F), supported by virtually all modern fonts.
Steps to reproduce
- Open a Genie space
- Ask a question in Polish (or get a response containing Polish text)
- Export the conversation/response to PDF
- Open the PDF — all diacritical characters are corrupted
Environment
- Databricks on Azure
- Genie (AI/BI) PDF export
- PDF generated by PDFium engine
- Language: Polish (likely affects other Central/Eastern European languages using Latin Extended: Czech, Hungarian, Romanian, etc.)