Databricks Community

MWojcicki · 3 weeks ago

When exporting a Genie conversation response to PDF, all Polish diacritical characters are systematically replaced with wrong ASCII characters or dropped entirely, making the document unreadable for Polish-speaking users.

Character substitution pattern

Expected	Rendered as	Example in PDF
ł	B	gBównych instead of głównych
ż	|	ró|nica instead of różnica
ś	[	bezpo[rednie instead of bezpośrednie
Ź	y	yródBa instead of Źródła
ę	(dropped)	midzy instead of między
ą	(dropped)	rosn instead of rosną

This affects all Polish characters across the entire document — titles, paragraphs, and table cells.
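
One pattern in the table above is worth spelling out: each substitute equals the low byte of the original character's code point (ł is U+0142 and 0x42 is B; ż is U+017C and 0x7C is |), and the dropped characters are the ones whose low byte is a control code (ę is U+0119, ą is U+0105). The sketch below reproduces the table under that assumption. The function name `low_byte_render` is made up for illustration, and the truncation is only a hypothesis consistent with the observed output, not a confirmed diagnosis of the exporter.

```python
def low_byte_render(text: str) -> str:
    """Simulate a renderer that keeps only the low 8 bits of each code point.

    Assumption for illustration only: this mimics the observed output,
    it is not taken from the Genie or PDFium source code.
    """
    out = []
    for ch in text:
        low = ord(ch) & 0xFF
        # Low bytes below 0x20 are control codes with no visible glyph,
        # so they are treated here as "dropped".
        if low >= 0x20:
            out.append(chr(low))
    return "".join(out)


# Words taken from the substitution table in the post.
for word in ["głównych", "różnica", "bezpośrednie", "Źródła", "między", "rosną"]:
    print(f"{word} -> {low_byte_render(word)}")
```

If the hypothesis holds, other Latin Extended-A text (Czech, Hungarian, Romanian) should be corrupted in the same predictable way, which could be a useful detail for a support case.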

PDF metadata & font analysis

I analyzed the generated PDF with pypdf. Key findings:

Producer: PDFium
Creator: PDFium

Fonts used:
/F1: BaseFont=/Helvetica, Subtype=/Type1, Encoding=/WinAnsiEncoding, ToUnicode=False, Embedded=False
/F2: BaseFont=/PMWIBM+SourceHanSansJP-Bold, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
/F5: BaseFont=/KUZLZL+SourceHanSansJP-Normal, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
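
A minimal sketch of how a font listing like the one above can be produced with pypdf. It is not necessarily the script used for this post, and the helper names (`describe_font`, `is_embedded`, `list_fonts`) are hypothetical. One detail it handles explicitly: for Type0 fonts the /FontDescriptor sits on the descendant CIDFont rather than on the top-level font dictionary, so a check that only looks at the top level will report Embedded=False even when a font file is present. Since a `XXXXXX+` prefix on BaseFont conventionally marks a subset font, the Embedded=False result for /F2 and /F5 may be worth re-checking this way.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def _resolve(obj):
    """Follow an indirect reference if the object offers get_object() (pypdf objects do)."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def _lookup(mapping, key, default=None):
    """Fetch and resolve a key; works for plain dicts and pypdf dictionaries."""
    return _resolve(mapping[key]) if key in mapping else default


def is_embedded(font) -> bool:
    """True if the font, or its CID descendant, carries a font file stream."""
    candidates = [font]
    # Type0 fonts keep their /FontDescriptor on the descendant CIDFont.
    for descendant in _lookup(font, "/DescendantFonts", []):
        candidates.append(_resolve(descendant))
    for candidate in candidates:
        descriptor = _lookup(candidate, "/FontDescriptor")
        if descriptor is not None and any(k in descriptor for k in FONT_FILE_KEYS):
            return True
    return False


def describe_font(font) -> dict:
    """Summarise one font dictionary in the same shape as the listing above."""
    font = _resolve(font)
    return {
        "BaseFont": str(_lookup(font, "/BaseFont", "")),
        "Subtype": str(_lookup(font, "/Subtype", "")),
        "Encoding": str(_lookup(font, "/Encoding", "")),
        "ToUnicode": "/ToUnicode" in font,
        "Embedded": is_embedded(font),
    }


def list_fonts(pdf_path: str) -> dict:
    """Collect font summaries from every page of a PDF file."""
    # Imported lazily so the helpers above can be used without pypdf installed.
    from pypdf import PdfReader

    found = {}
    for page in PdfReader(pdf_path).pages:
        resources = _lookup(page, "/Resources", {})
        fonts = _lookup(resources, "/Font", {})
        for name in fonts:
            found[str(name)] = describe_font(fonts[name])
    return found
```

Calling `list_fonts("export.pdf")` (a placeholder file name) returns a dictionary keyed by resource name such as /F1 or /F2.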

Root cause analysis

SourceHanSansJP is a Japanese CJK font (JP = Japanese). It uses Identity-H encoding with a ToUnicode CMap.
The font is not embedded in the PDF — only referenced.
The ToUnicode CMap appears to incorrectly map Polish diacritical glyphs (Latin Extended-A range: U+0142 ł, U+017C ż, U+015B ś, etc.) to wrong code points, producing the garbled output.
The /Helvetica Type1 font with WinAnsiEncoding covers ó and a few Latin Extended-A letters (such as š and ž), but none of the other Polish diacritics; in any case, the text is routed through the CJK font instead.
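
To make the encoding points above concrete, here is a small check of where each Polish letter sits in Unicode and whether a WinAnsi-style single-byte encoding can represent it. It uses Python's `cp1252` codec as a stand-in for WinAnsiEncoding, which is an assumption: the PDF encoding is based on Windows code page 1252 but is defined separately in the PDF specification. The helper names are illustrative.

```python
POLISH_LETTERS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"


def in_latin_extended_a(ch: str) -> bool:
    """True if the character lies in the Latin Extended-A block (U+0100 to U+017F)."""
    return 0x0100 <= ord(ch) <= 0x017F


def cp1252_encodable(ch: str) -> bool:
    """True if the character exists in Windows-1252 (stand-in for WinAnsiEncoding)."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


for ch in POLISH_LETTERS:
    print(
        f"{ch}  U+{ord(ch):04X}  "
        f"latin_extended_a={in_latin_extended_a(ch)}  "
        f"cp1252={cp1252_encodable(ch)}"
    )
```

Only ó and Ó are encodable in cp1252, and they are also the only two letters outside Latin Extended-A (they belong to Latin-1 Supplement). So a WinAnsi-encoded Helvetica could not carry Polish text on its own; the export needs a font with real Latin Extended-A coverage.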

Expected behavior

Polish diacritical characters (ą, ć, ę, ł, ń, ó, ś, ź, ż) should render correctly in exported PDFs. Apart from ó (U+00F3, Latin-1 Supplement), these are standard Latin Extended-A characters (Unicode range U+0100–U+017F), and all of them are supported by virtually all modern fonts.

Steps to reproduce

1. Open a Genie space
2. Ask a question in Polish (or get a response containing Polish text)
3. Export the conversation/response to PDF
4. Open the PDF: all diacritical characters are corrupted

Environment

Databricks on Azure
Genie (AI/BI) PDF export
PDF generated by PDFium engine
Language: Polish (likely affects other Central/Eastern European languages using Latin Extended: Czech, Hungarian, Romanian, etc.)

WiliamRosa · 3 weeks ago

Hi @MWojcicki

My understanding is that Genie space skills would not solve this issue.

The problem you described appears to be a **PDF rendering/export bug**, not something related to Genie instructions, skills, or table definitions.

Based on your technical analysis, it looks like the PDF export engine (**PDFium**) is routing **Latin Extended-A characters (U+0100–U+017F)** through a **Japanese CJK font (SourceHanSansJP)** instead of a font with proper Central/Eastern European language support. Since the font is not embedded in the generated PDF and the **ToUnicode CMap mapping appears incorrect**, Polish diacritical characters end up being rendered as wrong ASCII symbols or dropped entirely.

Since the Genie response itself is generated correctly and the corruption only happens during PDF export, this strongly suggests the issue is entirely in the rendering/export layer, after content generation.

Because of that, I don’t believe any Genie space customization (skills, instructions, semantic definitions, table configs, etc.) would have any influence over this behavior.

In my opinion, this needs to be fixed by the Databricks product/engineering team, likely in how the PDF export engine handles fonts and Unicode rendering.

I’d recommend opening a support case with Databricks, including your technical findings, as they’re very well documented and should help engineering triage the issue faster:

https://help.databricks.com/

This may also affect other languages using Latin Extended characters, not just Polish.

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa