
Genie PDF export corrupts non-ASCII characters (Polish diacritics ł, ż, ś, ź, ę, ą)

MWojcicki
New Contributor

When exporting a Genie conversation response to PDF, Polish diacritical characters are systematically replaced with unrelated ASCII characters or dropped entirely, making the document unreadable for Polish-speaking users.

Character substitution pattern

Expected    Rendered as    Example in PDF
ł           B              gBównych instead of głównych
ż           |              ró|nica instead of różnica
ś           [              bezpo[rednie instead of bezpośrednie
Ź           y              yródBa instead of Źródła
ę           (dropped)      midzy instead of między
ą           (dropped)      rosn instead of rosną

This affects all Polish characters across the entire document — titles, paragraphs, and table cells.
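
One observation on the substitution pattern, offered as a hypothesis rather than a confirmed cause: every replacement matches the low byte of the original character's Unicode code point (ł is U+0142 and 0x42 is `B`; ż is U+017C and 0x7C is `|`), and the two "dropped" characters land on non-printable control codes (ę is U+0119, ą is U+0105). The sketch below reproduces the examples from the table. The function name `truncate_to_low_byte` is purely illustrative and is not part of any Databricks or PDFium API.

```python
def truncate_to_low_byte(text: str) -> str:
    """Keep only the low 8 bits of each code point and drop control codes.

    This simulates a hypothetical 16-bit to 8-bit truncation; it is not
    taken from the actual export pipeline.
    """
    out = []
    for ch in text:
        low = ord(ch) & 0xFF
        # Control codes (below 0x20 and 0x7F..0x9F) have no visible glyph,
        # so they are treated here as "dropped".
        if low < 0x20 or 0x7F <= low <= 0x9F:
            continue
        out.append(chr(low))
    return "".join(out)


for word in ["głównych", "różnica", "bezpośrednie", "Źródła", "między", "rosną"]:
    print(f"{word} -> {truncate_to_low_byte(word)}")
```

Note that ó (U+00F3) fits in a single byte and therefore survives, which matches the examples above where ó is rendered correctly.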

PDF metadata & font analysis

I analyzed the generated PDF with pypdf. Key findings:

 

Producer: PDFium
Creator: PDFium

Fonts used:
/F1: BaseFont=/Helvetica, Subtype=/Type1, Encoding=/WinAnsiEncoding, ToUnicode=False, Embedded=False
/F2: BaseFont=/PMWIBM+SourceHanSansJP-Bold, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
/F5: BaseFont=/KUZLZL+SourceHanSansJP-Normal, Subtype=/Type0, Encoding=/Identity-H, ToUnicode=True, Embedded=False
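
The script that produced this listing is not shown, so here is a minimal sketch of how a similar report could be generated with pypdf. The helper names (`resolve`, `is_embedded`, `describe_font`, `list_fonts`) are illustrative. One caveat worth checking: for Type0 (composite) fonts the font program is referenced from the descendant CIDFont's `/FontDescriptor`, so a check against the top-level font dictionary alone can report `Embedded=False` for a font that is in fact embedded as a subset. Whether that applies to this particular PDF is not known.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def resolve(obj):
    """Follow a pypdf indirect reference; plain objects are returned unchanged."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def is_embedded(font) -> bool:
    """True if the font or one of its descendant CIDFonts carries a font program."""
    font = resolve(font)
    candidates = [font]
    # Type0 fonts keep their descriptor on the descendant CIDFont.
    for child in resolve(font.get("/DescendantFonts", [])):
        candidates.append(resolve(child))
    for cand in candidates:
        descriptor = resolve(cand.get("/FontDescriptor", {}))
        if any(key in descriptor for key in FONT_FILE_KEYS):
            return True
    return False


def describe_font(font) -> dict:
    """Summarise the fields reported in the listing above."""
    font = resolve(font)
    return {
        "BaseFont": str(font.get("/BaseFont")),
        "Subtype": str(font.get("/Subtype")),
        "Encoding": str(font.get("/Encoding")),
        "ToUnicode": "/ToUnicode" in font,
        "Embedded": is_embedded(font),
    }


def list_fonts(pdf_path: str) -> dict:
    """Collect font summaries from every page (inherited resources are not handled)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    fonts = {}
    for page in reader.pages:
        resources = resolve(page.get("/Resources", {}))
        for name, ref in resolve(resources.get("/Font", {})).items():
            fonts[name] = describe_font(ref)
    return fonts
```

Calling `list_fonts("export.pdf")` (the file name is a placeholder) returns one summary per font resource name, which can be compared against the listing above.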

Root cause analysis

  1. SourceHanSansJP is a Japanese CJK font (JP = Japanese). It uses Identity-H encoding with a ToUnicode CMap.
  2. The font is not embedded in the PDF — only referenced.
  3. The ToUnicode CMap appears to incorrectly map Polish diacritical glyphs (Latin Extended-A range: U+0142 ł, U+017C ż, U+015B ś, etc.) to wrong code points, producing the garbled output.
  4. The /Helvetica Type1 font with WinAnsiEncoding could handle some Latin Extended characters (for example Š and Ž), though not the Polish-specific ones; in any case the text is routed through the CJK font instead.
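
As a rough check on point 4, the sketch below tests which Polish letters a WinAnsi-style single-byte encoding can represent. It uses Python's `cp1252` codec as a stand-in for WinAnsiEncoding, which is an assumption: the two agree on these letters but are not identical in every code slot. The helper name `encodable` is illustrative.

```python
POLISH = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"


def encodable(ch: str, codec: str = "cp1252") -> bool:
    """Return True if the character has a slot in the given single-byte codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


supported = [ch for ch in POLISH if encodable(ch)]
unsupported = [ch for ch in POLISH if not encodable(ch)]

print("cp1252 can encode:   ", "".join(supported))
print("cp1252 cannot encode:", "".join(unsupported))
```

Under this assumption only ó and Ó have a slot, so a WinAnsi-encoded Helvetica on its own would not cover Polish text either; a font and encoding with full Latin Extended-A coverage would be needed.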

Expected behavior

Polish diacritical characters (ą, ć, ę, ł, ń, ó, ś, ź, ż) should render correctly in exported PDFs. Apart from ó (U+00F3, Latin-1 Supplement), these are standard Latin Extended-A characters (Unicode range U+0100–U+017F), supported by virtually all modern fonts.

Steps to reproduce

  1. Open a Genie space
  2. Ask a question in Polish (or get a response containing Polish text)
  3. Export the conversation/response to PDF
  4. Open the PDF — all diacritical characters are corrupted

Environment

  • Databricks on Azure
  • Genie (AI/BI) PDF export
  • PDF generated by PDFium engine
  • Language: Polish (likely affects other Central/Eastern European languages using Latin Extended: Czech, Hungarian, Romanian, etc.)

WiliamRosa
Databricks Partner

Hi @MWojcicki 

My understanding is that Genie space skills would not solve this issue.

The problem you described appears to be a **PDF rendering/export bug**, not something related to Genie instructions, skills, or table definitions.

Based on your technical analysis, it looks like the PDF export engine (**PDFium**) is routing **Latin Extended-A characters (U+0100–U+017F)** through a **Japanese CJK font (SourceHanSansJP)** instead of a font with proper Central/Eastern European language support. Since the font is not embedded in the generated PDF and the **ToUnicode CMap mapping appears incorrect**, Polish diacritical characters end up being rendered as wrong ASCII symbols or dropped entirely.

Since the Genie response itself is generated correctly and the corruption only happens during PDF export, this strongly suggests the issue is entirely in the rendering/export layer, after content generation.

Because of that, I don’t believe any Genie space customization (skills, instructions, semantic definitions, table configs, etc.) would have any influence over this behavior.

In my opinion, this needs to be fixed by the Databricks product/engineering team, likely in how the PDF export engine handles fonts and Unicode rendering.

I’d recommend opening a support case with Databricks, including your technical findings, as they’re very well documented and should help engineering triage the issue faster:

https://help.databricks.com/

This may also affect other languages using Latin Extended characters, not just Polish.

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa