<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Unable to install poppler-utils in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/100983#M40499</link>
    <description>&lt;P&gt;I am using my personal cluster but am still getting the same error.&lt;/P&gt;</description>
    <pubDate>Wed, 04 Dec 2024 22:26:54 GMT</pubDate>
    <dc:creator>Arunraja</dc:creator>
    <dc:date>2024-12-04T22:26:54Z</dc:date>
    <item>
      <title>Unable to install poppler-utils</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/40117#M27129</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm trying to install the system-level package "poppler-utils" on my cluster. I added the following line to the init.sh script:&lt;/P&gt;&lt;P&gt;sudo apt-get -f -y install poppler-utils&lt;/P&gt;&lt;P&gt;I get the following error:&amp;nbsp;&lt;EM&gt;PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;If I run the same command at the notebook level, I don't get this error.&lt;/P&gt;&lt;P&gt;Can anyone help me with this issue and explain how to install system-level packages at the cluster level with init scripts?&lt;/P&gt;</description>
      <pubDate>Wed, 16 Aug 2023 19:30:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/40117#M27129</guid>
      <dc:creator>Deloitte_DS</dc:creator>
      <dc:date>2023-08-16T19:30:00Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to install poppler-utils</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/40482#M27210</link>
      <description>&lt;P&gt;Hi Kaniz, I tried to include it in the init script, but it still shows the same error. The path I gave is "usr/bin". How can I navigate to this path to check whether my package is installed? I would also like to know how to navigate to databricks/bin/python and how to check the environment variables.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Aug 2023 16:36:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/40482#M27210</guid>
      <dc:creator>Deloitte_DS</dc:creator>
      <dc:date>2023-08-18T16:36:16Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to install poppler-utils</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/89177#M37715</link>
      <description>&lt;DIV&gt;Use a personal cluster and run:&lt;/DIV&gt;&lt;DIV&gt;!sudo apt-get update&lt;/DIV&gt;&lt;DIV&gt;and&lt;/DIV&gt;&lt;DIV&gt;!sudo apt-get install -y poppler-utils&lt;/DIV&gt;</description>
      <pubDate>Mon, 09 Sep 2024 12:07:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/89177#M37715</guid>
      <dc:creator>dheeraj-cir</dc:creator>
      <dc:date>2024-09-09T12:07:14Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to install poppler-utils</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/100983#M40499</link>
      <description>&lt;P&gt;I am using my personal cluster but am still getting the same error.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Dec 2024 22:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/100983#M40499</guid>
      <dc:creator>Arunraja</dc:creator>
      <dc:date>2024-12-04T22:26:54Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to install poppler-utils</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/105698#M42244</link>
      <description>&lt;P&gt;Hi, the following worked for me. I created an init script for my compute with this code:&lt;/P&gt;&lt;P&gt;sudo rm -r /var/lib/apt/lists/*&lt;BR /&gt;sudo apt clean &amp;amp;&amp;amp; sudo apt update --fix-missing -y&lt;BR /&gt;sudo apt-get install poppler-utils tesseract-ocr -y&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jan 2025 11:49:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/105698#M42244</guid>
      <dc:creator>kbmv</dc:creator>
      <dc:date>2025-01-15T11:49:25Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to install poppler-utils</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/106570#M42522</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;
&lt;P&gt;If you use a single user cluster with the init script below, it will work:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;sudo rm -r /var/lib/apt/lists/* &lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;sudo apt clean &amp;amp;&amp;amp; sudo apt update --fix-missing -y&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;sudo apt-get install poppler-utils tesseract-ocr -y&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;However, this solution will not work on a shared cluster.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;RCA:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Libraries installed via init scripts are not available for user-defined functions (UDFs) in Shared mode clusters, as the UDFs execute in a sandboxed execution environment. Python UDFs will not support dependencies installed via init scripts, as those are not available in the sandbox environment.&lt;/P&gt;
&lt;P&gt;An init script is needed for Poppler in your case because your code (for example, pdf2image) relies on Poppler’s command-line utilities, which are system packages rather than Python libraries.&lt;/P&gt;
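&lt;P&gt;To confirm whether the init script actually put Poppler on the PATH, a check along these lines can be run in a notebook cell. This is a minimal sketch: the helper names are illustrative, the tool list (pdfinfo and pdftoppm) is an assumption about which binaries pdf2image calls, and the check only covers the node where the cell runs (normally the driver).&lt;/P&gt;

```python
import shutil

# Poppler command-line tools assumed to be the ones pdf2image shells out to.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


def missing_tools(found):
    """Return the sorted names of the tools that were not found."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_poppler_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
    if missing_tools(found):
        print("Poppler looks incomplete on this node:", missing_tools(found))
```

&lt;P&gt;If pdfinfo is reported as missing, PDFInfoNotInstalledError is the expected result, and the init script output is the next place to look.&lt;/P&gt;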
&lt;P&gt;&lt;STRONG&gt;Solution:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;For a single user cluster: use the init script above.&lt;BR /&gt;For a shared cluster: consider alternative Python libraries that provide functionality similar to poppler-utils, such as pdfplumber and PyMuPDF.&lt;/P&gt;
&lt;P&gt;PyMuPDF: Screenshot 1&lt;BR /&gt;Pdfplumber: Screenshot 2&lt;/P&gt;
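&lt;P&gt;For the shared cluster route, here is a minimal sketch of rendering PDF pages to PNG files with PyMuPDF instead of pdf2image. It assumes the package is installed in the notebook environment (for example with %pip install pymupdf). The function names, the output naming scheme, and the default dpi are illustrative assumptions, and whether this runs inside a UDF on a shared cluster has not been verified here.&lt;/P&gt;

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir):
    """Build the PNG path for a 1-based page number (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_number:03d}.png"


def render_pdf_pages(pdf_path, out_dir, dpi=150):
    """Render every page of a PDF to a PNG file and return the written paths."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc, start=1):
            target = page_image_path(pdf_path, index, out)
            # get_pixmap rasterizes the page; no Poppler binaries are involved.
            page.get_pixmap(dpi=dpi).save(str(target))
            written.append(target)
    return written
```

&lt;P&gt;A call such as render_pdf_pages("/tmp/report.pdf", "/tmp/pages") would write one PNG per page; both paths are placeholders.&lt;/P&gt;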
&lt;P&gt;Hope this helps you!&lt;/P&gt;
</description>
      <pubDate>Wed, 22 Jan 2025 04:08:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-install-poppler-utils/m-p/106570#M42522</guid>
      <dc:creator>Raghavan93513</dc:creator>
      <dc:date>2025-01-22T04:08:56Z</dc:date>
    </item>
  </channel>
</rss>

