<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Not able to read the file content completely using head in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70234#M34033</link>
    <description>&lt;P&gt;For data of that size, using Spark might be a good idea (although pure Python would probably still work if the files are reasonably sized; 500 MB might still work).&lt;BR /&gt;The number of workers you need depends on whether you use Spark or pure Python. Pure Python code runs on the driver, so the number of workers is irrelevant.&lt;BR /&gt;Spark, however, runs one task per input partition (often roughly one per file), and each task uses a CPU core.&lt;BR /&gt;Here is a &lt;A href="https://medium.com/@swethamurali03/apache-spark-executors-cba87f3de78d" target="_self"&gt;blog that gives you an idea of how it works&lt;/A&gt;.&lt;/P&gt;</description>
    <pubDate>Wed, 22 May 2024 10:01:17 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2024-05-22T10:01:17Z</dc:date>
    <item>
      <title>Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69221#M33864</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We want to read the full content of a file and encode it as base64. We used the code below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;import base64&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;file_path = &lt;/SPAN&gt;&lt;SPAN class=""&gt;"/path/to/your/file.csv"&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;file_content = dbutils.fs.head(file_path, 512000000)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;encode_content = base64.b64encode(file_content.encode()).decode()&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;print(encode_content)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;The file has 1700 records, but with head we get only 232 records.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;With the code above, the content beyond a certain number of bytes is skipped, so we cannot read and encode the full data. Could you please suggest a solution?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 07:15:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69221#M33864</guid>
      <dc:creator>saichandu_25</dc:creator>
      <dc:date>2024-05-17T07:15:54Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69224#M33867</link>
      <description>&lt;P&gt;The head function only returns the first part of a file; that is what it does.&amp;nbsp; The maxBytes value you can pass has an upper limit of 64 KB (&lt;STRONG&gt;head(file: java.lang.String, maxBytes: int = 65536): java.lang.String&lt;/STRONG&gt;).&lt;BR /&gt;To read the whole file, you can use Spark (spark.read.csv), plain Python (using pandas or with open(&amp;lt;file&amp;gt;)), or Scala (using scala.io.Source).&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 07:37:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69224#M33867</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-05-17T07:37:12Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69225#M33868</link>
      <description>&lt;P&gt;Thanks for the update. Actually, we want to read multiple file formats, and we want to read the file content irrespective of the file format; that is why we used head.&lt;BR /&gt;&lt;BR /&gt;"with open" is not working in the notebook. How can we make that work?&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 07:43:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69225#M33868</guid>
      <dc:creator>saichandu_25</dc:creator>
      <dc:date>2024-05-17T07:43:20Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69226#M33869</link>
      <description>&lt;P&gt;That is a built-in Python function, so it should work in a Python notebook. You can also use pandas, by the way.&lt;BR /&gt;If you use a Scala notebook, you should use a Scala/Java library.&lt;BR /&gt;For SQL notebooks: use Python/Scala &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 07:45:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69226#M33869</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-05-17T07:45:29Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69236#M33871</link>
      <description>&lt;P&gt;If we use the code below, it throws an error saying that file_path is not correct:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;file_path = &lt;/SPAN&gt;&lt;SPAN class=""&gt;"/dbfs/path/to/your/file.csv"&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;with&lt;/SPAN&gt; &lt;SPAN class=""&gt;open&lt;/SPAN&gt;&lt;SPAN&gt;(file_path, &lt;/SPAN&gt;&lt;SPAN class=""&gt;'rb'&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN class=""&gt;as&lt;/SPAN&gt;&lt;SPAN&gt; f:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;content = f.read()&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 08:00:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69236#M33871</guid>
      <dc:creator>saichandu_25</dc:creator>
      <dc:date>2024-05-17T08:00:57Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69240#M33872</link>
      <description>&lt;P&gt;You can use Unity Catalog Volumes instead of DBFS:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/connect/unity-catalog/volumes.html#what-path-is-used-for-accessing-files-in-a-volume" target="_blank"&gt;https://docs.databricks.com/en/connect/unity-catalog/volumes.html#what-path-is-used-for-accessing-files-in-a-volume&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2024 08:12:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/69240#M33872</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-05-17T08:12:15Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70165#M34025</link>
      <description>&lt;P&gt;Hi, how can we read 500 MB or 1 GB files using the "with open" method in a Databricks notebook?&lt;/P&gt;&lt;P&gt;Also, if we need to read files that are gigabytes in size, how many worker nodes are needed?&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 17:58:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70165#M34025</guid>
      <dc:creator>saichandu_25</dc:creator>
      <dc:date>2024-05-21T17:58:39Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70234#M34033</link>
      <description>&lt;P&gt;For data of that size, using Spark might be a good idea (although pure Python would probably still work if the files are reasonably sized; 500 MB might still work).&lt;BR /&gt;The number of workers you need depends on whether you use Spark or pure Python. Pure Python code runs on the driver, so the number of workers is irrelevant.&lt;BR /&gt;Spark, however, runs one task per input partition (often roughly one per file), and each task uses a CPU core.&lt;BR /&gt;Here is a &lt;A href="https://medium.com/@swethamurali03/apache-spark-executors-cba87f3de78d" target="_self"&gt;blog that gives you an idea of how it works&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 22 May 2024 10:01:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70234#M34033</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-05-22T10:01:17Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70439#M34065</link>
      <description>&lt;P&gt;Actually, we want to read the files irrespective of their format and push them to GitHub. That is why we are going with the "with open" method, but when we use it, the files do not come out correctly after being copied to GitHub. We need a single solution for reading large files.&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2024 11:25:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70439#M34065</guid>
      <dc:creator>saichandu_25</dc:creator>
      <dc:date>2024-05-23T11:25:52Z</dc:date>
    </item>
    <item>
      <title>Re: Not able to read the file content completely using head</title>
      <link>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70501#M34071</link>
      <description>&lt;P&gt;I am curious what the use case is for loading large files into GitHub, which is a code repository.&lt;BR /&gt;Different file formats need different parsing.&amp;nbsp; You could build logic for that into your program.&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2024 14:05:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/not-able-to-read-the-file-content-completely-using-head/m-p/70501#M34071</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-05-23T14:05:15Z</dc:date>
    </item>
  </channel>
</rss>

