<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to optimize storage for sparse data in data lake? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17877#M11800</link>
    <description>&lt;P&gt;The data lake itself does not, but the file format you use to store the data does.&lt;/P&gt;&lt;P&gt;For example, Parquet uses per-column compression, so sparse data compresses very well.&lt;/P&gt;&lt;P&gt;CSV, on the other hand, is a total disaster.&lt;/P&gt;</description>
    <pubDate>Thu, 08 Dec 2022 14:17:23 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-12-08T14:17:23Z</dc:date>
    <item>
      <title>How to optimize storage for sparse data in data lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17876#M11799</link>
      <description>&lt;P&gt;I have a lot of tables in which about 80% of the columns are filled with nulls. I understand that SQL Server provides a way to handle this kind of data in the table definition (the SPARSE keyword). Do data lakes provide anything similar?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Dec 2022 13:08:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17876#M11799</guid>
      <dc:creator>DB_developer</dc:creator>
      <dc:date>2022-12-08T13:08:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize storage for sparse data in data lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17877#M11800</link>
      <description>&lt;P&gt;The data lake itself does not, but the file format you use to store the data does.&lt;/P&gt;&lt;P&gt;For example, Parquet uses per-column compression, so sparse data compresses very well.&lt;/P&gt;&lt;P&gt;CSV, on the other hand, is a total disaster.&lt;/P&gt;</description>
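      <content:encoded>&lt;P&gt;A minimal PySpark sketch of that point (the paths, row count and column count here are made up for illustration): write the same roughly-80%-null data once as Parquet and once as CSV, then compare the output directory sizes, e.g. with dbutils.fs.ls.&lt;/P&gt;&lt;PRE&gt;
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative sparse DataFrame: about 80% of the values in the wide
# columns are NULL (when() without otherwise() yields NULL).
df = spark.range(1_000_000).select(
    "id",
    *[F.when(F.rand() &gt; 0.8, F.col("id")).alias(f"c{i}") for i in range(20)]
)

# Parquet is columnar, so the long NULL runs in each column encode
# down to almost nothing.
df.write.mode("overwrite").parquet("/tmp/sparse_parquet")

# CSV is row-oriented plain text: every NULL still costs a field
# (an empty string between delimiters) on every row.
df.write.mode("overwrite").option("header", True).csv("/tmp/sparse_csv")
&lt;/PRE&gt;</content:encoded>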
      <pubDate>Thu, 08 Dec 2022 14:17:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17877#M11800</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-08T14:17:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize storage for sparse data in data lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17878#M11801</link>
      <description>&lt;P&gt;Unless you compress the entire CSV, which should also be a viable approach.&lt;/P&gt;&lt;P&gt;That said, Delta/Parquet would normally be the better option, as each column is compressed individually.&lt;/P&gt;</description>
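      <content:encoded>&lt;P&gt;As a hedged follow-up sketch (same illustrative DataFrame and made-up paths as in the previous reply): gzip the entire CSV output versus write Delta, which stores Parquet underneath and therefore compresses each column independently. One caveat with whole-file compression: a gzipped CSV part file is not splittable, so it cannot be read in parallel.&lt;/P&gt;&lt;PRE&gt;
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same illustrative ~80%-null DataFrame as in the earlier sketch.
df = spark.range(1_000_000).select(
    "id",
    *[F.when(F.rand() &gt; 0.8, F.col("id")).alias(f"c{i}") for i in range(20)]
)

# Whole-file compression of the CSV output: smaller files, but each
# gzipped part file is unsplittable and queries still scan all columns.
df.write.mode("overwrite").option("compression", "gzip").csv("/tmp/sparse_csv_gz")

# Delta stores data as Parquet, so each column is compressed on its
# own and readers can skip the columns they do not need.
df.write.mode("overwrite").format("delta").save("/tmp/sparse_delta")
&lt;/PRE&gt;</content:encoded>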
      <pubDate>Mon, 12 Dec 2022 08:15:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17878#M11801</guid>
      <dc:creator>Håkon_Åmdal</dc:creator>
      <dc:date>2022-12-12T08:15:26Z</dc:date>
    </item>
  </channel>
</rss>

