<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to optimize storage for sparse data in data lake? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17877#M11800</link>
    <description>&lt;P&gt;The data lake itself does not, but the file format you use to store the data does.&lt;/P&gt;&lt;P&gt;For example, Parquet uses per-column compression, so sparse data compresses very well.&lt;/P&gt;&lt;P&gt;CSV, on the other hand, is a total disaster.&lt;/P&gt;</description>
    <pubDate>Thu, 08 Dec 2022 14:17:23 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-12-08T14:17:23Z</dc:date>
    <item>
      <title>How to optimize storage for sparse data in data lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17876#M11799</link>
      <description>&lt;P&gt;I have a lot of tables in which about 80% of the columns are filled with nulls. I understand that SQL Server provides a way to handle this kind of data in the table definition (the SPARSE keyword). Do data lakes provide anything similar?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Dec 2022 13:08:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17876#M11799</guid>
      <dc:creator>DB_developer</dc:creator>
      <dc:date>2022-12-08T13:08:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize storage for sparse data in data lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17877#M11800</link>
      <description>&lt;P&gt;The data lake itself does not, but the file format you use to store the data does.&lt;/P&gt;&lt;P&gt;For example, Parquet uses per-column compression, so sparse data compresses very well.&lt;/P&gt;&lt;P&gt;CSV, on the other hand, is a total disaster.&lt;/P&gt;</description>
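      <content:encoded>&lt;P&gt;A minimal PySpark sketch of that point (the paths, row count and column count here are made up for illustration): write the same roughly-80%-null data once as Parquet and once as CSV, then compare the output directory sizes, e.g. with dbutils.fs.ls.&lt;/P&gt;&lt;PRE&gt;
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative sparse DataFrame: about 80% of the values in the wide
# columns are NULL (when() without otherwise() yields NULL).
df = spark.range(1_000_000).select(
    "id",
    *[F.when(F.rand() &gt; 0.8, F.col("id")).alias(f"c{i}") for i in range(20)]
)

# Parquet is columnar, so the long NULL runs in each column encode
# down to almost nothing.
df.write.mode("overwrite").parquet("/tmp/sparse_parquet")

# CSV is row-oriented plain text: every NULL still costs a field
# (an empty string between delimiters) on every row.
df.write.mode("overwrite").option("header", True).csv("/tmp/sparse_csv")
&lt;/PRE&gt;</content:encoded>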
      <pubDate>Thu, 08 Dec 2022 14:17:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17877#M11800</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-08T14:17:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize storage for sparse data in data lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17878#M11801</link>
      <description>&lt;P&gt;Unless you compress the entire CSV, which should also be a viable approach.&lt;/P&gt;&lt;P&gt;That said, Delta/Parquet would normally be the better option, as each column is compressed individually.&lt;/P&gt;</description>
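      <content:encoded>&lt;P&gt;As a hedged follow-up sketch (same illustrative DataFrame and made-up paths as in the previous reply): gzip the entire CSV output versus write Delta, which stores Parquet underneath and therefore compresses each column independently. One caveat with whole-file compression: a gzipped CSV part file is not splittable, so it cannot be read in parallel.&lt;/P&gt;&lt;PRE&gt;
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same illustrative ~80%-null DataFrame as in the earlier sketch.
df = spark.range(1_000_000).select(
    "id",
    *[F.when(F.rand() &gt; 0.8, F.col("id")).alias(f"c{i}") for i in range(20)]
)

# Whole-file compression of the CSV output: smaller files, but each
# gzipped part file is unsplittable and queries still scan all columns.
df.write.mode("overwrite").option("compression", "gzip").csv("/tmp/sparse_csv_gz")

# Delta stores data as Parquet, so each column is compressed on its
# own and readers can skip the columns they do not need.
df.write.mode("overwrite").format("delta").save("/tmp/sparse_delta")
&lt;/PRE&gt;</content:encoded>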
      <pubDate>Mon, 12 Dec 2022 08:15:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-storage-for-sparse-data-in-data-lake/m-p/17878#M11801</guid>
      <dc:creator>Håkon_Åmdal</dc:creator>
      <dc:date>2022-12-12T08:15:26Z</dc:date>
    </item>
  </channel>
</rss>

