11-21-2021 11:26 PM
You can certainly store 300 million records without any problem.
The best option depends on your use case.
If you want to do a lot of online querying on the table, I suggest using Delta Lake, which can be optimized for that (using Z-ordering, Bloom filter indexes, partitioning, and file pruning). You can then query the data through a Databricks SQL endpoint.
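For reference, here is a minimal PySpark sketch of writing such a table and Z-ordering it; the table name, partition column, and filter column (analytics.events, event_date, customer_id) are hypothetical placeholders:

```python
# Minimal sketch (PySpark on Databricks); table/column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the 300M-record DataFrame.
df = (spark.range(1_000_000)
      .withColumn("customer_id", (F.col("id") % 10_000).cast("long"))
      .withColumn("event_date", F.current_date()))

# Write as a Delta table, partitioned by date.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("event_date")
   .saveAsTable("analytics.events"))

# Co-locate data on a frequently filtered column so Z-ordering and
# file pruning can skip irrelevant files at query time.
spark.sql("OPTIMIZE analytics.events ZORDER BY (customer_id)")
```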
If you want to use the data for data engineering (ETL jobs), I also suggest Delta Lake, so you can merge new/changed data incrementally.
You can use the same optimization techniques, but possibly on different columns (depending on which jobs read the table).
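As an illustration of the incremental merge, here is a small sketch using Delta Lake's Python MERGE API; it assumes the hypothetical analytics.events table from the example above and uses stand-in data for the incoming batch:

```python
# Minimal sketch of an incremental upsert with Delta Lake MERGE.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# New/changed records arriving in this batch (stand-in data).
updates = (spark.range(10_000)
           .withColumn("customer_id", (F.col("id") % 10_000).cast("long"))
           .withColumn("event_date", F.current_date()))

target = DeltaTable.forName(spark, "analytics.events")

# Update matching rows, insert the rest -- only the changed data is rewritten.
(target.alias("t")
 .merge(updates.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```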
I do not know the exact limits on data volume, but billions of records should be no problem.
Of course everything depends on the cluster running your workload. A 4-node cluster will take longer to process this amount of data than a 20-node cluster.
So, if you can ingest your data incrementally, use Delta Lake; if you have to do a 300-million-record overwrite every day, plain Parquet is also OK.
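For the full-overwrite case, a minimal sketch might look like this; the output paths are hypothetical, and the Delta variant is shown only to note that the same overwrite also works there (with table history as a bonus):

```python
# Minimal sketch of a daily full overwrite; output paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the full 300M-record daily extract.
daily_full = (spark.range(1_000_000)
              .withColumn("event_date", F.current_date()))

# With a full overwrite every day, plain Parquet is fine.
daily_full.write.mode("overwrite").parquet("/mnt/raw/events_parquet")

# The same overwrite on Delta also works and keeps table history.
(daily_full.write
 .format("delta")
 .mode("overwrite")
 .save("/mnt/raw/events_delta"))
```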