Re: Can we store 300 million records and what is t...

-werners- · ‎11-21-2021

That I cannot do: there is no single ideal size or scenario.

However: the latest Databricks runtime version is a good choice (10.0, or the latest LTS for production jobs).

For data jobs, the storage-optimized nodes are a good choice, as they can use the Delta cache.

For online querying: Databricks SQL.

I myself use the cheapest node type that handles the job, and that depends on which Spark program I run. So I use multiple cluster configurations.
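To make the "multiple cluster configurations" idea concrete, here is a minimal sketch of keeping one small profile per job type and turning it into a cluster spec. The field names `spark_version`, `node_type_id` and `num_workers` follow the Databricks Clusters API; the profile names, node types, worker counts and the default runtime string are assumptions for illustration only, not recommendations.

```python
# Hypothetical job profiles: names, node types and sizes are illustrative
# assumptions. Pick the cheapest node type that still handles each job.
CLUSTER_PROFILES = {
    "small_upsert": {"node_type_id": "i3.xlarge", "num_workers": 1},
    "heavy_transform": {"node_type_id": "i3.2xlarge", "num_workers": 8},
}


def cluster_spec(profile: str, spark_version: str = "9.1.x-scala2.12") -> dict:
    """Return a cluster spec dict for the given job profile.

    The default spark_version is an assumed LTS runtime string; replace it
    with whatever runtime your workspace actually offers.
    """
    base = CLUSTER_PROFILES[profile]
    return {"spark_version": spark_version, **base}


spec = cluster_spec("small_upsert")
```

The resulting dict could be used as the `new_cluster` part of a job definition, so each job gets its own sizing instead of sharing one large cluster.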

I even run upsert jobs with a single worker on a table of over 300 million records. That works fine, depending on the amount of data that has to be rewritten.

It also depends on the filters, transformations, etc. applied to these 300 million records.
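As a rough illustration of such an upsert, the sketch below builds a Delta Lake `MERGE INTO` statement and lets you add an extra predicate on the target table, which is one way to limit how much data has to be scanned and rewritten. The helper `build_merge_sql` and the table and column names (`events`, `updates`, `id`, `event_date`) are hypothetical; this is not taken from my actual jobs.

```python
from typing import Optional, Sequence


def build_merge_sql(
    target: str,
    source: str,
    keys: Sequence[str],
    extra_predicate: Optional[str] = None,
) -> str:
    """Build a Delta Lake MERGE (upsert) statement as a string.

    extra_predicate is an optional filter on the target (alias t), for
    example a partition column range, to narrow the files touched.
    """
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    if extra_predicate:
        on = f"{on} AND {extra_predicate}"
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and column names, for illustration only.
sql = build_merge_sql("events", "updates", ["id"], "t.event_date >= '2021-11-01'")
```

In a Databricks notebook the statement could then be run with `spark.sql(sql)`. How well a single worker copes still depends on how many files the merge ends up rewriting.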
