
You can certainly store 300 million records without any problem.

The best option depends on the use case.

If you want to do a lot of online querying on the table, I suggest using Delta Lake, which can be optimized (using Z-ordering, bloom filters, partitioning and file pruning). With a Databricks SQL endpoint you can query the data.
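As a rough illustration, a minimal PySpark sketch of writing such a table and clustering it for querying; the table name `analytics.events`, the source path, and the columns `event_date` / `customer_id` are placeholders, not anything from your setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write the records as a Delta table, partitioned by a low-cardinality column.
(spark.read.parquet("/mnt/raw/events")          # hypothetical source path
      .write.format("delta")
      .partitionBy("event_date")                # hypothetical partition column
      .mode("overwrite")
      .saveAsTable("analytics.events"))

# Cluster the data files on the columns you filter on most, so file pruning kicks in.
spark.sql("OPTIMIZE analytics.events ZORDER BY (customer_id)")
```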

If you want to use the data for data engineering (ETL jobs), I also suggest using Delta Lake, so you can merge new/changed data incrementally.

You can use the same optimization techniques, but possibly on different columns (depending on which jobs read the table).
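A hedged sketch of what such an incremental MERGE could look like; the staging path, the `analytics.events` table and the join key `id` are assumptions for illustration:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical daily batch of new/changed rows.
updates = spark.read.parquet("/mnt/staging/daily_batch")

target = DeltaTable.forName(spark, "analytics.events")

(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")   # match on the business key
       .whenMatchedUpdateAll()                      # update changed rows
       .whenNotMatchedInsertAll()                   # insert new rows
       .execute())
```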

I do not know the exact limits on data volume, but billions of records should be no problem.

Of course, everything depends on the cluster running your workload: a 4-node cluster will take longer to process this amount of data than a 20-node cluster.

So: if you can ingest your data incrementally, use Delta Lake; if you have to do a full 300-million-record overwrite every day, plain Parquet is also OK.
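For the full-overwrite case, something like this is enough (again a sketch; the paths and the `event_date` column are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical full daily export that replaces yesterday's snapshot.
daily_full = spark.read.parquet("/mnt/raw/full_export")

(daily_full.write
           .mode("overwrite")                 # rewrite the whole dataset
           .partitionBy("event_date")         # still worth partitioning for readers
           .parquet("/mnt/curated/events_parquet"))
```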