11-21-2021 11:18 PM
How can we persist 300 million records? What is the best option for persisting the data: the Databricks Hive metastore, Azure storage, or a Delta table?
What limitations do Databricks Delta tables have in terms of data volume?
Our use case is that testers should be able to see the data as a table and query it.
11-21-2021 11:26 PM
You can certainly store 300 million records without any problem.
The best option kinda depends on the use case.
If you want to do a lot of online querying on the table, I suggest using Delta Lake, which can be optimized (using Z-ordering, bloom filters, partitioning and file pruning). With a Databricks SQL endpoint you can query the data.
If you want to use the data for data engineering (ETL jobs), I also suggest using delta lake so you can merge new/changed data incrementally.
You can use the same optimization techniques, but maybe using different columns (depending on which jobs read the table).
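For example, a rough sketch of those layout optimizations in a Databricks notebook (the table and column names are just placeholders for your own schema):

```python
# Run in a Databricks notebook, where `spark` is already defined.
# Compact small files and co-locate rows on the columns you filter on most,
# so that file pruning can skip most of the table at query time.
spark.sql("OPTIMIZE events ZORDER BY (customer_id, event_date)")

# A typical online query that benefits from the layout above:
spark.sql("""
    SELECT count(*)
    FROM events
    WHERE event_date = '2021-11-01'
      AND customer_id = 42
""").show()
```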
I do not know what the limits are concerning the amount of data. But billions of records should be no problem.
Of course everything depends on the cluster running your workload. A 4-node cluster will take longer to process this amount of data than a 20-node cluster.
So: if you can ingest your data incrementally, use Delta Lake; if you have to do a 300-million-record overwrite every day, plain Parquet is also OK.
11-21-2021 11:41 PM
Thank you for answering.
Can you please suggest the best cluster configuration for 300 million records,
i.e. node type, cluster mode and Databricks runtime version?
11-21-2021 11:54 PM
That I cannot do; there is no single ideal size/scenario.
However: the latest Databricks runtime is a good choice (10.0, or the latest LTS for production jobs).
For data (ETL) jobs, the storage optimized (Delta cache accelerated) nodes are a good choice as they can use the Delta cache.
For online querying: Databricks SQL.
I myself use the cheapest node type that handles the job, and that depends on which Spark program I run, so I use multiple cluster configurations.
I even run upsert jobs with a single worker on a table of over 300 million records; that works fine, depending on the amount of data that has to be rewritten.
It all depends on the filters, transformations etc. applied to those 300 million records.
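As an illustration, a minimal upsert sketch (the table and key names are placeholders, not from the original post); a merge only rewrites the files that contain matched rows, which is why a small cluster can cope:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events")   # existing Delta table with ~300M rows
updates = spark.table("events_staging")        # incremental batch of new/changed rows

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")  # `id` is an assumed business key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```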
11-22-2021 02:28 AM
As Delta is transactional, if there are no concurrent writes to the table, pure Parquet, correctly partitioned, may be more efficient than Delta for storing that volume.
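A quick sketch of that pattern (the path and partition column are hypothetical, and `df` stands for the DataFrame produced by your pipeline):

```python
# Full overwrite as plain Parquet, partitioned so downstream reads can prune files.
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .parquet("abfss://container@account.dfs.core.windows.net/tables/events"))
```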
11-22-2021 03:46 AM
Thank you for your reply!!
We have tested with 700,000+ (7+ lakh) records using different write options — saveAsTable, Parquet and Delta — to persist the data in Azure Databricks (ADB), so that testers can use the data to run their scripts.
Time taken to write in each of the above formats:
saveAsTable -- 31.44 min
Parquet -- 38.78 min
Delta -- 33.66 min
Is there any way to write the processed dataframe to persistent storage so that testers can view it as a table and run SQL queries? (We are looking for something where testers can also modify the data, e.g. sample updates and inserts.)
Thank you
11-22-2021 04:47 AM
You can always write a DataFrame to persistent storage.
Just use df.write.parquet (or whatever format you choose).
You can create a table (or view) on top of the Parquet files too if necessary and run SQL queries on it (from Databricks notebooks or over an ODBC connection).
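A minimal sketch of both options (paths, table and column names are hypothetical, and `df` is your processed DataFrame):

```python
path = "abfss://container@account.dfs.core.windows.net/test/events"

# Option 1: plain Parquet plus an external table for read-only SQL access.
df.write.mode("overwrite").parquet(path)
spark.sql(f"CREATE TABLE IF NOT EXISTS test_events USING PARQUET LOCATION '{path}'")

# Option 2: a managed Delta table, which also lets testers run updates/inserts.
df.write.format("delta").mode("overwrite").saveAsTable("test_events_delta")
spark.sql("UPDATE test_events_delta SET status = 'checked' WHERE id = 1")  # hypothetical columns
```

Note that UPDATE statements are not supported on plain Parquet tables; if the testers need to modify rows, the Delta variant is the easier route.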
11-22-2021 07:48 PM
thank you!!!