Data Engineering

Can we store 300 million records and what is the preferable compute type and config?

AzureDatabricks
New Contributor III

How can we persist 300 million records? What is the best option to persist the data: the Databricks Hive metastore, Azure storage, or a Delta table?

What limitations do Databricks Delta tables have in terms of data volume?

We have a use case where testers should be able to see the data in a table and query it.


7 REPLIES

-werners-
Esteemed Contributor III

You can certainly store 300 million records without any problem.

The best option kinda depends on the use case.

If you want to do a lot of online querying on the table, I suggest using Delta Lake, which can be optimized (using Z-ordering, bloom filter indexes, partitioning and file pruning). With a Databricks SQL endpoint you can query the data.
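As a rough illustration of those optimizations, a minimal PySpark sketch could look like this (assuming a Databricks notebook where spark is predefined; df is the DataFrame to persist, and the table name events and the columns event_date and customer_id are placeholders):

(df.write
    .format("delta")
    .partitionBy("event_date")   # partition on a column queries commonly filter on
    .mode("overwrite")
    .saveAsTable("events"))

# Compact small files and co-locate data for better file pruning
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# Optional bloom filter index for point lookups (applies to files written afterwards)
spark.sql("CREATE BLOOMFILTER INDEX ON TABLE events FOR COLUMNS (customer_id)")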

If you want to use the data for data engineering (ETL jobs), I also suggest using delta lake so you can merge new/changed data incrementally.

You can use the same optimization techniques, but maybe using different columns (depending on which jobs read the table).
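For the incremental merge pattern, a minimal sketch with the Delta Lake Python API could look like this (events, id and updates_df are placeholders; updates_df is assumed to hold the new/changed records):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events")       # existing Delta table
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")   # match on the business key
    .whenMatchedUpdateAll()                        # update rows that changed
    .whenNotMatchedInsertAll()                     # insert new rows
    .execute())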

I do not know what the limits are concerning the amount of data. But billions of records should be no problem.

Of course everything depends on the cluster running your workload. A 4-node cluster will take longer to process this amount of data than a 20-node cluster.

So, if you can ingest your data incrementally, use Delta Lake; if you have to do a 300-million-record overwrite every day, plain Parquet is also OK.

Thank you for answering.

Can you please suggest the best cluster configuration to use for 300 million records, i.e. node type, cluster mode and Databricks runtime version?

-werners-
Esteemed Contributor III (Accepted Solution)

That I cannot do; there is no single ideal size/scenario.

However, the latest Databricks runtime is a good choice (10.0, or the latest LTS for production jobs).

For data jobs, the storage-optimized (delta cache accelerated) nodes are a good choice as they can use the delta cache.

For online querying: Databricks SQL.

I myself use the cheapest node type that handles the job, and that depends on which Spark program I run. So I use multiple cluster configurations.

I even run upsert jobs with a single worker on a table of over 300 million records; that works fine, depending on the amount of data that has to be rewritten.

It depends on the filters, transformations, etc. applied to these 300 million records.
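To make "multiple cluster configurations" concrete, a single-worker jobs cluster spec might look something like the sketch below (the runtime string and the Azure node type Standard_DS3_v2 are examples only, not a recommendation; pick whatever current, cheap node type handles your workload):

# Hypothetical cluster spec (e.g. the new_cluster block of a Databricks job);
# all values are illustrative examples, not recommendations.
single_worker_cluster = {
    "spark_version": "10.4.x-scala2.12",  # an LTS runtime string; check what is current
    "node_type_id": "Standard_DS3_v2",    # a small general-purpose Azure node type
    "num_workers": 1,                     # one worker can be enough for incremental upserts
}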

Hubert-Dudek
Esteemed Contributor III

As Delta is transactional, if there are no concurrent writes to the table, pure Parquet, correctly partitioned, may be more efficient than Delta for storing that volume.
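A minimal sketch of that plain-Parquet option (the path and the partition column are placeholders; df is the DataFrame to store):

(df.write
    .partitionBy("event_date")          # partition on a column your queries filter on
    .mode("overwrite")
    .parquet("/mnt/datalake/events"))   # plain Parquet files, no transaction log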

SailajaB
Valued Contributor III

Thank you for your reply!!

We have tested with 700,000+ (7+ lakh) records using different format options like saveAsTable, Parquet and Delta to persist the data in ADB (Azure Databricks), so that testers can use the data to run their scripts.

Time taken for each of the above formats is as follows:

saveAsTable -- 31.44 min

Parquet -- 38.78 min

Delta -- 33.66 min

Is there any way to write the processed dataframe to persistent storage so that testers can view it as a table and run SQL queries? (We are looking for an option where testers can modify the data, e.g. sample updates and inserts.)

Thank you

-werners-
Esteemed Contributor III

You can always write a DataFrame to persistent storage.

Just use df.write.parquet (or whatever format you choose).

You can also create a table or view on top of the Parquet files if necessary and run SQL queries on it (via Databricks notebooks or an ODBC connection).
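For example, a minimal sketch (the path /mnt/test/processed and the table name test_results are placeholders):

# Write the processed DataFrame as Parquet files in persistent storage
processed_df.write.mode("overwrite").parquet("/mnt/test/processed")

# Expose the files as a table so testers can query it with SQL
spark.sql("""
    CREATE TABLE IF NOT EXISTS test_results
    USING PARQUET
    LOCATION '/mnt/test/processed'
""")

Note that if the testers also need to change the data with SQL (UPDATE/DELETE), the table would need to be written in Delta format instead, since plain Parquet tables do not support those statements.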

thank you!!!
