Hey,
Thinking of more alternatives to repartition:
1- Use the LIMIT and OFFSET options in your SQL queries to export the data in manageable chunks. For example, if you have a table with 100,000 rows and you want to export 10,000 rows at a time, you can use the following queries:
SELECT * FROM your_table LIMIT 10000 OFFSET 0;
SELECT * FROM your_table LIMIT 10000 OFFSET 10000;
SELECT * FROM your_table LIMIT 10000 OFFSET 20000;
...
(Adjust the LIMIT and OFFSET values based on the size of your data and the number of rows you want to export in each batch; a loop version is sketched below.)
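If you're running this from a notebook, a loop can generate the batches for you. A minimal sketch, assuming a Spark session, a SQL engine that supports OFFSET (Spark SQL 3.4+), and placeholder names like your_table; adding an ORDER BY on a unique key would keep the batches stable across queries:
batch_size = 10000
total_rows = spark.table("your_table").count()

for offset in range(0, total_rows, batch_size):
    # pull one manageable chunk of the table at a time
    batch_df = spark.sql(
        f"SELECT * FROM your_table LIMIT {batch_size} OFFSET {offset}"
    )
    # hand batch_df to step 2 below to write it out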
2- Save the results of each query to a separate CSV file. You can use the dbutils.fs module to save the results to DBFS (Databricks File System) and then download them to your local machine.
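For example, a rough sketch continuing the loop above (the paths are placeholders; Spark writes a directory of part files, so the snippet coalesces each batch to a single partition and then copies the one part file out with dbutils.fs):
out_dir = f"dbfs:/tmp/export/batch_{offset}"
batch_df.coalesce(1).write.mode("overwrite").option("header", True).csv(out_dir)

# copy the lone part file under /FileStore, which is downloadable at
# https://<workspace-url>/files/... or via the Databricks CLI
part = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part, f"dbfs:/FileStore/export/batch_{offset}.csv")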
3- My personal favorite is splitting the file. I use this when working with heap dumps, but once you have any large file you can split it into fixed-size parts:
split -b 100M large.tar.gz large.tar.gz.part
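Since split just cuts the file byte-wise, you can reassemble the parts later with:
cat large.tar.gz.part* > large.tar.gz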