Handling large-scale datasets efficiently is one of the biggest bottlenecks in modern machine learning workflows, especially when pre-training LLMs. As datasets grow in size and complexity, traditional methods like memmap arrays or vanilla PyTorch DataLoaders can struggle to keep up, particularly in distributed training environments. In this post, we will look at some of the key features of Mosaic MDS, see how to convert existing data into this format, and walk through a few toy and real-world examples to get familiar with its usage.
What is MDS?
The MosaicML team has put a lot of work into some solid documentation, so I’d recommend you give that a read here. Mosaic Data Streaming, along with its MDS shard format, is a data loading library that was introduced in 2023 with this announcement. The Mosaic data ecosystem is set up to make training on large datasets in distributed environments as painless as possible. Indeed, there are many common issues when dealing with large datasets in large environments, such as:
These are just some of the issues we have dealt with when pre-training large language models at scale, but I’m sure there are others I’m forgetting to mention.
Mosaic Data Streaming was set up to remedy the above issues with a codebase that closely mirrors the standard PyTorch DataLoader syntax, making it extra easy to use. The chart below illustrates what can happen when you change the number of GPUs mid-training run: with standard data loaders this can introduce nondeterminism, but with the Mosaic Streaming package the loss curve looks exactly the same regardless of the number of GPUs.
Figure: Taken from https://www.databricks.com/blog/mosaicml-streamingdataset
I won’t dwell too much on the benefits of the library here; there are plenty of resources out there that do so. For more information, check out the links provided above. For now, let's get into how to actually implement this code with some simple examples.
Converting a Single File to MDS
Let’s start out getting familiar with the MDS library by converting a single file to MDS format. We will use a code snippet taken from the MDS repo to start:
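Something along these lines works (a minimal sketch adapted from the MDSWriter example in the Streaming docs; the sample count, image size, and compression choice are just illustrative):

```python
import numpy as np
from PIL import Image
from streaming import MDSWriter

# Local directory where the MDS shards will be written
out_dir = 'MDS_Blog'

# Map each column name to an MDS encoding
columns = {
    'image': 'jpeg',
    'class': 'int',
}

# Optional shard compression
compression = 'zstd'

with MDSWriter(out=out_dir, columns=columns, compression=compression) as out:
    for _ in range(128):
        sample = {
            # A 32x32 random-noise image plus a random class label
            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)),
            'class': int(np.random.randint(10)),
        }
        out.write(sample)
```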
The above creates some simple random-noise images and saves them to a local directory called “MDS_Blog”. The output will look like this:
Go ahead and open up the index.json file. You'll see some metadata related to the files, including the data type expected in each column.
Now we have some simple data created in MDS format. This can be transferred to S3 and then used in a DataLoader.
Let’s do something a little more complex, though. Let’s get into some dummy data for language modeling. If you’ve never seen the Nano-GPT repo, I’d highly recommend you give it a read. In this repo, a transformer model is coded from scratch and then trained on pre-tokenized data.
Let’s start off small, using a file containing concatenated works from Shakespeare located here. We will download this data, tokenize it with an open-source tokenizer, and then save it to two files containing validation data and training data.
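Here is a sketch of that step, closely mirroring nanoGPT’s data/shakespeare/prepare.py and assuming the open-source GPT-2 BPE tokenizer from tiktoken:

```python
import numpy as np
import requests
import tiktoken

# Download the concatenated Shakespeare text used by nanoGPT
data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
text = requests.get(data_url).text

# Simple 90/10 split into training and validation text
n = len(text)
train_text, val_text = text[:int(n * 0.9)], text[int(n * 0.9):]

# Tokenize with the GPT-2 BPE tokenizer
enc = tiktoken.get_encoding('gpt2')
train_ids = np.array(enc.encode_ordinary(train_text), dtype=np.uint16)
val_ids = np.array(enc.encode_ordinary(val_text), dtype=np.uint16)

# Save the token ids as raw binary files, nanoGPT-style
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')
```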
Okay, so now you should see two files, train.bin and val.bin, located locally wherever the code above ran. These files are raw numpy binary dumps; in the NanoGPT training loop they are read back as numpy memmap arrays, so they can be loaded from local disk without eating up all your RAM, making them ideal for training models.
Now, let’s convert those train and val files we just made into MDS format for use with our DataLoader instances. We will use a slightly altered version of the code above:
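A sketch of that conversion, assuming we pack the token stream into fixed-length blocks (the 1024-token block size and the shakespeare_mds output directories are just illustrative) and store each block with the ndarray column encoding:

```python
import numpy as np
from streaming import MDSWriter

def bin_to_mds(bin_path: str, out_dir: str, block_size: int = 1024) -> None:
    """Convert a flat .bin file of uint16 token ids into MDS shards of fixed-length blocks."""
    # memmap keeps the source file on disk rather than loading it into RAM
    tokens = np.memmap(bin_path, dtype=np.uint16, mode='r')
    columns = {'tokens': 'ndarray'}

    with MDSWriter(out=out_dir, columns=columns, compression='zstd') as out:
        n_blocks = len(tokens) // block_size
        for i in range(n_blocks):
            block = np.asarray(tokens[i * block_size:(i + 1) * block_size])
            out.write({'tokens': block})

bin_to_mds('train.bin', 'shakespeare_mds/train')
bin_to_mds('val.bin', 'shakespeare_mds/val')
```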
You can run this function directly on the train.bin and val.bin files we just made, and you should see an output like:
Now you can easily use the MDS files in a train loop. Note that you could also use the .bin files directly in a train loop by defining a simple wrapper class that lets the DataLoaders read the binary files:
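A minimal sketch of such a wrapper (the BinTokenDataset name and the block size are just illustrative):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class BinTokenDataset(Dataset):
    """Expose a nanoGPT-style .bin token file as fixed-length (input, target) blocks."""

    def __init__(self, bin_path: str, block_size: int = 1024):
        # memmap keeps the file on disk instead of loading it all into RAM
        self.data = np.memmap(bin_path, dtype=np.uint16, mode='r')
        self.block_size = block_size

    def __len__(self):
        # One sample per non-overlapping block, leaving room for the shifted target
        return (len(self.data) - 1) // self.block_size

    def __getitem__(self, idx):
        start = idx * self.block_size
        chunk = np.asarray(self.data[start:start + self.block_size + 1]).astype(np.int64)
        x = torch.from_numpy(chunk[:-1])  # input tokens
        y = torch.from_numpy(chunk[1:])   # next-token targets
        return x, y
```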
The above code block could be used in conjunction with the below snippet:
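For example, something like the following, assuming the BinTokenDataset wrapper above and the shakespeare_mds output directories from the conversion step (batch sizes and worker counts are illustrative):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Option 1: wrap the .bin file directly
bin_train = BinTokenDataset('train.bin', block_size=1024)
bin_train_loader = DataLoader(bin_train, batch_size=12, shuffle=True, num_workers=4)

# Option 2: read the MDS shards with StreamingDataset
# (StreamingDataset handles shuffling itself, so the DataLoader does not shuffle)
mds_train = StreamingDataset(local='shakespeare_mds/train', shuffle=True, batch_size=12)

def collate_tokens(batch):
    # Cast the uint16 token ids up to int64 so PyTorch can consume them
    tokens = np.stack([sample['tokens'].astype(np.int64) for sample in batch])
    return torch.from_numpy(tokens)

mds_train_loader = DataLoader(mds_train, batch_size=12, num_workers=4, collate_fn=collate_tokens)
```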
in a training loop. So we are able to use Mosaic Composer easily with either the binary files directly or with the MDS data.
Now let’s upload our new Shakespeare MDS data to S3 with something like:
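A simple (if serial) boto3 sketch will do here; the bucket name and key prefix are placeholders:

```python
import os
import boto3

s3 = boto3.client('s3')
bucket = 'my-mds-bucket'        # placeholder bucket name
prefix = 'shakespeare_mds'      # placeholder key prefix
local_root = 'shakespeare_mds'

# Walk the local MDS directory and upload every shard plus the index.json files
for root, _, files in os.walk(local_root):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.join(prefix, os.path.relpath(local_path, local_root))
        s3.upload_file(local_path, bucket, key)
```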
You should see your data in S3 now:
Okay great! So we’ve now completed two toy examples:
Cut to the chase
Toy examples are great and all, but what do we do when it’s finally time to work with a big dataset? Let’s move up to a larger dataset: Skylion007/openwebtext on Hugging Face.
This dataset is also used in the NanoGPT repo, and we follow that here so you can continue on to train an at-scale replica of GPT-2. OpenWebText is a close-to-real-world replication of the dataset used to train GPT-2, since the actual data was never open-sourced by “Open”AI. Anyways, let’s tokenize this in the same way we tokenized the smaller dataset; I’ll just point you to this file. Once you’ve run that, you should have locally stored train.bin and val.bin files that are much larger than the previous ones. This presents us with a new challenge: how can we convert this into MDS in a reasonable amount of time and transfer it to S3?
Indeed, if you attempt to run the conversion as a single linear loop, you may just live to see the first proton decay. Clearly, we will need to parallelize this to some extent, both during conversion to MDS and during upload to S3.
NOTE:
You can get some cheat codes from several of the dataset conversion scripts in the Mosaic repo; for example, here is the C4 dataset conversion. These examples differ from our situation in that they convert un-tokenized text into MDS format, which means you would need to tokenize the data during loading, or at least prior to moving it into your model. Notice how they use PyTorch parallelization in conjunction with the Hugging Face datasets class. Also note that the Hugging Face dataset class uses the streaming option; this is a bit of a double-edged sword, as it reduces the RAM/disk storage requirements but is, in our experience, much slower than the non-streaming option. The non-streaming option creates large cache files (Apache Arrow based) that can take up to 8x the storage size of the base dataset, so be careful (it’s gotten us before).
Okay, but let’s focus on the task at hand: converting our pre-tokenized binary files into MDS format in a time- and compute-efficient manner. We will use some parallel processing here:
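Here is one possible shape for the driver, a sketch that splits the token array into contiguous chunks and fans them out over a multiprocessing pool (BLOCK_SIZE, NUM_WORKERS, and the directory naming are illustrative; process_chunk is defined in the next snippet):

```python
import numpy as np
from multiprocessing import Pool

BLOCK_SIZE = 1024   # tokens per sample
NUM_WORKERS = 16    # tune to your machine

def convert_parallel(bin_path: str, out_root: str) -> None:
    """Split the flat token array into contiguous chunks and convert them to MDS in parallel."""
    tokens = np.memmap(bin_path, dtype=np.uint16, mode='r')
    n_blocks = len(tokens) // BLOCK_SIZE
    blocks_per_worker = (n_blocks + NUM_WORKERS - 1) // NUM_WORKERS

    # Each job is (source file, first block, last block, output subdirectory)
    jobs = []
    for w in range(NUM_WORKERS):
        lo = w * blocks_per_worker
        hi = min((w + 1) * blocks_per_worker, n_blocks)
        if lo >= hi:
            break
        jobs.append((bin_path, lo, hi, f'{out_root}/group_{w:02d}'))

    # process_chunk (defined below) writes each chunk to its own MDS subdirectory
    with Pool(processes=NUM_WORKERS) as pool:
        pool.starmap(process_chunk, jobs)
```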
With a process chunk function:
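And a matching sketch of the worker, continuing the snippet above (it reuses BLOCK_SIZE and convert_parallel from there; the openwebtext_mds output paths are illustrative):

```python
import numpy as np
from streaming import MDSWriter

def process_chunk(bin_path: str, lo: int, hi: int, out_dir: str) -> None:
    """Write blocks [lo, hi) of the token array into their own MDS subdirectory."""
    tokens = np.memmap(bin_path, dtype=np.uint16, mode='r')
    columns = {'tokens': 'ndarray'}

    with MDSWriter(out=out_dir, columns=columns, compression='zstd') as out:
        for b in range(lo, hi):
            start = b * BLOCK_SIZE
            block = np.asarray(tokens[start:start + BLOCK_SIZE])
            out.write({'tokens': block})

if __name__ == '__main__':
    convert_parallel('train.bin', 'openwebtext_mds/train')
    convert_parallel('val.bin', 'openwebtext_mds/val')
```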
A brief explanation of what’s going on here: the driver memory-maps the .bin file, splits the token array into contiguous chunks of fixed-size blocks, and hands each chunk to a worker process via a multiprocessing pool; each worker then writes its chunk into its own MDS subdirectory, complete with its own index.json and shard files.
After we complete this step, you will have a lot of subdirectories, each containing an index.json file and a bunch of MDS shards. The next crucial step in managing all this is to create a master index.json file. We can do this simply by running
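Assuming a recent version of mosaicml-streaming, which ships a merge_index utility, something like:

```python
from streaming.base.util import merge_index

# Merge the per-subdirectory index.json files into one top-level index.json
merge_index('openwebtext_mds/train', keep_local=True)
merge_index('openwebtext_mds/val', keep_local=True)
```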
This important command takes the index.json files from each subdirectory and merges them into one master index.json file, so your final directory structure will look like:
Figure: Example directory structure with merged index.json files
Okay! Now you have workable data in MDS format, and your final task is to push this to S3. The function that pushes to S3 should also use multiprocessing for the transfer, but it’s fairly straightforward, so I’ll save space and let you take a stab at it yourself.
Conclusion
You’ve now seen how to move data back and forth between MDS format and S3. MDS supports other major cloud providers as well, so if you use a provider other than AWS, it’s no problem. The MDS library provides many strong benefits for training large language models at scale, as our research team can attest to firsthand. Whether you are training on a few GPUs or hundreds, the library can support your tech stack, along with many different data types. If you’re a multimodal crew, MDS supports easily mixing datasets together as well. We’ve only just scratched the surface of what the MDS library has to offer; hopefully this serves as a useful primer to get started. If you would like clarification on any of the code used (or would like some of the helper functions we left out for brevity), don’t hesitate to reach out to our team for help. Until next time!
About Aimpoint Digital
Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Learn more by visiting: https://www.aimpointdigital.com/