Orianh
Valued Contributor II

Hey Guys,

While i was training i noticed two things that might cause the error.

The first one is after a training session was crashed, the GPU memory was almost full ( checked with nvidia smi command).

The second one is that i saw in gangila metrics a Swap above the total memory of the cluster.

In my use case i use make_reader from petastorm to read petastorm dataset and its default workers_count is 10, While i changed workers_count to 4 I didn't got any error.

I didn't figure out if I'm truly right and what the right way to overcome this,

Would like to hear you opnion,

Thanks!