ResourceExhaustedError: OOM

Marvin93 · Jul-15-2020, 05:21 PM

Hello everyone,

I am training a very small model many times in a row. I know this is probably not what most people do, but I want to try a lot of different parameter combinations, so I use a few nested for loops and train the model over and over again. I log to TensorBoard and collect the results in a dictionary, which I then write to a CSV file.

Now I am getting this error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1000,1000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training_492/SGD/gradients/dense_2194/MatMul_grad/MatMul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

What I am wondering is how this can happen, since the model is very small. I don't think the dictionary is getting too big, so I suppose every trained model is still kept in memory, even though I overwrite the model variable in every loop iteration. Am I understanding that right, or can anyone tell me the reason? Is there a smart solution to the problem?

Otherwise I will just split the program into a bunch of smaller programs that each train far fewer models.
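A minimal sketch of such a parameter loop, assuming a hypothetical `train_fn` that builds, trains and evaluates one model and returns its metrics as a dict (the same keys every run). `itertools.product` stands in for the nested for loops, and each run becomes one CSV row. The optional `reset_fn` hook runs before every training run; with Keras, passing `tf.keras.backend.clear_session` there is a commonly suggested way to drop previously built models. The growing counters in the node names above (`training_492`, `dense_2194`) hint that earlier models are still part of the same graph. The names `run_grid`, `train_fn`, `reset_fn` and `fake_train` are illustrative only.

```python
import csv
import io
import itertools
from typing import Callable, Dict, List, Optional, TextIO


def run_grid(
    param_grid: Dict[str, List],
    train_fn: Callable[..., Dict[str, float]],
    out_file: TextIO,
    reset_fn: Optional[Callable[[], None]] = None,
) -> int:
    """Train once per parameter combination and write one CSV row per run.

    Assumes train_fn returns the same metric keys on every run.
    """
    names = list(param_grid)
    writer = None
    runs = 0
    # itertools.product replaces the nested for loops: one tuple per combination.
    for values in itertools.product(*(param_grid[n] for n in names)):
        if reset_fn is not None:
            reset_fn()  # e.g. tf.keras.backend.clear_session (assumption)
        params = dict(zip(names, values))
        metrics = train_fn(**params)  # hypothetical: build, train, evaluate
        row = {**params, **metrics}
        if writer is None:
            # Take the CSV header from the first row's keys.
            writer = csv.DictWriter(out_file, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        runs += 1
    return runs


if __name__ == "__main__":
    def fake_train(units, lr):
        # Stand-in for a real Keras run; returns a made-up metric.
        return {"loss": units * lr}

    buf = io.StringIO()
    run_grid({"units": [10, 20], "lr": [0.5, 1.0]}, fake_train, buf)
    print(buf.getvalue())
```

With a real model, `out_file` would be an opened file (for example `open("results.csv", "w", newline="")`) and `reset_fn=tf.keras.backend.clear_session`, so each iteration starts from a fresh Keras state instead of adding another model to the existing one.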

Regards
Marvin
