Hi,
I was training a RAVE model with my own custom data. The system I am using has an Nvidia RTX 2070 with 8GB VRAM. During the first phase, VRAM usage was around 50% and everything was working fine. Just before the adversarial phase (2nd phase), the training failed with this error:
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Epoch 6849: 32%|███▏ | 46/146 [00:07<00:17, 5.76it/s, v_num=2]
The training config is v2_small, so 8GB VRAM should be enough for the adversarial phase as well.
Then I continued the training from the last checkpoint on my main system, which has an Nvidia RTX 5070, and it works fine there. I am just curious: is this really an out-of-memory error, or is it something I can work around on the RTX 2070 system?
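
For reference, here is a minimal sketch of the debugging step the error message itself suggests: setting CUDA_LAUNCH_BLOCKING=1 so the stack trace points at the failing call, and printing free VRAM. It assumes the variable is set before CUDA is initialized. The helper name gpu_memory_report is hypothetical and not part of RAVE. It is only a way to collect more information on the RTX 2070 system, not a fix.

```python
import os

# Set before CUDA is initialized (safest: before importing torch), so that
# kernel launches run synchronously and errors surface at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def gpu_memory_report():
    """Hypothetical helper: return (free_bytes, total_bytes) or None."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no CUDA device visible to PyTorch
    # torch.cuda.mem_get_info() returns (free, total) in bytes
    return torch.cuda.mem_get_info()


report = gpu_memory_report()
if report is None:
    print("No CUDA device visible")
else:
    free_b, total_b = report
    print(f"free {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")
```

The same variable can also be exported in the shell before launching the training command. Synchronous launches slow training down, so this is only meant for reproducing the crash.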