
RAVE Model Training Phase 2 Fail | Illegal Memory Access

Hi,

I was training a RAVE model on my own custom data. The system I am using has an Nvidia RTX 2070 with 8 GB of VRAM. During the first phase, VRAM usage was around 50% and everything was working fine. Just before the adversarial phase (the 2nd phase), the training failed with this error:

 torch.AcceleratorError: CUDA error: an illegal memory access was encountered

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Epoch 6849:  32%|███▏      | 46/146 [00:07<00:17,  5.76it/s, v_num=2]   

The training config is v2_small, so 8 GB of VRAM should be enough for the adversarial phase as well.

I then resumed training from the last checkpoint on my main system, which has an Nvidia RTX 5070, and it works fine now. I am just curious: is this really an out-of-memory error, or is it something I can work around on the RTX 2070 system?
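
In case it helps with narrowing this down, here is a minimal sketch of how the run could be relaunched with CUDA_LAUNCH_BLOCKING=1, as the error message suggests, so that the stack trace points at the call that actually failed. The helper names build_debug_env and run_with_blocking_launches are made up for illustration, and the training command is whatever you normally run, passed in as a list of arguments. I have not verified that this changes the outcome on the RTX 2070.

```python
import os
import subprocess


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Suggested by the error message itself: with this set, kernel launches are
    # synchronous, so the reported stack trace matches the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(command):
    """Run `command` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 set.

    Returns the exit code of the child process.
    """
    completed = subprocess.run(command, env=build_debug_env(), check=False)
    return completed.returncode
```

The variable is set for the child process only, so the normal shell environment stays unchanged. Training will run slower with blocking launches, so this is only meant for reproducing the crash, not for a full run.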