Hello,
I am currently trying to train a convolutional neural network (a first, short training run, in an interactive environment via Jupyter-lab) on the GPU partition, but I am stuck on a problem I cannot get past.
Below is the information that seems relevant to me:
Requested profile: 2 x 8 CPUs / 1 x '7g.40g' GPU (to test whether the limitations come from a lack of memory?)
Kernel: Python 3.9 (also tested under Python 3.7)
CNN used: YOLOv5 (object detection - YOLOv5 - GitHub)
Dataset used: 28,651 images, resized to 512x512 on the fly – size on disk: 4.9 GB
Problem: when training starts, an optional caching step is available (meant to speed up the rest of the training). Given the relatively small size of my data (which should therefore normally fit in cache without trouble), I would like to be able to use this feature.
However, the caching step for the training data stops partway through (with no error message displayed), and the kernel goes from "busy" to "idle". This seems to happen at roughly the same point every time, when about 25% of my data has been loaded (around 4-4.5 GB of RAM in use, according to the output of the training function).
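To put a rough number on "should fit in cache": assuming the cache stores the images as decoded uint8 arrays at the training resolution (I am not certain this is exactly how YOLOv5 stores them), a quick estimate gives:

# Back-of-the-envelope estimate of the RAM needed to cache the whole dataset,
# under the (hypothetical) assumption of decoded uint8 images at 512x512x3.
n_images = 28_651
bytes_per_image = 512 * 512 * 3  # height x width x RGB channels, 1 byte each
print(f"{n_images * bytes_per_image / 1e9:.1f} GB")  # ~22.5 GB, far more than the 4.9 GB of compressed files on disk

If that assumption is right, the ~4-4.5 GB used at ~25% of the data would be roughly the expected order of magnitude rather than an anomaly.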
Skipping the caching feature lets training start, but a similar problem occurs after a few moments: the cell execution stops and the kernel goes back to idle, this time with an error message, unfortunately not very helpful (when launched from Python 3.9):
RuntimeError: DataLoader worker (pid 18580) is killed by signal: Killed.
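For context, the cell I run in Jupyter-lab is roughly of the following form (the paths and hyperparameter values below are placeholders; the real call differs):

# Approximate notebook cell used to launch training (placeholder paths/values):
!python train.py --img 512 --batch-size 16 --epochs 5 \
    --data my_dataset.yaml --weights yolov5s.pt --workers 8
# The caching attempt described above is the same call with "--cache ram" added;
# leaving that option out is the variant that starts training and then dies as shown here.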
However, when I try under Python 3.7 (with YOLOv5's requirements installed), I get the following error:
Error details (fairly long):
Traceback (most recent call last):
File "train.py", line 642, in
main(opt)
File "train.py", line 531, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 312, in train
pred = model(imgs) # forward
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/shared/ifbstor1/projects/inbreeding_depression_measures/YOLOv5/yolov5/models/yolo.py", line 209, in forward
return self._forward_once(x, profile, visualize) # single-scale inference, train
File "/shared/ifbstor1/projects/inbreeding_depression_measures/YOLOv5/yolov5/models/yolo.py", line 121, in _forward_once
x = m(x) # run
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/shared/ifbstor1/projects/inbreeding_depression_measures/YOLOv5/yolov5/models/common.py", line 56, in forward
return self.act(self.bn(self.conv(x)))
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
self.eps,
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 19608) is killed by signal: Killed.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 499, in Client
deliver_challenge(c, authkey)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 730, in deliver_challenge
response = connection.recv_bytes(256) # reject large message
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception in thread Thread-4:
Traceback (most recent call last):
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 499, in Client
deliver_challenge(c, authkey)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 730, in deliver_challenge
response = connection.recv_bytes(256) # reject large message
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x2af8e3628170>
Traceback (most recent call last):
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1358, in del
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1283, in _shutdown_workers
AttributeError: 'NoneType' object has no attribute 'python_exit_status'
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x2af8e3628170>
Traceback (most recent call last):
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1358, in del
File "/shared/software/miniconda/envs/python-pytorch-tensorflow-3.7-1.11.0-2.6.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1283, in _shutdown_workers
AttributeError: 'NoneType' object has no attribute 'python_exit_status'
The same error shows up again:
DataLoader worker (pid 19608) is killed by signal: Killed.
But it also looks to me like there are "Connection" problems (?):
ConnectionResetError: [Errno 104] Connection reset by peer
I don't think I have seen many other topics about CNNs on this forum, so I imagine this is not a "common" kind of problem, but I hope someone will have an idea anyway.
While I'm at it, I should mention that over the past few days I have had three or four:
Kernel Starting: 504 Gateway Timeout
... when starting Jupyter sessions; re-selecting the kernel and restarting it does eventually fix it, but in case my problem could come from there, I would rather report it.
If you need any other information, don't hesitate to ask!
Wishing you a good day and thanking you in advance,
Guillaume