Hello,
There seems to be a problem with Slurm at the moment. When I run simple commands such as squeue, sinfo, or sbatch, I get the following error:
slurm_load_jobs error: Socket timed out on send/recv operation
Slurm was working fine until about half an hour ago, but since this morning I had already been having trouble with Python's joblib library when running several jobs in parallel (even though this worked perfectly last week):
Error in the .out file:
LokyProcess-1 failed with traceback:
Traceback (most recent call last):
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 176, in
    process_obj = pickle.load(from_parent)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/externals/loky/backend/synchronize.py", line 131, in __setstate__
    self._semlock = _SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Errors in the .err file:
  File "/shared/ifbstor1/projects/ecoli_dated_cluster_phylogenies/1_Effective_pop_sizes_and_selection/mutation_scores.py", line 621, in
    Parallel(n_jobs=nthreads)( ###### ADDED
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/parallel.py", line 1950, in __call__
    next(output)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/parallel.py", line 1588, in _get_outputs
    self._start(iterator, pre_dispatch)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/parallel.py", line 1574, in _start
    while self.dispatch_one_batch(iterator):
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/parallel.py", line 1462, in dispatch_one_batch
    self._dispatch(tasks)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/parallel.py", line 1384, in _dispatch
    job = self._backend.apply_async(batch, callback=batch_tracker)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 600, in apply_async
    future = self._workers.submit(func)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/externals/loky/reusable_executor.py", line 225, in submit
    return super().submit(fn, *args, **kwargs)
  File "/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 1226, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {EXIT(1), EXIT(1)}
/shared/ifbstor1/software/miniconda/envs/python-pytorch-tensorflow-3.9-1.11.0-2.6.2/lib/python3.9/site-packages/joblib/externals/loky/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 1 leaked file objects to clean up at shutdown
  warnings.warn(
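For context, the failing call in mutation_scores.py uses the standard joblib Parallel(n_jobs=...) pattern. A minimal, self-contained sketch of that structure (score and the task range are placeholders I made up for illustration, not the real computation) is:

```python
from joblib import Parallel, delayed

def score(i):
    # Placeholder standing in for the real per-task work in mutation_scores.py.
    return i * i

nthreads = 2  # assumption: small worker count, just for the sketch

# Same Parallel(n_jobs=...) structure as the failing call; each delayed task
# is dispatched to a separate loky worker process.
results = Parallel(n_jobs=nthreads)(delayed(score)(i) for i in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

On a healthy node this runs in a few seconds; in my case the workers die with the TerminatedWorkerError above before any results come back.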
Would you have any idea what might be causing these problems?
Thank you very much, and have a good day,
Manolo Mischler