Hello, I've been having a problem on the cluster since yesterday afternoon. When I submit a job I get this error: "(launch failed requeued held)". Can someone please explain what the problem is and what I can do to fix it?
Thanks a lot
Hello Julie,
Jobs can be "requeued" after a launch failure.
In your case, as happens from time to time, a node was in an error state (up but not running correctly).
Slurm tried to run your job on this idle node, the launch failed, and the job was put in the "requeued held" state.
The failed node (cpu-node-35) has been rebooted and is working correctly now.
So there is nothing for you to do. It was an error on the server side.
Thanks for reporting
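
For anyone who wants to check whether their own jobs ended up in this state, here is a minimal sketch. It assumes you have saved the output of `squeue -h -u $USER -o "%i|%r"` (job ID and pending reason, separated by `|`). The function name and the sample job IDs are made up for illustration.

```python
def held_after_launch_failure(squeue_output: str) -> list:
    """Return the IDs of jobs whose reason is 'launch failed requeued held'.

    Assumes one job per line in the form '<jobid>|<reason>', as produced by
    a format string like "%i|%r" (an assumption, not a site convention).
    """
    job_ids = []
    for line in squeue_output.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or unexpected lines
        job_id, reason = line.split("|", 1)
        # Tolerate underscores or mixed case in the reason text.
        normalized = reason.strip().lower().replace("_", " ")
        if "launch failed requeued held" in normalized:
            job_ids.append(job_id.strip())
    return job_ids


# Hypothetical sample output, not taken from the real cluster.
sample = "101|launch failed requeued held\n102|Priority\n103|Resources\n"
print(held_after_launch_failure(sample))
```

If a job still shows as held once the node has been repaired, it may need to be released with `scontrol release <jobid>`. Check with the admins first if you are unsure.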