Hi team,
I've tried to launch various jobs over the past few days, but they all fail with the following error, or one like it:
srun: job 34795653 queued and waiting for resources
srun: job 34795653 has been allocated resources
srun: error: Task launch for StepId=34795653.0 failed on node cpu-node-87: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Do you have any idea what could cause this? As an example, one of my full commands was:
srun --cpus-per-task=12 --mem=10GB --partition=long iqtree -s euk5_homs9.Kleisin.cdhit_40.mafft_merge.gappyout.drop.fa -B 1000 -T AUTO -m LG+C60 -pre euk5_homs9.Kleisin.cdhit_40.mafft_merge.gappyout.drop.iqtree &
Thank you in advance for your help.
Best regards,
Jolien van Hooff
I get this type of error:
34805387 long chloropl nbouche PD 0:00 1 (launch failed requeued held)
34804895 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804892 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804880 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804878 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804866 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804852 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804840 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804825 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
34804816 long metaprof nbouche PD 0:00 1 (launch failed requeued held)
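For anyone with a long list like the one above, here is a minimal sketch for pulling out the affected job IDs. It assumes the default squeue column layout shown in this thread (job ID in the first column); `held_job_ids` is a hypothetical helper name, not a Slurm command. Releasing the jobs with `scontrol release` may or may not be permitted for regular users, depending on how the hold was set, and it only helps once the underlying node problem is fixed.

```shell
# Print the job IDs of jobs whose squeue reason is
# "(launch failed requeued held)".
# Reads squeue-style text on stdin; assumes the job ID is in column 1.
held_job_ids() {
    awk '/\(launch failed requeued held\)/ { print $1 }'
}

# Possible usage on the cluster (assumes squeue and scontrol are on PATH):
#   squeue -u "$USER" | held_job_ids
#   for id in $(squeue -u "$USER" | held_job_ids); do scontrol release "$id"; done
```

The function only filters text, so it can be tried on a saved copy of the squeue output before touching any job.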
I have the same issue...
My jobs are multi-threaded. Is that the case for you too?
Lucas Bonometti
Some are, some are not.
I have noticed several submission errors since that issue.
Maybe there was a little lag on the storage.
The errors disappeared a few minutes later.
Are you still having issues?
Yes, mine are multi-threaded. I also tried to exclude cpu-node-87 with --exclude=cpu-node-87, and then the job goes into pending status (PD).
Thank you for this!
It worked for me with --exclude=cpu-node-87. I also added --time=4-00:00.
Have you tried limiting the time (to make the job stop before the maintenance)?
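To make the two workarounds concrete, here is a small sketch that combines the resource options from the first post with --exclude and --time. `srun_cmd` is a hypothetical helper and `alignment.fa` is a placeholder file name; the function only prints the command (a dry run), so you can check it before actually launching anything.

```shell
# Compose an srun command line with a node exclusion and a time limit.
# $1: node to exclude, $2: time limit (e.g. 4-00:00 = 4 days, 0 hours, 0 minutes),
# remaining arguments: the program to run and its options.
# The command is only printed, not executed.
srun_cmd() {
    node=$1
    limit=$2
    shift 2
    echo srun --cpus-per-task=12 --mem=10GB --partition=long \
        --exclude="$node" --time="$limit" "$@"
}

# Example with a placeholder input file:
srun_cmd cpu-node-87 4-00:00 iqtree -s alignment.fa -T AUTO
```

Once the printed line looks right, it can be copied and run as is. As reported further down, this did not solve the problem for everyone.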
I still get the same error, even with --exclude=cpu-node-87 and --time set.
Thanks for your feedback and your trial-and-error work.
Indeed, there was an error on cpu-node-87
(account management service error).
It's repaired.
I didn't find anything else. @nbouche, if you could provide us with more information (job ID, command to reproduce this error, etc.), it would help us.
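As a possible starting point for collecting that information, the sketch below prints two standard Slurm queries for a given job ID. `job_report_cmds` is a hypothetical helper name and the list of sacct fields is only a suggestion. Note that `scontrol show job` generally only answers for jobs the controller still remembers, while `sacct` relies on job accounting being enabled.

```shell
# Print the commands that gather details about one job, so their output
# can be pasted into a reply. $1: the job ID.
# The commands are printed, not executed.
job_report_cmds() {
    printf 'scontrol show job %s\n' "$1"
    printf 'sacct -j %s --format=JobID,JobName,Partition,State,ExitCode,NodeList\n' "$1"
}

# Example with one of the job IDs listed earlier in this thread:
job_report_cmds 34805387
```

Together with the exact srun or sbatch command line, the output of these two queries should cover most of what is being asked for here.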