srun: error: Task launch failed: Unspecified error

jvanhooff · August 8, 2023, 7:31

Hi team,

I've tried to launch various jobs over the past few days, but they all fail with the following error, or a similar one:

srun: job 34795653 queued and waiting for resources
srun: job 34795653 has been allocated resources
srun: error: Task launch for StepId=34795653.0 failed on node cpu-node-87: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Do you have any idea what could be causing this? For example, one of my full commands was:

srun --cpus-per-task=12 --mem=10GB --partition=long iqtree -s euk5_homs9.Kleisin.cdhit_40.mafft_merge.gappyout.drop.fa -B 1000 -T AUTO -m LG+C60 -pre euk5_homs9.Kleisin.cdhit_40.mafft_merge.gappyout.drop.iqtree &

Thank you in advance for your help.

Best regards,
Jolien van Hooff

nbouche · August 9, 2023, 8:23

I get this type of error:

         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      34805387      long chloropl  nbouche PD       0:00      1 (launch failed requeued held)
      34804895      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804892      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804880      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804878      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804866      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804852      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804840      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804825      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
      34804816      long metaprof  nbouche PD       0:00      1 (launch failed requeued held)
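
Editor's sketch: a listing like the one above can be filtered to pull out only the job IDs stuck with the reason "launch failed requeued held". The helper name held_job_ids is hypothetical, and it assumes the default squeue column layout shown here, with the job ID in the first column. The commented scontrol release step is an assumption about how such jobs would be resumed once the node is fixed; on some sites only an administrator may be allowed to release them.

```shell
# Print the job IDs of pending jobs whose reason is
# "launch failed requeued held". Reads squeue-style text on stdin,
# so it can be tried on pasted output without access to a cluster.
held_job_ids() {
    awk '/\(launch failed requeued held\)/ { print $1 }'
}

# Hypothetical usage on the cluster (not run here):
#   squeue -u "$USER" | held_job_ids
#   squeue -u "$USER" | held_job_ids | xargs -n1 scontrol release
```

Reading from stdin keeps the filter independent of the scheduler, so the same function works on live squeue output or on a saved copy of it.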

bonospora · August 9, 2023, 8:34

Hello,

I have the same issue.
My jobs are multi-threaded. Is that the case for you too?

Lucas Bonometti

nbouche · August 9, 2023, 9:22

Some are, some are not.

I have noticed several submission errors since that issue:

jvanhooff · August 9, 2023, 9:29

Yes, mine are multi-threaded. I also tried excluding cpu-node-87 with --exclude=cpu-node-87, and then the job goes into pending state (PD).

bonospora · August 9, 2023, 11:46

Thank you for this!

It worked for me with --exclude=cpu-node-87. I also added --time=4-00:00.

Have you tried limiting the run time (so that the job ends before the maintenance)?

Lucas
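
Editor's sketch: the workaround described above combines two srun options, --exclude to skip the faulty node and --time to cap the wall time (4-00:00 reads as 4 days, 0 hours, 0 minutes in Slurm's days-hours:minutes form). The helper build_srun below is hypothetical and only prints the resulting command line rather than running it, so it can be checked without a cluster. The node name and time limit are taken from this thread; the remaining arguments are a shortened illustration, not a complete command.

```shell
# Build (but do not run) an srun command line that avoids a given node
# and sets a wall-time limit. Arguments: node to exclude, time limit,
# then the rest of the srun options and the program to launch.
build_srun() {
    bad_node=$1
    time_limit=$2
    shift 2
    printf 'srun --exclude=%s --time=%s' "$bad_node" "$time_limit"
    printf ' %s' "$@"
    printf '\n'
}

# Print the command for inspection before submitting it by hand.
build_srun cpu-node-87 4-00:00 --cpus-per-task=12 --mem=10GB --partition=long iqtree -T AUTO
```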

nbouche · August 9, 2023, 12:38

I still get the same error even with --exclude=cpu-node-87 and --time set.

dbenaben · August 16, 2023, 11:13

Hello,

Thanks for your feedback and your trial-and-error work.

Indeed, there was an error on cpu-node-87 (account management service error).
It has been repaired.

I didn't find anything else. @nbouche, if you could provide us with more information (job ID, command to reproduce this error, etc.), it would help us.