Hello,
I have been using the bigmem partition for a few months now and have not run into any problems, but now when I launch a job using srun, my run fails with the following error:
slurmstepd: error: *** STEP 17026592.0 ON cpu-node-69 CANCELLED AT 2021-06-03T19:50:39 ***
Any advice on why this is happening?
Thank you!
Brittany
Hello Brittany,
Nothing looks wrong from our side.
The job exit code is "0:9" (killed by signal 9). That could be a consequence of an exceeded memory limit, but we are not sure.
Could you retry and provide us with more information (sbatch/srun parameters, script path)?
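In the meantime, you can check a job's state and exit code yourself with sacct; for example, using the step ID from your error message (just a sketch, adjust the job ID as needed):
sacct -j 17026592 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS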
Hi,
Thank you for getting back to me!
Here is the script:
srun -c 16 -p bigmem -e chain_2.err -o chain_2.out --mem 300G -J 2_216S145 mpirun --oversubscribe -np 16 pb_mpi chain_2 &
Where I am running it:
/shared/projects/final_markers/phylobayes/nm/216S145F
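In case it is useful, the same launch written as a batch script would be something like this (just a sketch; the file name run_chain_2.sh is my own example, and it would be submitted with sbatch run_chain_2.sh):
#!/bin/bash
#SBATCH -p bigmem
#SBATCH -c 16
#SBATCH --mem 300G
#SBATCH -J 2_216S145
#SBATCH -e chain_2.err
#SBATCH -o chain_2.out
mpirun --oversubscribe -np 16 pb_mpi chain_2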
Error message:
read data from file : CAT_216S145F.phy
number of taxa : 216
number of sites : 41177
number of states: 20
run started
slurmstepd: error: *** STEP 17073188.0 ON cpu-node-69 CANCELLED AT 2021-06-07T17:08:57 ***
I tried again and the same thing happens: it launches and starts normally, and then about 5 hours later it is cancelled.
I don't think it is a memory issue; seff says I only used 10% of the allotted memory:
[bbaker@clust-slurm-client 216S145F]$ seff 17073188
Job ID: 17073188
Cluster: core
User/Group: bbaker/bbaker
State: FAILED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:27:44
CPU Efficiency: 99.70% of 1-11:34:08 core-walltime
Job Wall-clock time: 02:13:23
Memory Utilized: 30.34 GB
Memory Efficiency: 10.11% of 300.00 GB
Thank you!
Brittany
Hi Brittany,
We see that your job on bigmem received a signal 9 (SIGKILL).
But I agree, I don't think it is a memory issue.
I suspect trouble from MPI (version, compiler, or something else in the environment).
There were also two MPI jobs on the same bigmem node killed at the same time, so there may be a glitch there.
Your jobs on bigmem:
JobID JobName Partition ReqMem MaxRSS AllocCPUS State ExitCode Submit Timelimit Elapsed End
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------------- ---------- ---------- -------------------
15708030 phylobaye+ bigmem 300Gn 32.73G 16 CANCELLED+ 0:0 2021-04-07T14:02:58 60-00:00:+ 56-20:40:43 2021-06-03T10:43:41
15709119 phylobaye+ bigmem 300Gn 32.73G 16 CANCELLED+ 0:0 2021-04-07T14:50:05 60-00:00:+ 56-19:53:26 2021-06-03T10:43:31
15709148 phylobaye+ bigmem 300Gn 32.52G 16 CANCELLED+ 0:0 2021-04-07T14:51:02 60-00:00:+ 52-17:02:07 2021-06-03T10:43:22
16034105 phylobaye+ bigmem 300Gn 16 FAILED 2:0 2021-04-21T13:18:24 60-00:00:+ 00:00:00 2021-05-07T14:31:21
16406383 phylobaye+ bigmem 300Gn 16 FAILED 2:0 2021-05-12T08:49:49 60-00:00:+ 00:00:00 2021-05-12T08:49:49
16406407 phylobaye+ bigmem 300Gn 32.73G 16 FAILED 7:0 2021-05-12T08:50:22 60-00:00:+ 6-04:05:56 2021-05-18T12:56:18
17024147 1_216S145 bigmem 250Gn 30.47G 16 FAILED 0:9 2021-06-03T10:49:31 60-00:00:+ 02:38:42 2021-06-03T13:28:13
17024150 2_216S145 bigmem 250Gn 28.95G 16 FAILED 0:9 2021-06-03T10:51:36 60-00:00:+ 02:36:37 2021-06-03T13:28:13
17024256 3_216S145 bigmem 250Gn 28.94G 16 FAILED 0:9 2021-06-03T10:53:13 60-00:00:+ 02:35:00 2021-06-03T13:28:13
17024287 4_216S145 bigmem 250Gn 29.62G 16 FAILED 0:9 2021-06-03T10:55:20 60-00:00:+ 02:32:53 2021-06-03T13:28:13
17026592 3_216S145 bigmem 250Gn 30.74G 16 FAILED 0:9 2021-06-03T16:56:44 60-00:00:+ 02:53:55 2021-06-03T19:50:39
17073134 2_216S145 bigmem 300Gn 16 FAILED 1:0 2021-06-07T14:52:50 60-00:00:+ 00:00:01 2021-06-07T14:52:51
17073148 2_216S145 bigmem 300Gn 28.19G 16 CANCELLED+ 0:0 2021-06-07T14:53:37 60-00:00:+ 00:01:37 2021-06-07T14:55:14
17073188 2_216S145 bigmem 300Gn 30.34G 16 FAILED 0:9 2021-06-07T14:55:35 60-00:00:+ 02:13:23 2021-06-07T17:08:58
17073228 4_216S145 bigmem 300Gn 29.00G 16 FAILED 0:9 2021-06-07T14:57:36 60-00:00:+ 02:11:22 2021-06-07T17:08:58
17094328 5_216S145 bigmem 300Gn 16 FAILED 2:0 2021-06-08T10:50:22 60-00:00:+ 00:00:00 2021-06-08T10:50:22
17094330 5_216S145 bigmem 300Gn 16 RUNNING 0:0 2021-06-08T10:50:33 60-00:00:+ 04:43:43 Unknown
So I don't see an issue with bigmem or the cluster itself, and I suspect a problem with MPI or your environment.
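If you want to narrow it down, a few quick checks of your MPI environment could help (a sketch; it assumes pb_mpi and mpirun are on your PATH, as in your srun line, and that pb_mpi is dynamically linked):
module list                          # which environment modules are loaded
mpirun --version                     # which MPI implementation and version you get
ldd $(which pb_mpi) | grep -i mpi    # which MPI library pb_mpi is linked against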
I hope that helps.
Okay, thank you for your help!
Best,
Brittany