Runs failing on bigmem partition

Hello,

I have been using the bigmem partition for a few months without any problems, but now when I launch a job with srun, the run fails with the following error:

slurmstepd: error: *** STEP 17026592.0 ON cpu-node-69 CANCELLED AT 2021-06-03T19:50:39 ***

Any advice on why this is happening?

Thank you!

Brittany

Hello Brittany,

I don't see anything wrong on our side.
The job exit code is "0:9" (killed by signal 9). This could be the consequence of an exceeded memory limit, but I am not sure.
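
For reference, Slurm reports ExitCode as two numbers separated by a colon: the exit status of the process, then the signal that terminated it (0 if none). Below is a minimal sketch of how to read that field. The helper name decode_exitcode is made up for illustration and is not a Slurm command.

```shell
#!/bin/sh
# Decode a Slurm ExitCode field of the form "<exit status>:<signal>".
# decode_exitcode is a hypothetical helper name, not part of Slurm.
decode_exitcode() {
    status=${1%%:*}   # part before the colon: exit status of the process
    signal=${1##*:}   # part after the colon: terminating signal (0 = none)
    if [ "$signal" -ne 0 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $signal (SIG$(kill -l "$signal"))"
    elif [ "$status" -ne 0 ]; then
        echo "exited with status $status"
    else
        echo "completed normally"
    fi
}

decode_exitcode "0:9"
decode_exitcode "2:0"
```

Read this way, "0:9" only says that the step was killed with SIGKILL. It does not say who sent the signal or why.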

Could you retry and give us more information (sbatch/srun parameters, script path)?

Hi,
Thank you for getting back to me!

Here is the script:
srun -c 16 -p bigmem -e chain_2.err -o chain_2.out --mem 300G -J 2_216S145 mpirun --oversubscribe -np 16 pb_mpi chain_2 &

Where I am running it:
/shared/projects/final_markers/phylobayes/nm/216S145F

Error message:
read data from file : CAT_216S145F.phy
number of taxa : 216
number of sites : 41177
number of states: 20
run started

slurmstepd: error: *** STEP 17073188.0 ON cpu-node-69 CANCELLED AT 2021-06-07T17:08:57 ***
chain_2.err (END)

I tried again and the same thing happens. It launches and starts normally and then about 5 hours later it is cancelled.

I don't think it is a memory issue, since seff reports that I only used 10% of the allotted memory:
[bbaker@clust-slurm-client 216S145F]$ seff 17073188
Job ID: 17073188
Cluster: core
User/Group: bbaker/bbaker
State: FAILED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:27:44
CPU Efficiency: 99.70% of 1-11:34:08 core-walltime
Job Wall-clock time: 02:13:23
Memory Utilized: 30.34 GB
Memory Efficiency: 10.11% of 300.00 GB
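
The two efficiency figures can be checked by hand from the other lines of the report. The sketch below assumes that CPU efficiency is CPU time divided by (wall-clock time x cores) and that memory efficiency is memory used divided by memory requested, which is consistent with the numbers shown. The helper names to_seconds and percent are made up for illustration.

```shell
#!/bin/sh
# Convert a Slurm duration ([D-]HH:MM:SS) to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print (($1 * 24 + $2) * 60 + $3) * 60 + $4
        else         print ($1 * 60 + $2) * 60 + $3
    }'
}

# Print "part" as a percentage of "whole", with two decimals.
percent() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f\n", 100 * part / whole }'
}

wall=$(to_seconds "02:13:23")      # Job Wall-clock time
cpu=$(to_seconds "1-11:27:44")     # CPU Utilized
cores=16                           # Cores per node

echo "CPU efficiency:    $(percent "$cpu" $((wall * cores)))%"
echo "Memory efficiency: $(percent 30.34 300)%"
```

Both values match the report (99.70% and 10.11%), so the job was keeping all 16 cores busy and staying far below the 300 GB it requested.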

Thank you!
Brittany

Hi Brittany,

We see that your job on bigmem received signal 9 (SIGKILL).

But I agree, I don't think it's a memory issue.

I suspect a problem with MPI (version, compiler, or something else).
There are also 2 MPI jobs on the same node (bigmem) that were killed at the same time, so there may be a glitch here.

Your jobs on bigmem:
       JobID    JobName  Partition     ReqMem     MaxRSS  AllocCPUS      State ExitCode              Submit  Timelimit    Elapsed                 End 
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------------- ---------- ---------- ------------------- 
15708030     phylobaye+     bigmem      300Gn     32.73G         16 CANCELLED+      0:0 2021-04-07T14:02:58 60-00:00:+ 56-20:40:43 2021-06-03T10:43:41 
15709119     phylobaye+     bigmem      300Gn     32.73G         16 CANCELLED+      0:0 2021-04-07T14:50:05 60-00:00:+ 56-19:53:26 2021-06-03T10:43:31 
15709148     phylobaye+     bigmem      300Gn     32.52G         16 CANCELLED+      0:0 2021-04-07T14:51:02 60-00:00:+ 52-17:02:07 2021-06-03T10:43:22 
16034105     phylobaye+     bigmem      300Gn                    16     FAILED      2:0 2021-04-21T13:18:24 60-00:00:+   00:00:00 2021-05-07T14:31:21 
16406383     phylobaye+     bigmem      300Gn                    16     FAILED      2:0 2021-05-12T08:49:49 60-00:00:+   00:00:00 2021-05-12T08:49:49 
16406407     phylobaye+     bigmem      300Gn     32.73G         16     FAILED      7:0 2021-05-12T08:50:22 60-00:00:+ 6-04:05:56 2021-05-18T12:56:18 
17024147      1_216S145     bigmem      250Gn     30.47G         16     FAILED      0:9 2021-06-03T10:49:31 60-00:00:+   02:38:42 2021-06-03T13:28:13 
17024150      2_216S145     bigmem      250Gn     28.95G         16     FAILED      0:9 2021-06-03T10:51:36 60-00:00:+   02:36:37 2021-06-03T13:28:13 
17024256      3_216S145     bigmem      250Gn     28.94G         16     FAILED      0:9 2021-06-03T10:53:13 60-00:00:+   02:35:00 2021-06-03T13:28:13 
17024287      4_216S145     bigmem      250Gn     29.62G         16     FAILED      0:9 2021-06-03T10:55:20 60-00:00:+   02:32:53 2021-06-03T13:28:13 
17026592      3_216S145     bigmem      250Gn     30.74G         16     FAILED      0:9 2021-06-03T16:56:44 60-00:00:+   02:53:55 2021-06-03T19:50:39 
17073134      2_216S145     bigmem      300Gn                    16     FAILED      1:0 2021-06-07T14:52:50 60-00:00:+   00:00:01 2021-06-07T14:52:51 
17073148      2_216S145     bigmem      300Gn     28.19G         16 CANCELLED+      0:0 2021-06-07T14:53:37 60-00:00:+   00:01:37 2021-06-07T14:55:14 
17073188      2_216S145     bigmem      300Gn     30.34G         16     FAILED      0:9 2021-06-07T14:55:35 60-00:00:+   02:13:23 2021-06-07T17:08:58 
17073228      4_216S145     bigmem      300Gn     29.00G         16     FAILED      0:9 2021-06-07T14:57:36 60-00:00:+   02:11:22 2021-06-07T17:08:58 
17094328      5_216S145     bigmem      300Gn                    16     FAILED      2:0 2021-06-08T10:50:22 60-00:00:+   00:00:00 2021-06-08T10:50:22 
17094330      5_216S145     bigmem      300Gn                    16    RUNNING      0:0 2021-06-08T10:50:33 60-00:00:+   04:43:43             Unknown
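
In the listing above, several of the jobs with exit code 0:9 end in the very same second. A rough sketch for spotting that pattern is shown below. It assumes pipe-separated input with the columns JobID, State, ExitCode and End, for example as produced by sacct with the -P and -n options. The helper name shared_kill_times is made up for illustration, and the sample rows are copied from the listing.

```shell
#!/bin/sh
# Report end times shared by more than one signal-killed job.
# Input (stdin): lines of "JobID|State|ExitCode|End", e.g. from
#   sacct -X -P -n --format=JobID,State,ExitCode,End
# shared_kill_times is a hypothetical helper name, not part of Slurm.
shared_kill_times() {
    awk -F'|' '
        { split($3, code, ":") }                       # code[2] = signal
        code[2] + 0 != 0 { n[$4]++; ids[$4] = ids[$4] " " $1 }
        END { for (t in n) if (n[t] > 1) print t ":" ids[t] }
    ' | sort
}

shared_kill_times <<'EOF'
17024147|FAILED|0:9|2021-06-03T13:28:13
17024150|FAILED|0:9|2021-06-03T13:28:13
17024256|FAILED|0:9|2021-06-03T13:28:13
17024287|FAILED|0:9|2021-06-03T13:28:13
17026592|FAILED|0:9|2021-06-03T19:50:39
17073188|FAILED|0:9|2021-06-07T17:08:58
17073228|FAILED|0:9|2021-06-07T17:08:58
EOF
```

Jobs that were started minutes apart but die together point to something external to each individual run rather than to the run itself hitting a limit.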

So I don't see an issue with bigmem or the cluster, and I suspect an issue with MPI or your environment.
I hope that helps.

Okay, thank you for your help!

Best,
Brittany