When I submit a job to Slurm, it sometimes never appears in squeue and is never executed, even though the submission returns a job ID. For instance, job 40758209. Thanks very much for checking on it.
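In case it is useful for debugging on my side as well: since squeue only lists pending and running jobs, a job that fails at launch disappears from it, but it can still be looked up in accounting, e.g.:

    # Query the accounting database for the job that vanished from squeue
    sacct -j 40758209 --format=JobID,JobName,State,ExitCode,Elapsed

    # Full details, as long as Slurm still keeps the record in memory
    scontrol show job 40758209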
This is another example. The job failed with ExitCode 0:53.
JobId=40798301 JobName=Alignment.sh
UserId=wcheng(166765) GroupId=wcheng(166765) MCS_label=N/A
Priority=14506249 Nice=0 Account=chengw_wgs QOS=normal
JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:53
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2024-07-16T09:08:50 EligibleTime=2024-07-16T09:08:50
AccrueTime=2024-07-16T09:08:50
StartTime=2024-07-16T09:08:50 EndTime=2024-07-16T09:08:50 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-07-16T09:08:50 Scheduler=Main
Partition=fast AllocNode:Sid=clust-slurm-client3:2147993
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cpu-node-55
BatchHost=cpu-node-55
NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=50G,node=1,billing=20
Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=50G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/shared/ifbstor1/projects/chengw_wgs/Codes/Alignment.sh --fq1 InputData/TRCC-35T_DHG01236_H2LFLCCXX_L8_1.fq.gz --fq2 InputData/TRCC-35T_DHG01236_H2LFLCCXX_L8_2.fq.gz --sample TRCC-35T --lane L8
WorkDir=/shared/ifbstor1/projects/chengw_wgs
StdErr=/shared/ifbstor1/projects/chengw_wgs/log/40798301.err
StdIn=/dev/null
StdOut=/shared/ifbstor1/projects/chengw_wgs/log/40798301.out
Power=
When I log into node cpu-node-55, the directory /shared/projects is not accessible; running "file /shared/projects" reports "/shared/projects: broken symbolic link to /shared/ifbstor1/projects".
The same happens on cpu-node-56 and cpu-node-57. Please be careful.
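For reference, this is a rough way to reproduce the check on each node from the login node (assuming srun is allowed to target specific nodes with -w on this cluster):

    for node in cpu-node-55 cpu-node-56 cpu-node-57; do
        echo "== $node =="
        # readlink -e exits non-zero when the symlink target does not exist
        srun -w "$node" -N1 -n1 bash -c \
            'readlink -e /shared/projects >/dev/null && echo OK || echo BROKEN'
    done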
Indeed, storage or nodes are sometimes in a bad state; they are usually retired automatically by the system, but not always. Thanks for the report.
It's fully functional now.
The nodes cpu-node-94, cpu-node-95, cpu-node-97, cpu-node-98, cpu-node-100, and cpu-node-101 cannot access /shared/projects or /shared/home at the moment. Perhaps an automated script to check storage accessibility would help?
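For example, something along these lines could be run by Slurm's HealthCheckProgram or from cron on each node; the paths, timeout, and drain action below are only a rough, untested sketch:

    #!/bin/bash
    # Sketch of a per-node storage check: drain the node when key mounts are unreachable.
    PATHS="/shared/projects /shared/home"

    for p in $PATHS; do
        # readlink -e fails on broken symlinks; timeout guards against a hung NFS mount
        if ! timeout 10 readlink -e "$p" >/dev/null; then
            scontrol update NodeName="$(hostname -s)" State=DRAIN \
                Reason="storage check failed: $p unreachable"
            exit 1
        fi
    done
    exit 0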
Indeed. Thanks @Wenxuan
We are working on it!