When I submit a job to Slurm, it sometimes never appears in squeue and is never executed, even though the submission returns a job ID. For instance, job 40758209. Thanks very much for checking on it.
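In case it is useful for debugging on my side as well: since squeue only lists pending and running jobs, a job that fails at launch disappears from it, but it can still be looked up in accounting, e.g.:

    # Query the accounting database for the job that vanished from squeue
    sacct -j 40758209 --format=JobID,JobName,State,ExitCode,Elapsed

    # Full details, as long as Slurm still keeps the record in memory
    scontrol show job 40758209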
This is another example. The job failed with ExitCode 0:53.
JobId=40798301 JobName=Alignment.sh
UserId=wcheng(166765) GroupId=wcheng(166765) MCS_label=N/A
Priority=14506249 Nice=0 Account=chengw_wgs QOS=normal
JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:53
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2024-07-16T09:08:50 EligibleTime=2024-07-16T09:08:50
AccrueTime=2024-07-16T09:08:50
StartTime=2024-07-16T09:08:50 EndTime=2024-07-16T09:08:50 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-07-16T09:08:50 Scheduler=Main
Partition=fast AllocNode:Sid=clust-slurm-client3:2147993
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cpu-node-55
BatchHost=cpu-node-55
NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=50G,node=1,billing=20
Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=50G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/shared/ifbstor1/projects/chengw_wgs/Codes/Alignment.sh --fq1 InputData/TRCC-35T_DHG01236_H2LFLCCXX_L8_1.fq.gz --fq2 InputData/TRCC-35T_DHG01236_H2LFLCCXX_L8_2.fq.gz --sample TRCC-35T --lane L8
WorkDir=/shared/ifbstor1/projects/chengw_wgs
StdErr=/shared/ifbstor1/projects/chengw_wgs/log/40798301.err
StdIn=/dev/null
StdOut=/shared/ifbstor1/projects/chengw_wgs/log/40798301.out
Power=
When I log into node cpu-node-55, the directory /shared/projects is not accessible; running "file /shared/projects" reports "/shared/projects: broken symbolic link to /shared/ifbstor1/projects".
The same happens on cpu-node-56 and cpu-node-57. Please be careful.
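For reference, this is a rough way to reproduce the check on each node from the login node (assuming srun is allowed to target specific nodes with -w on this cluster):

    for node in cpu-node-55 cpu-node-56 cpu-node-57; do
        echo "== $node =="
        # readlink -e exits non-zero when the symlink target does not exist
        srun -w "$node" -N1 -n1 bash -c \
            'readlink -e /shared/projects >/dev/null && echo OK || echo BROKEN'
    done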
Indeed, storage or nodes are sometimes in a bad state; they are usually retired automatically by the system, but not always. Thanks for the report.
It's fully functional now.
The nodes cpu-node-94, cpu-node-95, cpu-node-97, cpu-node-98, cpu-node-100, and cpu-node-101 cannot access /shared/projects or /shared/home at the moment. Perhaps an automated script to check storage accessibility would help?
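For example, something along these lines could be run by Slurm's HealthCheckProgram or from cron on each node; the paths, timeout, and drain action below are only a rough, untested sketch:

    #!/bin/bash
    # Sketch of a per-node storage check: drain the node when key mounts are unreachable.
    PATHS="/shared/projects /shared/home"

    for p in $PATHS; do
        # readlink -e fails on broken symlinks; timeout guards against a hung NFS mount
        if ! timeout 10 readlink -e "$p" >/dev/null; then
            scontrol update NodeName="$(hostname -s)" State=DRAIN \
                Reason="storage check failed: $p unreachable"
            exit 1
        fi
    done
    exit 0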
Indeed. Thanks @Wenxuan
We are working on it!