Nextflow pipeline with STAR failed without an evident error

Hello dear Cluster Team,
I'd like your help tracing the problem with my last series of jobs, which failed without any specific error message in the log files of either STAR or Nextflow.
The jobs concerned are:

jobIDs = { 13466845 - 51}, -u mkondili, on 17.October.

My doubts mainly concern the memory requested and used by the tool.
STAR, which I use to align FASTQ files, requires 40 GB of memory, which I declare in the process that calls it.
In the sbatch script I then declare the following:

#SBATCH -A mbnl_dct
#SBATCH --mem 40GB
#SBATCH -n 1
#SBATCH -c 8

In contrast, Nextflow itself needs at most 10 GB to launch and schedule the jobs on the cores,
but when I tried with
#SBATCH --mem 10GB the jobs failed anyway.

So I have a few questions.
Is there a default limit on the cluster for very demanding tools, which my script exceeds?
Or is there an error in the way I run sbatch / Nextflow?

I am not sure how many cores to request for 40 GB. Should the following relation hold?

cores x mem/core = total Sbatch --mem ?

Does the sbatch --mem parameter refer to the memory used by Nextflow, or to the total memory used by all the processes running in the main pipeline?
I should also learn how to calculate the memory each tool consumes, to optimise my scripts.
If you have any tools/commands in the cluster for that, please share.

Thanks in advance,
I hope I have explained enough to help you understand.
Let me know if you have further questions.

Hello @maria_myologie,

The --mem parameter of sbatch indicates the total amount of RAM reserved for the job.
If you wish to have 40GB per CPU, you have to use the --mem-per-cpu parameter.
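
As a rough illustration of the difference between the two options, here is a minimal sketch of the arithmetic for a single-node job. The function name `total_mem_gb` is made up for this example; the values 8 and 40 are the ones from the sbatch directives quoted in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total RAM reserved when --mem-per-cpu is used.
# $1 = CPUs per task (-c), $2 = memory per CPU in GB (--mem-per-cpu)
total_mem_gb() {
    local cpus=$1 mem_per_cpu_gb=$2
    echo $(( cpus * mem_per_cpu_gb ))
}

# -c 8 with --mem-per-cpu 40GB reserves 8 x 40 GB:
total_mem_gb 8 40    # prints 320

# -c 8 with --mem 40GB reserves 40 GB in total, i.e. 5 GB per CPU:
echo $(( 40 / 8 ))   # prints 5
```

So `cores x mem/core = total` holds for --mem-per-cpu, while --mem already is the total.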

Do you have a file called slurm-13466845.err or slurm-13466845.out in your working directory? It could contain relevant information about why your job is failing.
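
On the question of how to measure what each tool actually consumes: Slurm's accounting command `sacct` can report the peak resident memory (MaxRSS) of a finished job, provided accounting is enabled on the cluster. The sketch below is only an illustration: `maxrss_to_gib` is a hypothetical helper that converts a value as printed by sacct into GiB, and the sample value is invented, not taken from a real job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a MaxRSS value as printed by sacct
# (e.g. "41943040K") to GiB. Handles K, M and G suffixes;
# a bare number is treated as bytes (assumption).
maxrss_to_gib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num  = v + 0                 # awk keeps the leading numeric part
        if      (unit == "K") gib = num / 1024 / 1024
        else if (unit == "M") gib = num / 1024
        else if (unit == "G") gib = num
        else                  gib = num / 1024 / 1024 / 1024
        printf "%.1f\n", gib
    }'
}

# On the cluster, the raw values would come from something like:
#   sacct -j 13466845 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State
maxrss_to_gib 41943040K   # invented sample value
```

Comparing MaxRSS with ReqMem for a few representative runs gives an idea of how much memory to request next time.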

Hello Julien !

The slurm.out file contains this multiple times, but no final error message :

executor > slurm (6)
[52/560488] process > STAR_Alignment (5) [ 0%] 0 of 6

The .err file was empty actually !

I am confused about how to run my pipeline, because in the Nextflow script I also define the resources I need, and the processes run file by file, so each file can be assigned to its own core.

process STAR_Alignment {
cpus "${params.cpus}"
memory "40G"
module "star/2.7.5a:perl/5.26.2"
}

params.cpus = 8 is set each time for one process, and is the same as SBATCH -c 8.
Does this seem correct?

Ping @team.workflow (sorry to the Snakemake folks for the false alarm)

@maria_myologie
What do you think about opening this thread to allow more contributors?

Hello,
Sure, I don't mind.
It's only that I just re-ran it, and this time I got an error related to STAR itself and not to the cluster parameters or the pipeline design... so it might be off topic! Should I post the error this time?

Yes please :slight_smile:

In slurm.out :

> Command error:
> 
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: Expected bin size=538502234 ; size on disk=120863510 ; bin number=47

There is a discussion with the developer of STAR here:

The disk space needed for each file being processed is 3 x its gzipped size!

Ran with STAR/2.7.5a
The FASTQ files range from 3 to 9 GB gzipped.

In STAR command I use :
> --runThreadN 8

in sbatch.sh script :

--mem-per-cpu 40GB
-c 8

Thanks for any suggestion !

The error does indeed seem to be due to disk space. You can check your disk space with the following command:
mfsgetquota -h /shared/projects/<your project>/

Which workflow are you using: nf-core or a home-made one?

Here is my disk quota:
(current values | soft quota | hard quota) ; soft quota grace period: default
inodes | 210 | - | - |
length | 74GiB | - | - |
size | 74GiB | 466GiB | 559GiB |
realsize | 148GiB | - | - |

I am using a home-made Nextflow pipeline, with STAR alignment as its only process. I didn't know there was an nf-core. If it is more efficient, please let me know how to use it.

Moreover,
my fastq.gz files for the STAR process total 54 GB, so if the problem is what the developer alexdobin describes, I need:
3 * 54 = 162 GB
Which of the disk values above causes the problem, then?
Isn't it the 466 GB that is seen/used by the tool for alignment?
If it's the 148 GiB that I really have, could we increase it?
Otherwise, does anyone know a way to run the alignment within the given space, one file at a time?

Hello,

'realsize' in the output of mfsgetquota is the real space taken on disk. It is not related to any limit; it is purely informational.

So the command output says that you use 74 GiB of the available space in your project, but that this data actually takes 148 GiB on disk.

The limit of 466 GiB does not apply to realsize; it applies to size.

So your available disk space is 466 - 74 = 392 GiB.
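
The same check can be written down as a small script. This is only a sketch: both function names are hypothetical, the 3x factor is the rule of thumb quoted earlier in the thread for STAR, and GB and GiB are treated as interchangeable for a rough estimate.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; all values are whole GB/GiB as quoted in the thread.

# Space left before the soft quota: soft quota minus current "size".
quota_headroom_gib() {
    local soft=$1 used=$2
    echo $(( soft - used ))
}

# Rule of thumb from the thread: STAR needs about 3x the gzipped input size.
star_disk_estimate_gb() {
    local gz_input=$1
    echo $(( 3 * gz_input ))
}

headroom=$(quota_headroom_gib 466 74)   # 392
needed=$(star_disk_estimate_gb 54)      # 162
if (( needed <= headroom )); then
    echo "OK: need ~${needed}, have ${headroom}"
else
    echo "Not enough: need ~${needed}, have ${headroom}"
fi
```

With the numbers above the project space is large enough, which is why the project quota alone does not explain the failure.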

Please note that there is also a quota limit on your home folder.

You can run the same command to see it:

mfsgetquota -H /shared/home/<your login>

And it looks like you have reached the limit allowed on it.

So I think the issue is that your workflow script writes data to your home folder instead of your project folder.

You should check where the outputs are written and use your project space to store your data.

Thank you, I reran in the project space, with only
--mem-per-cpu 40GB, and it worked!
All 5 FASTQ files were aligned in 46 min!

Hello, I'd like some more details about disk space, so that I can check in advance whether I have enough space to run a new alignment pipeline.
I think I have the same problem with a new dataset, which I ran in

/shared/projects/mbnl_dct/mouse_seq/

$ mfsgetquota -h ./
./: (current values | soft quota | hard quota)
 inodes   |  1.4Ki |            - |      - |
 length   | 167GiB |            - |      - |
 size     | 167GiB |            - |      - |
 realsize | 335GiB |            - |      - |

I've been trying to run Nextflow with only the STAR process again, on only 2 files, of 16 GB and 15 GB respectively (4 paired-end FASTQ files, 2 samples).
The command is exactly the same as before: --runThreadN 8, and

#SBATCH --mem-per-cpu 40GB

The nextflow.log loops over the 2 alignment tasks saying "RUNNING" for over 1h30 (normally each file should be finished in 40 min).
The STAR log file shows that the alignment and bin sorting finish, but no .bam file is ever written.
What might be the reason the output is not written?
(job IDs: 13609072, 13609073)

You have to check the quota on your project folder:

$ mfsgetquota -h /shared/projects/mbnl_dct/
/shared/projects/mbnl_dct/: (current values | soft quota | hard quota) ; soft quota grace period: default
 inodes   |  1.6Ki |      - |      - |
 length   | 185GiB |      - |      - |
 size     | 186GiB | 466GiB | 559GiB |
 realsize | 371GiB |      - |      - |

So your current quota is around 500 GB. Could you monitor this quota during the run?
We can also extend it if you need.
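
One possible way to follow the quota while a run is in progress is to extract the "size" row from the mfsgetquota output and turn it into a percentage of the soft quota. The sketch below assumes the output layout shown above and that both values are printed in the same unit; `quota_used_percent` is a hypothetical helper, not an existing cluster command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `mfsgetquota -h` output on stdin and print the
# percentage of the soft quota in use, based on the "size" row only.
quota_used_percent() {
    awk -F'|' '$1 ~ /^ *size *$/ {
        used = $2; soft = $3
        gsub(/[^0-9.]/, "", used)    # strip spaces and the unit suffix
        gsub(/[^0-9.]/, "", soft)
        if (soft + 0 > 0) printf "%d\n", used * 100 / soft
    }'
}

# Sample copied from the output above; on the cluster one would pipe the
# command itself:
#   mfsgetquota -h /shared/projects/mbnl_dct/ | quota_used_percent
quota_used_percent <<'EOF'
 inodes   |  1.6Ki |      - |      - |
 length   | 185GiB |      - |      - |
 size     | 186GiB | 466GiB | 559GiB |
 realsize | 371GiB |      - |      - |
EOF
```

Run in a loop with a `sleep` between calls, this shows whether the usage climbs towards the soft quota while STAR is sorting and writing the BAM.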

Hello dear Gildas,
I am running the alignment again for 1 pair of FASTQ files (~17 GB), and the disk usage has reached 368 GB, as shown here, up from 362 GB before the run.

 inodes   |  1.6Ki |      - |      - |
 length   | 184GiB |      - |      - |
 size     | 184GiB | 466GiB | 559GiB |
 realsize | 368GiB |      - |      - |

I asked for --mem 50GB this time, with STAR --runThreadN 8.
When I run top on the node I see:
%CPU = 799.3 for STAR.

Isn't this too much for just one pair of FASTQ files?
Could we increase the disk space, @team.ifbcorecluster?
I'll definitely need it if I run all the samples of the project in parallel: 151 GB of FASTQ x 3 for STAR = 453 GB more!
But I'd also like to understand why, for a single file that is not that large, STAR cannot finish and write the final BAM file.

Thanks for your help !

Sorry, I missed your message :frowning:
Is 1 TB OK?

@gildaslecorguille

Hi Gildas, this was addressed via Extension of disk space - Project "mbnl_dct"

Thanks
Nicole