Nextflow pipeline with STAR FAILED without evident error

maria_myologie · October 20, 2020, 12:53

Hello dear Cluster Team,
I'd like your help tracing the problem with my last series of jobs, which failed without any specific error message in the log files of either "STAR" or "nextflow".
The jobs in question are:

jobIDs = { 13466845 - 51}, -u mkondili, on 17.October.

My doubts are mainly about the memory requested and used by the tool.
STAR, which I use to align the fastq files, requires 40 GB of memory, which I declare in the process that calls it.
In the sbatch script, I then declare the following:

#SBATCH -A mbnl_dct
#SBATCH --mem 40GB
#SBATCH -n 1
#SBATCH -c 8

Nextflow itself, on the other hand, needs at most 10 GB to launch and schedule the jobs on the cores,
but when I tried with
#SBATCH --mem 10GB the jobs failed anyway.

So I have some questions too.
Is there a default limit on the cluster for very "demanding" tools that my script exceeds?
Or is there an error in the way I run sbatch/nextflow?

I am not sure how many cores to ask for with 40 GB. Should the following relation hold:

cores x mem/core = total Sbatch --mem ?

Does the sbatch --mem parameter refer to the memory used by nextflow, or to the total memory used by all the processes running in the main pipeline?
I should also learn how to measure the memory each tool consumes, to optimise my scripts.
If there are any tools/commands on the cluster for that, please share.

Thanks in advance,
I hope I have explained enough to help you understand.
Let me know if you have further questions.

julien · October 20, 2020, 1:08

Hello @maria_myologie,

The --mem parameter of sbatch indicates the total amount of RAM reserved for the job.
If you wish to have 40GB per CPU, you have to use the --mem-per-cpu parameter.
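
To make the difference between the two options concrete, here is a small sketch of the arithmetic. It assumes a single-node job where Slurm allocates exactly the CPUs requested with -c; the function names are made up for illustration.

```shell
# Total memory (GB) reserved on the node when --mem-per-cpu is used:
# allocated CPUs x memory per CPU.
total_mem_gb() {
    local cpus=$1 mem_per_cpu_gb=$2
    echo $(( cpus * mem_per_cpu_gb ))
}

# Per-CPU value that keeps the whole job at a given total (integer GB).
mem_per_cpu_gb() {
    local total_gb=$1 cpus=$2
    echo $(( total_gb / cpus ))
}

total_mem_gb 8 40     # -c 8 with --mem-per-cpu 40GB reserves 320 GB
mem_per_cpu_gb 40 8   # 40 GB spread over 8 CPUs is 5 GB per CPU
```

In other words, if STAR needs 40 GB in total for its 8 threads, --mem 40GB already expresses that; --mem-per-cpu 40GB with -c 8 asks for eight times as much.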

Do you have a file called slurm-13466845.err or slurm-13466845.out in your working directory? It could contain relevant information about why your job is failing.
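
On the question of how much memory a tool really consumes: Slurm's accounting command sacct can report the peak resident memory of a finished job in its MaxRSS field. The sketch below assumes accounting is enabled on the cluster and that values come back with a K, M or G suffix; the helper name to_gib is made up for illustration.

```shell
# Convert a Slurm memory value with a K/M/G suffix (e.g. "41943040K") to GiB.
to_gib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = substr(v, 1, length(v) - 1) + 0
        if (unit == "K") num /= 1024 * 1024
        else if (unit == "M") num /= 1024
        else if (unit != "G") { print "unknown unit: " unit > "/dev/stderr"; exit 1 }
        printf "%.1f\n", num
    }'
}

# Only query the accounting database where sacct is actually installed.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 13466845 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State
fi

to_gib 41943040K   # prints 40.0
```

Comparing MaxRSS with ReqMem for a job that completed shows how close the request was to the real need.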

maria_myologie · October 20, 2020, 1:21

Hello Julien !

The slurm.out file contains this multiple times, but no final error message :

executor > slurm (6)
[52/560488] process > STAR_Alignment (5) [ 0%] 0 of 6

The .err file was actually empty!

I am confused about how to run my pipeline, because in the nextflow script I also define the resources I need, and the processes run file by file, so each file can be assigned its own cores.

process STAR_Alignment {
cpus "${params.cpus}"
memory "40G"
module "star/2.7.5a:perl/5.26.2"
}

params.cpus = 8 is passed each time for one process, and matches SBATCH -c 8.
Does this seem correct?

gildaslecorguille · October 20, 2020, 3:47

Ping @team.workflow (sorry to the Snakemake folks for the false alarm)

gildaslecorguille · October 20, 2020, 3:49

@maria_myologie
What do you think about opening this thread to allow more contributors?

maria_myologie · October 20, 2020, 3:59

Hello,
sure, I don't mind.
It's just that I re-ran it and this time I got an error related to the tool STAR and not to the cluster parameters or pipeline design, so it might be off topic! Should I post the error this time?

julien · October 20, 2020, 4:13

Yes please

maria_myologie · October 20, 2020, 4:21

In slurm.out :

> Command error:
> 
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: Expected bin size=538502234 ; size on disk=120863510 ; bin number=47

There is a discussion with the developer of STAR here:

github.com/alexdobin/STAR
Causing space error while aligning RNA-seq reads using STAR
(opened 09:31AM - 07 Jun 18 UTC, closed 07:45PM - 29 Aug 19 UTC, by ammarsabircheema)

> Hi Alex, I am trying to align RNA-seq reads using STAR through following commands:
>
>   STAR --runMode alignReads --runThreadN 26 --outSAMtype BAM SortedByCoordinate Unsorted --sjdbOverhang 100 --genomeDir wheat --readFilesIn G3_cleaned_R1.fastq G3_cleaned_R2.fastq --sjdbGTFtagExonParentTranscript wheat.gff3
>   Jun 05 13:53:59 ..... started STAR run
>   Jun 05 13:53:59 ..... loading genome
>   Jun 05 14:07:12 ..... started mapping
>   Jun 05 18:42:47 ..... started sorting BAM
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 509186583 0 45
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 648593301 0 47
>   Jun 05 18:42:47 ...... FATAL ERROR, exiting
>   Jun 05 18:42:47 ...... FATAL ERROR, exiting
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 660412146 0 46
>   Jun 05 18:42:47 ...... FATAL ERROR, exiting
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 470931698 0 44
>   Jun 05 18:42:47 ...... FATAL ERROR, exiting
>   EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: 653155717 0 43
>   Jun 05 18:42:47 ...... FATAL ERROR, exiting
>   Segmentation fault (core dumped)
>
> Although there is much free space on my system as shown below:
>
>   > df -Th
>   Filesystem  Type      Size  Used  Avail  Use%  Mounted on
>   devtmpfs    devtmpfs  63G   0     63G    0%    /dev
>   tmpfs       tmpfs     63G   136K  63G    1%    /dev/shm
>   tmpfs       tmpfs     63G   1.9M  63G    1%    /run
>   tmpfs       tmpfs     63G   0     63G    0%    /sys/fs/cgroup
>   /dev/sda3   ext4      3.6T  1.4T  2.2T   39%   /
>   /dev/sda2   ext4      9.8G  57M   9.2G   1%    /boot
>   tmpfs       tmpfs     13G   0     13G    0%    /run/user/480
>   tmpfs       tmpfs     13G   16K   13G    1%    /run/user/1000
>
> As far as the version of STAR is concerned its latest:
>
>   STAR --version
>   STAR_2.6.0b
>
> How this error can be removed?

The disk space needed for each file being processed is 3 × its gzipped size!

I ran with STAR/2.7.5a.
The fastq files range from 3 to 9 GB in .gz.

In STAR command I use :
> --runThreadN 8

in sbatch.sh script :

--mem-per-cpu 40GB
-c 8

Thanks for any suggestion !

cnoirot · October 21, 2020, 7:25

The error does indeed seem to be due to disk space. You can check your disk space with the following command:
mfsgetquota -h /shared/projects/<your project>/

Which workflow are you using? nf-core or a homemade one?

maria_myologie · October 21, 2020, 9:33

Here is my disk quota:
(current values | soft quota | hard quota) ; soft quota grace period: default
 inodes   |    210 |      - |      - |
 length   |  74GiB |      - |      - |
 size     |  74GiB | 466GiB | 559GiB |
 realsize | 148GiB |      - |      - |

I am using a homemade nf pipeline, with STAR alignment as its only process. I didn't know there was an nf-core. If it is more efficient, please let me know how to use it.

maria_myologie · October 23, 2020, 8:51

Moreover,
my fastq.gz files for the STAR process total 54 GB, so if the problem is what the developer alexdobin describes, I need:
3 * 54 = 162 GB
Which of the disk values above causes the problem, then?
Isn't it the 466 GB that the alignment tool sees/uses?
If it's the 148GiB that I really have, could we increase it?
Otherwise, does anyone know a way to make the alignment fit in the given space, one file at a time?

Francois · October 26, 2020, 8:55

Hello,

'realsize' in the output of mfsgetquota is the real space taken on the disk. It is not related to any limit; it is purely informational.

So the command output in fact says that you use 74 GiB of the available space in your project, but that it actually takes 148 GiB on the disk.

The limit of 466 GiB does not apply to the realsize; it applies to the size.

So your available disk space is 466-74 = 392 GiB
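
The two calculations in this thread (3 × the gzipped fastq size for STAR, and soft quota minus current size) can be combined into a quick pre-flight check. This is only a sketch: the function names are made up, the 3× factor is the rule of thumb quoted above rather than a guarantee, and it treats GB and GiB as interchangeable for a rough estimate.

```shell
# Estimated disk space (GB) STAR needs: 3 x the gzipped FASTQ size.
star_disk_needed_gb() {
    echo $(( 3 * $1 ))
}

# Free space (GiB) under the quota: soft quota minus current "size".
quota_free_gib() {
    local soft=$1 used=$2
    echo $(( soft - used ))
}

needed=$(star_disk_needed_gb 54)    # 162
free=$(quota_free_gib 466 74)       # 392
if [ "$needed" -le "$free" ]; then
    echo "enough space: need ${needed}, have ${free}"
else
    echo "not enough space: need ${needed}, have ${free}"
fi
```

With these numbers the project space is large enough, which suggests the shortage is somewhere other than the project quota.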

Francois · October 26, 2020, 9:03

Please note that there is also a quota limit on your home folder.

You can run the same command to see it:

mfsgetquota -H /shared/home/<your login>

And it looks like you have reached the limit allowed on it.

So I think the issue is that your workflow script writes data in your home folder instead of your project folder.

You should check where the outputs are written and use your project space to store your data.
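
A simple guard in the launch script can catch this before the job starts. The sketch below only checks the launch directory; the helper name under_home, the script name main.nf and the work path in the final comment are hypothetical, while -w is Nextflow's option for setting the work directory.

```shell
# Succeeds when the given directory is the home directory or inside it.
under_home() {
    case "$1" in
        "$HOME"|"$HOME"/*) return 0 ;;
        *) return 1 ;;
    esac
}

if under_home "$PWD"; then
    echo "warning: launching from home; outputs will count against the home quota"
fi

# Hypothetical launch with the work directory forced into project space:
# nextflow run main.nf -w /shared/projects/<your project>/work
```

Nextflow creates its work directory under the launch directory by default, so starting the pipeline from the project folder, or passing -w explicitly, keeps the intermediate files out of home.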

maria_myologie · October 26, 2020, 10:59

Thank you, I re-ran in the project space, with only
--mem-per-cpu 40GB, and it worked!
I have all 5 fq aligned in 46 min!

maria_myologie · November 5, 2020, 5:34

Hello, I'd like some more details about disk space, so that I can check in advance whether I have enough space to run a new alignment pipeline.
I think I have the same problem now with a new data set, which I ran in

/shared/projects/mbnl_dct/mouse_seq/

$ mfsgetquota -h ./: 
(current values    |   soft quota | hard quota) 
 inodes   |  1.4Ki |            - |      - |
 length   | 167GiB |            - |      - |
 size     | 167GiB |            - |      - |
 realsize | 335GiB |            - |      - |

I've been trying to run nextflow with only the STAR process again, on only 2 files, of 16 GB and 15 GB respectively (4 paired-end fq, 2 samples).
The command is exactly the same as before: --runThreadN 8, and

#SBATCH --mem-per-cpu 40GB

The nextflow.log loops over the 2 alignment tasks saying "RUNNING" for over 1h30! (Normally each file should be finished in 40 min.)
The STAR log file shows the alignment and bin sorting finishing, but no .bam file is ever written.
What might be the reason the output is not written?
(job IDs : 13609072, 13609073)

gildaslecorguille · November 5, 2020, 8:35

You have to check the quota at your project folder:

$ mfsgetquota -h /shared/projects/mbnl_dct/
/shared/projects/mbnl_dct/: (current values | soft quota | hard quota) ; soft quota grace period: default
 inodes   |  1.6Ki |      - |      - |
 length   | 185GiB |      - |      - |
 size     | 186GiB | 466GiB | 559GiB |
 realsize | 371GiB |      - |      - |

So your current quota is around 500GB. Can you monitor this quota during the run?
We can also extend it if you need.
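
One way to follow the quota during a run is to pull just the "size" row out of the mfsgetquota output. This is a sketch that assumes the output keeps the pipe-separated layout shown above; the function name quota_size is made up.

```shell
# Print "<current size> <soft quota>" from mfsgetquota output read on stdin.
# Matches the "size" row only, not "realsize".
quota_size() {
    awk -F'|' '$1 ~ /^[ \t]*size[ \t]*$/ {
        gsub(/[ \t]/, "", $2); gsub(/[ \t]/, "", $3)
        print $2, $3
    }'
}

# Sample rows taken from the output above; on the cluster, pipe the real command:
#   mfsgetquota -h /shared/projects/mbnl_dct/ | quota_size
printf '%s\n' \
    ' length   | 185GiB |      - |      - |' \
    ' size     | 186GiB | 466GiB | 559GiB |' \
    ' realsize | 371GiB |      - |      - |' | quota_size
```

Running the mfsgetquota command under watch -n 60 in a second terminal is another option if the full table is preferred.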

maria_myologie · November 9, 2020, 11:19

Hello dear Gildas,
I am running the alignment again for 1 pair of fastq (~17 GB), and the disk usage reaches 368 GB, as shown here, up from 362 GB before running.

 inodes   |  1.6Ki |      - |      - |
 length   | 184GiB |      - |      - |
 size     | 184GiB | 466GiB | 559GiB |
 realsize | 368GiB |      - |      - |

I asked for --mem 50Gb this time and STAR --runThreadN 8.
When I run top on the node I see:
%CPU = 799.3 for STAR.

Isn't this too much for just one pair of fastq?
Could we increase the disk space, @team.ifbcorecluster?
I'll definitely need it if I run all the samples of the project in parallel: 151 GB of fastq x 3 for STAR = 453 GB more!
But I'd like to understand why, for 1 file, it cannot finish and write the final bam file, when the file is not that large!

Thanks for your help !

gildaslecorguille · November 24, 2020, 5:00

Sorry I missed your message
Is it ok with 1TB ?

nc-support · November 24, 2020, 5:13

@gildaslecorguille

Hi Gildas, this was addressed via "Extension of disk space - Project "mbnl_dct"".

Thanks
Nicole