Need Help with Bioinformatics Pipeline Optimization

Hello all,

I'm currently tuning a bioinformatics pipeline for processing large genomic datasets, and I could use some guidance from experts. The pipeline currently consists of quality control (FastQC), read trimming (Trimmomatic), alignment (BWA), and variant calling (GATK). However, I'm seeing inefficiencies in runtime and resource consumption, particularly with large datasets.
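For reference, a single pass over one paired-end sample currently looks roughly like this (simplified; all file names, thread counts, and trimming parameters below are placeholders, not my actual settings):

```bash
# QC the raw reads
mkdir -p qc
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc/

# Trim adapters / low-quality tails
trimmomatic PE -threads 8 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    trimmed_R1.fastq.gz unpaired_R1.fastq.gz \
    trimmed_R2.fastq.gz unpaired_R2.fastq.gz \
    SLIDINGWINDOW:4:20 MINLEN:36

# Align and sort. ref.fa is indexed beforehand (bwa index, samtools faidx,
# gatk CreateSequenceDictionary); the read group is required by GATK.
bwa mem -t 8 -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA' ref.fa \
    trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
    | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam

# Call variants
gatk HaplotypeCaller -R ref.fa -I sample.sorted.bam -O sample.vcf.gz
```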

Some questions:

Are there any best-practice guidelines for resource optimization (CPU/RAM) in an HPC setup for such pipelines?

Would a transition to Snakemake or Nextflow make a substantial difference in efficiency over a typical shell-scripted pipeline?

Any recommended DevOps tutorials focused on scientific computing?

Any tips, tool suggestions, or shared experiences would be most welcome! Thanks in advance.

Looking forward to feedback from this great community.

Best regards
jenniferellino

Hello,

> However, I'm seeing inefficiencies in runtime and resource consumption, particularly with large datasets.

First, you should check your actual resource usage.

Quick answer (on a Slurm cluster):

```bash
module load reportseff   # if your site provides it as an environment module
reportseff <jobid>       # reports CPU and memory efficiency of a finished job
```

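Once you know what each step actually uses, you can right-size your submission scripts. A hypothetical example (the numbers are placeholders; use your measured peaks plus a safety margin):

```bash
#!/bin/bash
#SBATCH --job-name=bwa_align
#SBATCH --cpus-per-task=8   # match the tool's -t/-threads setting
#SBATCH --mem=16G           # measured peak memory + ~20% margin
#SBATCH --time=04:00:00     # measured runtime + margin

bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa R1.fastq.gz R2.fastq.gz > sample.sam
```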
> Are there any best-practice guidelines for resource optimization (CPU/RAM) in an HPC setup for such pipelines?

My main recommendation is to use "small" jobs: one job per step and one job per dataset (or dataset partition). Smaller requests usually spend less time waiting in the queue.
And, again, check your actual resource usage so each job requests only what it needs; a minimal sketch follows below.
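The usual tool for the "one job per dataset" pattern is a Slurm job array. A minimal sketch, assuming a samples.txt file with one sample name per line (all names here are hypothetical):

```bash
#!/bin/bash
#SBATCH --array=1-96        # one array task per sample
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Pick this task's sample from the list.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

mkdir -p qc
fastqc "${SAMPLE}_R1.fastq.gz" "${SAMPLE}_R2.fastq.gz" -o qc/
```

Each array task is queued and accounted for independently, so short tasks can start as soon as small slots free up.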

> Would a transition to Snakemake or Nextflow make a substantial difference in efficiency over a typical shell-scripted pipeline?

In my opinion, there wouldn't be a substantial difference in raw compute efficiency.
However, Snakemake and Nextflow can resume a run and relaunch only the failed parts, which ultimately saves a lot of time. They also make workflows scalable, portable, and reproducible.
So the switch is highly recommended.
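For instance, resuming after a failure is built in to both tools; these are standard invocations (the workflow file names are placeholders):

```bash
# Nextflow: -resume reuses cached results from the previous run
nextflow run main.nf -resume

# Snakemake: only missing or out-of-date targets are re-executed;
# --rerun-incomplete also redoes jobs that were interrupted mid-write
snakemake --cores 16 --rerun-incomplete
```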

> Any recommended DevOps tutorials focused on scientific computing?

From my point of view, DevOps is primarily about continuous integration and deployment (CI/CD).
But version control, workflow managers, and containerization can do a lot for scientific computing by improving reproducibility, sharing, and traceability.
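As a small illustration of the containerization point, running a pipeline tool from a versioned BioContainers image with Apptainer/Singularity pins the exact software version across machines (the image tag below is an example; check the registry for current tags):

```bash
# Pull a fixed, versioned FastQC image once...
apptainer pull fastqc.sif docker://quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0

# ...then every run uses exactly the same binary, on any machine.
apptainer exec fastqc.sif fastqc sample_R1.fastq.gz -o qc/
```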
Some resources: