Hi,
For the past few days I have been unable to run my MPI script properly. I am starting with 1 node / 8 tasks (not particularly large resources), and my job fails almost every time. Looking at the error log I get:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: cpu-node-91
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
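To compare what Slurm actually allocates on the node with the 8 ranks I ask mpirun to start, I can drop a quick check like the one below into the job script (just a sketch, using only standard Slurm environment variables and commands):
# Sketch: compare the requested tasks with what Slurm really allocated.
echo "SLURM_NTASKS            = ${SLURM_NTASKS}"
echo "SLURM_CPUS_ON_NODE      = ${SLURM_CPUS_ON_NODE}"
echo "SLURM_JOB_CPUS_PER_NODE = ${SLURM_JOB_CPUS_PER_NODE}"
nproc   # CPUs visible from inside the allocation
scontrol show job "${SLURM_JOB_ID}" | grep -E 'NumNodes|NumCPUs|NumTasks|CPUs/Task'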
Here is my Slurm script:
#!/bin/bash
#SBATCH --job-name=genomatch_job # Job name
#SBATCH --output=output_%j.log # Standard output log (%j is replaced with the jobID)
#SBATCH --error=error_%j.log # Standard error log
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=8 # Total number of tasks (should match the number of tasks per node)
#SBATCH --ntasks-per-node=8 # Number of tasks per node
#SBATCH --cpus-per-task=1 # Number of CPUs per task
#SBATCH --mem=32G # Total memory allocation (adjust as needed)
#SBATCH --time=04:00:00 # Time limit (D-HH:MM:SS, adjust as needed)
#SBATCH --partition=fast # Partition to use
#SBATCH --mail-type=BEGIN,END,FAIL # Notifications for job start, end, and fail
#SBATCH --mail-user=nicolas.mendiboure@ens-lyon.fr # Change this to your email
# Set home directory and data directories
DATADIR=/shared/projects/genomatch/data
INPUTDIR=$DATADIR/inputs
GENOME=$INPUTDIR/S288c-Lys2.fa
SPARSE=$INPUTDIR/AD265-266/AD265_AD266_merged_S288c_DSB_chr3_rDNA_cutsite_q20.txt
FRAGS=$INPUTDIR/AD265-266/fragments_list_S288c_chr3_DpnIIHinfI.txt
CHROM=$INPUTDIR/AD265-266/info_contigs_S288c_chr3_DpnIIHinfI.txt
K=8
source ~/.bashrc
# Activate the conda environment
conda activate genomatch_env
module load openmpi
# Log some useful information
echo "Job started on $(hostname) at $(date)"
echo "Running on ${SLURM_NTASKS} total tasks"
echo "Running ${SLURM_TASKS_PER_NODE} tasks per node"
echo "Allocated memory: ${SLURM_MEM_PER_NODE} MB"
# Run the script
mpirun --bind-to core -np ${SLURM_NTASKS} genomatch kmerize -g $GENOME -s $SPARSE -f $FRAGS -c $CHROM -k $K -b 20kb -F
'genomatch' is the name of my Python module, with its subcommand kmerize (and its arguments).
I tried with and without the "--bind-to" and "--map-by" options; nothing changed.
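My understanding (from reading the Open MPI mpirun man page, not something I have verified on the cluster) is that the override suggested in the error message would be spelled like this, although I assume it only silences the check rather than fixing the CPU/task mismatch:
# Assumed syntax for the "overload-allowed" qualifier the error message mentions
# (an Open MPI binding modifier; untested here, and it probably just allows
# oversubscribing the bound cores rather than fixing the allocation itself).
mpirun --bind-to core:overload-allowed -np ${SLURM_NTASKS} genomatch kmerize -g $GENOME -s $SPARSE -f $FRAGS -c $CHROM -k $K -b 20kb -F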
On my personal computer the mpirun (or mpiexec) command works fine, but as soon as I switch to the IFB cluster and use a Slurm script, it does not work.
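I have also wondered whether launching the ranks through srun instead of mpirun would behave differently on the cluster; a sketch of that variant, assuming the site's Slurm and Open MPI are built with PMIx (or PMI2) support, which I have not checked:
# Untested variant: let Slurm itself place the 8 ranks (picked up from the
# #SBATCH directives, so no explicit -np), assuming PMIx support is available.
srun --mpi=pmix genomatch kmerize -g $GENOME -s $SPARSE -f $FRAGS -c $CHROM -k $K -b 20kb -F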
I am very new to Slurm and HPC in general, so it is quite possible that I am missing something about how to define my Slurm script properly.
Thank you in advance