AlphaPullDown not running on GPU

matthieu_haudiquet · October 19, 2023, 2:48

Hello,

I'm trying to run AlphaPullDown ( GitHub - KosinskiLab/AlphaPulldown ) on the cluster. I installed it with conda in a separate environment, following the documentation on GitHub.

I think TensorFlow can't find the GPU when it launches AlphaFold:

#SBATCH --job-name=array
#SBATCH --time=2-00:00:00

#log files:
#SBATCH -e logs/run_multimer_jobs_%A_%a_err.txt
#SBATCH -o logs/run_multimer_jobs_%A_%a_out.txt


#SBATCH -p gpu

#SBATCH --gres=gpu:3g.20gb:1

#Adjust this depending on the node
#SBATCH --ntasks=1
#SBATCH --mem=100000

module load conda
source activate /shared/projects/mdm_db_computations/Matthieu/AlphaPulldown
module load alphafold/2.3.2

run_multimer_jobs.py --mode=pulldown \
    --num_cycle=3 \
    --num_predictions_per_model=1 \
    --output_path=/shared/projects/mdm_db_computations/Matthieu/Pulldown/results \
    --data_dir=/shared/bank/alphafold2/current/ \
    --protein_lists=eMLV.txt,HSV1.txt \
    --monomer_objects_dir=/shared/projects/mdm_db_computations/Matthieu/HSV1/HSV1 \
    --job_index=$SLURM_ARRAY_TASK_ID

I tried adding module load alphafold/2.3.2, but the following warnings still appear:

2023-10-19 13:30:19.041668: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2023-10-19 13:30:19.042669: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I1019 13:30:19.042827 47045410463424 utils.py:214] checking if output_dir exists /shared/projects/mdm_db_computations/Matthieu/Pulldown/results
I1019 13:30:20.055875 47045410463424 run_multimer_jobs.py:185] done creating multimer eMLV_Env_and_A0A0B5EB06-A0A0B5EB06_HHV1_Ribonucleoside-diphosphate_reductase_large_subunit_OS_Human_herpesvirus_1_OX_10298_GN_UL39_PE_3_SV_1
I1019 13:30:20.057469 47045410463424 run_multimer_jobs.py:304] object: eMLV_Env_and_A0A0B5EB06-A0A0B5EB06_HHV1_Ribonucleoside-diphosphate_reductase_large_subunit_OS_Human_herpesvirus_1_OX_10298_GN_UL39_PE_3_SV_1
I1019 13:30:20.475626 47045410463424 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I1019 13:30:20.565883 47045410463424 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host CUDA Interpreter
I1019 13:30:20.566185 47045410463424 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I1019 13:30:20.566252 47045410463424 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
I1019 13:30:24.516504 47045410463424 utils.py:252] Using random seed 168447753001192063 for the data pipeline
I1019 13:30:24.519200 47045410463424 run_multimer_jobs.py:273] now running prediction on eMLV_Env_and_A0A0B5EB06-A0A0B5EB06_HHV1_Ribonucleoside-diphosphate_reductase_large_subunit_OS_Human_herpesvirus_1_OX_10298_GN_UL39_PE_3_SV_1
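
The first warning says the dynamic loader could not open libcudnn.so.8. One way to narrow this down is to look for that file in the directories listed in LD_LIBRARY_PATH from inside the job. The sketch below is only an illustration: the function name find_library is hypothetical, and it inspects LD_LIBRARY_PATH alone, whereas the real loader also consults the system cache and default directories, so an empty result is a hint rather than proof.

```python
import os


def find_library(name, search_path=None):
    """Return every path where `name` exists in a colon-separated search path.

    Assumption: only LD_LIBRARY_PATH is inspected; ld.so.cache and the
    default system directories are not, so this is a partial check.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits
```

Calling find_library("libcudnn.so.8") after the module load and source activate lines of the job script would show whether either of them put a cuDNN directory on the path.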

Could you help me check whether that is the case? For example, for job 35632983_1

Regards,
Matthieu

dbenaben · October 20, 2023, 7:53

Hello Matthieu,

I don't know of a command to check GPU usage after the fact (once the job has finished).
However, while the job is running, you can run the nvidia-smi command on the GPU node (see https://ifb-elixirfr.gitlab.io/cluster/doc/troubleshooting/#gpu).

For example, if your job is running on gpu-node-03, you can view GPU usage with:

ssh gpu-node-03 nvidia-smi

You will then be able to see whether your process is actually using the GPU.
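
If you want to check this repeatedly while the job runs, the same query can be scripted. The sketch below is a hedged illustration, not a cluster-provided tool: the function names parse_compute_apps and gpu_processes are hypothetical, and it assumes you can ssh to the node as in the command above. The nvidia-smi options used (--query-compute-apps and --format=csv,noheader) are standard ones.

```python
import subprocess

# Standard nvidia-smi options: list compute processes as header-less CSV.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV lines into (pid, process_name, used_memory) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first and last comma so a comma inside the
        # process name does not break the parsing.
        pid, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        rows.append((int(pid), name.strip(), memory.strip()))
    return rows


def gpu_processes(node):
    """Run nvidia-smi on `node` over ssh and return the parsed process list."""
    result = subprocess.run(
        ["ssh", node] + QUERY, capture_output=True, text=True, check=True
    )
    return parse_compute_apps(result.stdout)
```

For instance, gpu_processes("gpu-node-03") returns the processes currently holding GPU memory on that node; if the list stays empty, or never contains your job's PID, the job is most likely running on CPU only.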

Let us know if that works for you.

matthieu_haudiquet · October 25, 2023, 7:15

Hello, thank you for your help. I was able to check, and indeed it was not working.
I installed AlphaPullDown with pip outside of a conda environment, and it runs once I do module load alphafold/2.3.2, so the problem is solved.
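
For anyone hitting the same warnings, a check from inside the Python environment can complement nvidia-smi. The sketch below is a minimal, hedged example: the function name visible_accelerators is hypothetical, while tf.config.list_physical_devices and jax.devices are the frameworks' own calls. AlphaFold runs its model through JAX, so the JAX entry is probably the more telling of the two.

```python
def visible_accelerators():
    """Report the non-CPU devices each framework can see.

    A value of None means the framework could not be imported or queried
    (not installed, or a broken CUDA setup raised during discovery).
    """
    report = {}

    try:
        import tensorflow as tf
        report["tensorflow"] = [
            d.name for d in tf.config.list_physical_devices("GPU")
        ]
    except Exception:  # deliberately broad: this is a diagnostic helper
        report["tensorflow"] = None

    try:
        import jax
        report["jax"] = [str(d) for d in jax.devices() if d.platform != "cpu"]
    except Exception:  # deliberately broad, same reason as above
        report["jax"] = None

    return report
```

Printing visible_accelerators() at the start of a job, after the module load line, gives an empty list for a framework that is installed but sees no GPU.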
Have a good evening

dbenaben · October 26, 2023, 7:28

Perfect. Thanks for the feedback!