Problem using AlphaFold

agroppi · March 11, 2024, 3:54

Hello,

Although I have access to the GPU partition, I get errors when I try to run AlphaFold on it.
My script:

#!/bin/bash

#SBATCH -A physicochimstruct3d
#SBATCH -p gpu
#SBATCH --gres=gpu:3g.20gb:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=50G
#SBATCH --job-name=Alphafold_M80K
#SBATCH -o %x.o%j
#SBATCH -t 24:0:0

module load alphafold/2.3.2

run_alphafold.sh \
    --fasta_paths=/shared/ifbstor1/projects/physicochimstruct3d/M80K.fasta \
    --output_dir=/shared/projects/physicochimstruct3d \
    --data_dir=/shared/bank/alphafold2/current \
    --db_preset=full_dbs \
    --model_preset=monomer_ptm \
    --models_to_relax=best \
    --use_gpu_relax=true \
    --max_template_date=2023-07-11 \
    --use_precomputed_msas=false \
    --uniref90_database_path=/shared/bank/alphafold2/current/uniref90/uniref90.fasta \
    --mgnify_database_path=/shared/bank/alphafold2/current/mgnify/mgy_clusters_2022_05.fa \
    --template_mmcif_dir=/shared/bank/alphafold2/current/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/shared/bank/alphafold2/current/pdb_mmcif/obsolete.dat \
    --bfd_database_path=/shared/bank/alphafold2/current/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --pdb70_database_path=/shared/bank/alphafold2/current/pdb70/pdb70 \
    --uniref30_database_path=/shared/bank/alphafold2/current/uniref30/UniRef30_2021_03

My command:

srun -A physicochimstruct3d Alphafold_M80K.sh
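
A note on the command above: as far as I understand Slurm, `#SBATCH` lines are read only by `sbatch`. When a script is launched through `srun`, they are plain shell comments, so only the options given on the `srun` command line (here `-A`) apply. The sketch below lists the directives a script declares, as a quick way to see which requests would be skipped. `list_sbatch_directives` is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print the options a batch script
# declares through #SBATCH lines. These lines are honoured by `sbatch`;
# to a shell started via `srun` they are ordinary comments.
list_sbatch_directives() {
    local script="$1"
    # Keep only lines starting with "#SBATCH" and strip that prefix.
    sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$script"
}

# Example usage (assumes the script from this post is in the current directory):
#   list_sbatch_directives Alphafold_M80K.sh
#   sbatch Alphafold_M80K.sh    # submission that would take the directives into account
```

If this is indeed the cause, submitting with `sbatch` rather than `srun` would be the thing to try first, since the partition, `--gres` and `--mem` requests all live in the `#SBATCH` header.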

The errors I get:

I0306 14:31:00.629313 47397026205376 templates.py:857] Using precomputed obsolete pdbs /shared/bank/alphafold2/current/pdb_mmcif/obsolete.dat.
I0306 14:31:01.879528 47397026205376 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
2024-03-06 14:31:01.880753: W external/org_tensorflow/tensorflow/tsl/platform/default/dso_loader.cc:66] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2024-03-06 14:31:01.880851: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
I0306 14:31:01.882535 47397026205376 xla_bridge.py:353] Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices.
I0306 14:31:01.882835 47397026205376 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host CUDA Interpreter
I0306 14:31:01.883238 47397026205376 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0306 14:31:01.883320 47397026205376 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
W0306 14:31:01.883430 47397026205376 xla_bridge.py:360] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
/shared/ifbstor1/software/miniconda/envs/alphafold-2.3.2/bin/run_alphafold.sh: line 3: 11013 Killed                  python /shared/ifbstor1/software/miniconda/envs/alphafold-2.3.2/bin/run_alphafold.py "$@"
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=38241202.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cpu-node-88: task 0: Out Of Memory
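
The log above shows the step running on `cpu-node-88` with no visible GPU before the out-of-memory kill. A minimal sketch of a check that could be placed in the batch script just before `run_alphafold.sh`, to record where the job landed and whether a GPU is visible. The function name `report_gpu_visibility` is hypothetical; `SLURM_JOB_PARTITION` and `CUDA_VISIBLE_DEVICES` are the usual Slurm and CUDA environment variables and may be unset depending on the cluster configuration.

```shell
#!/bin/bash
# Hypothetical diagnostic helper: print the node, the partition and the GPU
# visibility of the current job step.
report_gpu_visibility() {
    echo "host=$(uname -n)"
    # Set by Slurm inside a job; "unset" when run outside of one.
    echo "partition=${SLURM_JOB_PARTITION:-unset}"
    # Usually set by Slurm when a GPU was allocated through --gres.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # List the GPUs the driver can see on this node.
        nvidia-smi -L || echo "nvidia-smi failed: no usable GPU on this node"
    else
        echo "nvidia-smi not found: probably not a GPU node"
    fi
}

# Example usage inside the batch script, before run_alphafold.sh:
#   report_gpu_visibility
```

On a correctly allocated GPU job, one would expect the partition line to read `gpu` and `nvidia-smi -L` to list a device.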

Thank you for your help.

Best regards,

Alexis Groppi