AlphaFold TensorFlow failed

Hello,
I'm trying to run AlphaFold. At the start of the script I get these messages:

I0420 13:57:32.087283 47661089406656 xla_bridge.py:230] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker:
I0420 13:57:33.208311 47661089406656 xla_bridge.py:230] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.

The script then aborts with these error messages:

2022-04-20 14:13:44.792693: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2022-04-20 14:13:44.792754: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:113] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Fatal Python error: Aborted

Current thread 0x00002b58f64dbec0 (most recent call first):
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/interpreters/xla.py", line 474 in backend_compile
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/interpreters/xla.py", line 863 in compile_or_get_cached
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/interpreters/xla.py", line 921 in from_xla_computation
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/interpreters/xla.py", line 892 in compile
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/interpreters/xla.py", line 759 in _xla_callable_uncached
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/linear_util.py", line 263 in memoized_fun
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/interpreters/xla.py", line 687 in _xla_call_impl
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/core.py", line 627 in process_call
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/core.py", line 1635 in process
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/core.py", line 1623 in call_bind
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/core.py", line 1632 in bind
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/_src/api.py", line 416 in cache_miss
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162 in reraise_with_filtered_traceback
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/alphafold/model/model.py", line 167 in predict
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/bin/run_alphafold.py", line 193 in predict_structure
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/bin/run_alphafold.py", line 403 in main
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/lib/python3.8/site-packages/absl/app.py", line 312 in run
  File "/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/bin/run_alphafold.py", line 427 in <module>
/shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/bin/run_alphafold.sh: line 3: 62559 Aborted                 (core dumped) python /shared/ifbstor1/software/miniconda/envs/alphafold-2.1.1/bin/run_alphafold.py "$@"
srun: error: gpu-node-01: task 0: Exited with exit code 134

The submission script is located at /shared/home/akiselev2/aphanoclust/SSP/AlphaFold/AF_test_042022.sh

The Slurm output is at /shared/home/akiselev2/aphanoclust/SSP/AlphaFold/slurm-22072247.out

@team.alphafold

Thank you in advance
Andrei

Hi,

AFAICT AlphaFold requires the larger GPU profiles (you should try 7g.40gb, which is available on gpu-node-03).
I'm not sure a lack of GPU memory would produce errors this cryptic, but it's worth trying before looking further.

Have a nice day,
J.C. Haessig

Hi, I still get these messages when launching on 7g.40gb:

I0421 17:06:21.900038 47184954543808 xla_bridge.py:230] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker:
I0421 17:06:23.029157 47184954543808 xla_bridge.py:230] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.

The batch script is as follows:



#!/bin/bash

#SBATCH -p gpu
#SBATCH --gres=gpu:7g.40gb
#SBATCH --cpus-per-task=10
#SBATCH --mem=50G

module load alphafold/2.1.1

# Braces are needed: $USER_alphafold would be read as an (unset) variable named USER_alphafold
mkdir -p "/tmp/${USER}_alphafold"

srun run_alphafold.sh --fasta_paths=test.fasta \
    --output_dir=/shared/home/akiselev2/aphanoclust/SSP/AlphaFold \
    --model_preset=monomer \
    --db_preset=reduced_dbs \
    --small_bfd_database_path=/shared/bank/alphafold2/current/small_bfd/bfd-first_non_consensus_sequences.fasta \
    --data_dir=/shared/bank/alphafold2/current \
    --uniref90_database_path=/shared/bank/alphafold2/current/uniref90/uniref90.fasta \
    --mgnify_database_path=/shared/bank/alphafold2/current/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path=/shared/bank/alphafold2/current/pdb70/pdb70 \
    --template_mmcif_dir=/shared/bank/alphafold2/current/pdb_mmcif/mmcif_files \
    --max_template_date=2020-05-14 \
    --obsolete_pdbs_path=/shared/bank/alphafold2/current/pdb_mmcif/obsolete.dat

Hi, despite the message "Unable to initialize backend 'tpu_driver'", the test run finished successfully.

Hi,
Yes, I don't remember seeing this specific message during my tests, but there were a number of 'normal errors' that didn't prevent the program from completing. One must remember that AlphaFold is still largely beta software.

Have a nice weekend,
J.C. Haessig
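
A note on telling the 'normal errors' from the real ones: the two TPU lines above start with `I`, which in this log format marks INFO severity, while the lines that came just before the abort carry `E` (error) and `F` (fatal). Below is a minimal sketch for pulling only the `E`/`F` lines out of a Slurm output file. The function name `severe_lines` is made up for illustration, and the pattern only covers the two prefix styles that appear in this thread.

```shell
# Print only ERROR (E) and FATAL (F) lines from a log read on stdin.
# Assumes the two prefix styles seen in this thread:
#   "I0420 13:57:32.087283 ..."                    severity letter, then MMDD
#   "2022-04-20 14:13:44.792693: E external/..."   timestamp, then severity letter
severe_lines() {
    grep -E '^[EF][0-9]{4} |^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:.]+: [EF] '
}

# Usage (file name taken from the first post):
#   severe_lines < slurm-22072247.out
```

If this prints nothing, the run only logged informational or warning messages.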

Thank you for your answer.

I tested different configurations. It works on 7g.40gb but not on 1g.5gb.
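
Since the outcome depends on the GPU profile, it can help to print which profile the job actually received before AlphaFold starts. This is a rough sketch, assuming `nvidia-smi` is available on the GPU node and that `nvidia-smi -L` lists MIG devices on lines of the form `MIG 7g.40gb Device 0: (UUID: ...)`. The helper name `mig_profiles` is made up for illustration.

```shell
# Extract MIG profile names (e.g. 1g.5gb, 7g.40gb) from `nvidia-smi -L` output on stdin.
# Assumption: MIG lines look like "  MIG 7g.40gb Device 0: (UUID: ...)".
mig_profiles() {
    awk '$1 == "MIG" { print $2 }'
}

# In the batch script, before the srun line. Skipped quietly if nvidia-smi is missing.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L | mig_profiles
fi
```

If the printed profile is 1g.5gb, the job landed on the small slice that failed in the tests above.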