Abnormally long jobs

maeva_gabrielli · December 4, 2019, 12:29

Hello,

Since yesterday evening I have noticed that the cluster seems to be running 'in slow motion', at least from my session (mgabrielli, project zobo). Jobs that normally run in 30 minutes needed 5 hours to finish. In the end everything is normal, with no error message; it just takes much longer than expected! Running exactly the same jobs as yesterday morning, today they take about 1 hour, whereas yesterday they finished in 7 minutes. Likewise, all the basic commands (cp, mv, wc -l, writing a file) are slower than usual. Would you have any idea what the problem might be, please?

Thanks in advance for your help

Maëva

emorice · December 5, 2019, 1:45

Hello,
I am not an IFB Cluster admin, but I've experienced slow-downs that seem related, and I may have ideas about the bottleneck you are facing. Do your jobs, taken together, involve creating, writing, reading, or deleting many small files? Could you give an estimate of the number of files you may be operating on at the same time (or around the same time, say, in a one-minute window), both per SLURM job and globally with the maximum number of parallel jobs you are running? And the typical size of these files?
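One way to put numbers on these questions is a small script run on the working directory. The following is a minimal Python sketch, not an IFB-provided tool: the function name `summarize_tree` and the default 60-second window are assumptions chosen for illustration.

```python
import os
import time


def summarize_tree(root, window_seconds=60.0, now=None):
    """Count files (non-directory entries) under root, their total size,
    and how many were modified within the last window_seconds."""
    now = time.time() if now is None else now
    n_files = 0
    total_bytes = 0
    n_recent = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished mid-scan, e.g. removed by a running job
            n_files += 1
            total_bytes += st.st_size
            if now - st.st_mtime <= window_seconds:
                n_recent += 1
    mean_bytes = total_bytes / n_files if n_files else 0.0
    return {
        "n_files": n_files,
        "total_bytes": total_bytes,
        "mean_bytes": mean_bytes,
        "n_recent": n_recent,
    }
```

Calling `summarize_tree(".")` from a job's output directory while jobs are running would give a rough per-directory figure. Walking a tree of several hundred thousand files generates metadata traffic of its own, so it is better run once than in a loop.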
Best regards,
Étienne

maeva_gabrielli · December 5, 2019, 5:01

Hi Etienne,
Thank you very much for your answer and your help. Indeed, I think my jobs do involve operating on many different files at the same time. For instance, a 7-minute job executes a Python script 200 times, each time reading one file and producing several smaller files. I am running 6 of these 7-minute jobs in parallel, so that 1,100 files are read and 20,000 files are created, and each of those new files is 2.4 MB. But exactly the same kind of job used to be fast until Tuesday night. So do you think the problem could be that there are now too many files in my account, which slows down the writing of new files? Currently, there are around 400,000 files in my session. And could this also explain why simple commands such as wc -l are slower?
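To check whether file creation itself has become slower, independently of the jobs, one could time a burst of small writes in the affected directory and compare the figure from day to day. The sketch below is only an illustrative probe, not a diagnostic endorsed by the cluster admins: the name `time_small_writes`, the file count, and the payload size are all hypothetical choices.

```python
import os
import tempfile
import time


def time_small_writes(directory, n_files=200, payload_bytes=4096):
    """Write n_files small files under directory and return the mean
    seconds per file. Cleanup of the probe files is not timed."""
    payload = b"x" * payload_bytes
    # Work in a private subdirectory so that no existing file is touched.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        start = time.perf_counter()
        for i in range(n_files):
            path = os.path.join(tmp, f"probe_{i}.bin")
            with open(path, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())  # push the data through to storage
        elapsed = time.perf_counter() - start
    return elapsed / n_files
```

Running it once on the shared storage and once on a local disk such as the system temporary directory would hint at whether the slowdown is specific to the shared filesystem.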
Best,
Maëva

camille_roux · December 6, 2019, 1:07

Hi,
I am currently experiencing the same issue as Maëva! Reading and writing files suddenly became slower.
I don't know what we can do.
Cheers,

Camille

dbenaben · December 6, 2019, 4:03

Hello everyone,

Sorry for the late reply.

We also think that the slowness comes from the creation of several million files and from I/O-intensive operations.

We are continuing to investigate and tune the storage to avoid this issue.

In the mid term, we also plan to change the storage infrastructure (with a new solution and/or by creating a "scratch" storage).
This project has already started, but it takes some time to purchase and deploy the solution.

Right now, we don't see any significant slowness.
Please tell us if you run into slowness issues again.

Best regards,

jvanhelden · December 14, 2019, 3:14

Yesterday afternoon I noticed that the RStudio server and the ssh connection were both very slow. It takes tens of seconds to launch tiny tasks (git pull with very small changes, displaying the welcome screen, ...).

@julien pointed out to me that this might be related to this discussion, so I am linking the two.

gildaslecorguille · December 16, 2019, 4:25

Hello everyone,

Here is an official answer from the team about this problem: Problème de performance sur l'IFB Core Cluster ("Performance problem on the IFB Core Cluster")

Thank you for your patience, and sorry for the inconvenience