Limitation on downloads

Hello,
I am trying to re-analyze a Smart-seq2 experiment containing 20k cells and thus 20k FASTQ files. My goal is not to keep the FASTQ files but essentially just to trim them and quantify them with kallisto (the whole thing is driven and controlled with Snakemake). I limited the number of parallel downloads to ~10-15 so as not to spam ENA too much, and at the beginning it worked well (I'm not in much of a hurry...). For the past few hours, the Snakemake download rules have become very slow (in fact they get stuck and only start after several minutes). When I test a download without going through Slurm (and thus the compute nodes), it is almost instantaneous. When I test a wget with the srun prefix, it hangs for several minutes before completing... My question is whether there is a limitation on download bandwidth on the Slurm side, or something else that I don't understand... Thank you for your help.
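
To give an idea, the download part of the workflow looks roughly like this (a simplified sketch; the rule name, the resource name and the config lookup are illustrative, not the exact code):

  rule download_fastq:
      output:
          "fastq/{run}_1.fastq.gz"
      resources:
          ena_conn=1   # each download job consumes one slot
      params:
          # hypothetical mapping from run accession to its ENA URL
          url=lambda wc: config["fastq_urls"][wc.run]
      shell:
          "wget -q -O {output} {params.url}"

Snakemake is then launched with something like snakemake --jobs 100 --resources ena_conn=15, so that at most ~15 download jobs run at the same time.
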
Best

Hello,
Well, sorry, but most likely I have exceeded the download limits and been blacklisted... I didn't really realise that the cumulative number of hits could have this impact; that was careless of me... I think we will have to contact the ftp.sra.ebi.ac.uk administrators; I can do it if you want. I would like to apologise to the IFB cluster admins and to the other users for the inconvenience.
Denis

Sorry, I forgot to provide an example to make it clear...

  srun wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR381/006/ERR3814476/ERR3814476_1.fastq.gz

Hello Denis,

Indeed, there is no limitation on download bandwidth on the Slurm side, but it seems that we have exceeded the download limits on the ENA server (and maybe been blacklisted). It happens, don't worry.

We are looking into this and will come back to you.

OK. Thanks a lot!

Denis, it works again.
I think the IP address was temporarily banned by ENA for something like 24 hours.

Please be careful, especially with multiple jobs, when you request or download data from servers. Data should only be retrieved once and, often, one file at a time.
All sites (servers, companies, etc.) have restrictions to protect themselves and often ban IP addresses when there are too many requests.
Moreover, these servers often see our cluster (compute nodes) or a site (training room, laboratory) as a single address, so a ban can impact everyone.

Best regards

Hi David,
I have limited the downloads to 1 FASTQ at a time. But as I have 20k FASTQ files to download, it will inevitably make lots of connections (their sizes seem to range from 20 MB to ~500 MB). Perhaps I should use ftp/lftp/ncftp to open a single connection, download, say, 2k FASTQ files, and then go on with the analysis. But for this second solution I don't know of any way to resynchronise in case of a failure during the download (a kind of rsync)...
Thank you for helping.
Denis

Hi Denis,

Indeed, it's not straightforward.

I should use ftp/lftp/ncftp to open a single connection

Yes, it might be a solution. I really like lftp for managing FTP transfers.
There are commands like mirror and get -c (continue/reget), and options like -f <file> (execute commands from a file and exit), that can help.
For example, I tested this with success:

$ cat cmds.txt 
get -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR381/006/ERR3814476/ERR3814476_1.fastq.gz
get -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR381/006/ERR3814476/ERR3814476_1.fastq.gz
get -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR164/ERR164407/ERR164407.fastq.gz

$ lftp -f cmds.txt

No new download if the file already exists, re-get if the file has changed.
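
And if you want something closer to an rsync for a whole run directory, mirror can do it too. Just a sketch (not tested on this exact path):

  $ lftp -e 'mirror --continue --parallel=1 /vol1/fastq/ERR381/006/ERR3814476 ERR3814476; quit' ftp://ftp.sra.ebi.ac.uk

mirror --continue skips files that are already complete and resumes partial ones, and --parallel=1 keeps a single transfer at a time to stay gentle with the server.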

Maybe it can help to manage the downloads and the ENA server limitation?

Thanks a lot David,
Looks promising. For now, I will go on with the current solution (as I rename the files with sample info when putting them in place). But I will definitely keep this solution in mind for the next analysis. It helps a lot.
Congratulations on this solution, and "Bon bout d'an" as we say in Marseille :).


Hello Denis,

ENA IT support came back to us and added our site to the whitelist.

But it seems that the ban didn't come from too many downloads but from bad/incomplete requests:

[...] you were initially banned because of what we consider 'bad requests' , meaning that your download mechanism was not finishing the file downloads in a complete manner. Please investigate the origin of the problem at your end.

Do you have any idea, or any log, that could help us understand?

"Bon bout d'an" également :slight_smile:

Hi David,
Happy new year and thanks for your support.
This is probably because I restarted and killed my workflow several times before launching the final version. I restarted it many times because I was struggling with one Snakemake attribute (resources) that, in my workflow, controls the balance between rules and limits the number of concurrent downloads. I don't remember exactly, but it may be related to that. I will be more careful about killing downloads in the future...
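
For the next analyses, I will probably also make the download command resumable, so that a relaunched job continues an interrupted file instead of requesting it again from zero. Just a sketch with one of the files above:

  srun wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR381/006/ERR3814476/ERR3814476_1.fastq.gz

(wget -c resumes the transfer if a partial local file is already there).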
Thanks a lot.
Best
