Maximum allowed job run time

simremy · April 21, 2020, 7:00

Hello,
I'm new to W4M. I ran a PCA on a big dataset (300 samples), which went well. However, my PLS analysis exceeded the walltime.

Could you help?

Thank you
Simon

Galaxy Tool ID:	toolshed.g2.bx.psu.edu/repos/ethevenot/multivariate/Multivariate/2.3.10
Galaxy Tool Version:	2.3.10
Tool Version:	None
Tool Standard Output:	stdout
Tool Standard Error:	stderr
Tool Exit Code:	None
History Content API ID:	6eece17699c62b74
Job API ID:	777ca7cf1c191b30
History API ID:	49f747565f9af515
UUID:	85bac157-afee-42a2-8da7-c0f754be2cb5

gildaslecorguille · April 21, 2020, 2:14

@ethevenot, the current setting is 12h.
Is it normal for it to take so long?

ethevenot · April 21, 2020, 4:48

I do not know how many features you have, but this running time definitely seems too long (it should take < 1 min). Feel free to share your history with me (etienne.thevenot@cea.fr) or send me the 3 .tsv tables if you wish.

Best wishes,

Etienne.

ethevenot · April 22, 2020, 12:58

Which response are you trying to predict with PLS(-DA)?

simremy · April 22, 2020, 1:50

Hello Etienne,

I'm trying to predict the Family.

ethevenot · April 22, 2020, 2:50

I did a few checks on your data and here are some comments.

I can reproduce your results: PCA runs fine in a few seconds, but the PLS-DA predicting the 'family' response has still not converged after many seconds.
Your dataset contains ~100 times more features than samples (hence a high risk of overfitting), and two of your 5 classes contain fewer than 7 samples (which is very few; in any case, the cross-validation argument, which is 7 by default, should be lowered).
I therefore focused on the 2 classes with the most samples, randomly selected 1/10 of the features, and kept only two components. This gives a significant model within a few seconds. I could also obtain a significant model with the 3 main classes and 2 (or 3) components in a few minutes.
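
For readers who want to try the same quick check outside Galaxy, here is a minimal Python sketch of the selection step only (it does not fit the PLS-DA model). The function name `pick_subset`, the dictionary layout of the sample classes, and the rule of capping the number of cross-validation segments at the size of the smallest kept class are assumptions made for illustration; they are not taken from the Multivariate tool.

```python
import random
from collections import Counter


def pick_subset(sample_classes, feature_names, n_classes=2,
                feature_fraction=0.1, seed=0):
    """Select samples from the largest classes and a random share of features.

    sample_classes: dict mapping sample name -> class label (hypothetical layout)
    feature_names:  list of feature identifiers
    """
    counts = Counter(sample_classes.values())
    # Keep the n_classes classes with the most samples.
    kept_classes = {label for label, _ in counts.most_common(n_classes)}
    kept_samples = [s for s, c in sample_classes.items() if c in kept_classes]

    # Draw a reproducible random fraction of the features.
    rng = random.Random(seed)
    n_features = max(1, round(len(feature_names) * feature_fraction))
    kept_features = sorted(rng.sample(feature_names, n_features))

    # Heuristic (assumption): never use more cross-validation segments
    # than there are samples in the smallest kept class; 7 is the default
    # mentioned in the thread.
    smallest = min(counts[c] for c in kept_classes)
    cv_segments = min(7, smallest)
    return kept_samples, kept_features, cv_segments
```

The returned sample and feature names can then be used to filter the three .tsv tables before re-running the model with two components.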

simremy · April 22, 2020, 3:06

Thank you very much! I've been struggling with that dataset for a while!

So do you suggest that I should:

Increase my detection threshold to keep fewer features during pre-processing?
Replace the classes that have few samples with a single class "other" in my sampleMetadata file?

Thanks again

ethevenot · April 22, 2020, 3:15

I assume that some (many) of your features are noise, which should indeed be filtered out during preprocessing or quality control.
I would first analyze your 3 main classes in the classical way. You might also group the remaining samples into a 4th class.
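
As an illustration of the regrouping idea, the following Python sketch relabels every class outside the largest ones as a single extra class. The function name `merge_small_classes` and the label "other" are hypothetical; the relabelled column would still have to be written back into the sampleMetadata file by hand or with a script of your choice.

```python
from collections import Counter


def merge_small_classes(labels, keep=3, other_label="other"):
    """Keep the `keep` largest classes and relabel all others as `other_label`.

    labels: list of class labels, one per sample, in sample order
    """
    counts = Counter(labels)
    main_classes = {label for label, _ in counts.most_common(keep)}
    # Preserve the sample order so the result lines up with the metadata rows.
    return [label if label in main_classes else other_label
            for label in labels]
```

With 5 original classes and `keep=3`, this yields the 3 main classes plus one grouped 4th class.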