Multinomial Regression on Wide Datasets
Dataset
The authors of “Clustering Cancer Gene Expression Data: a Comparative Study”, clustered gene expression data for various cancers. Each of the 35 data sets is “wide” meaning it has more predictors than examples, and in this case many more predictors than examples (typically more than 1000 predictors and less than 100 examples). My goal was to assign a class from a subset of various medical issues. Each data set has a different number of possible classes, between 2 and 14.
The actual data can be found here: http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/datasets.htm
A more detailed description of the data can be found here: http://bioinformatics.rutgers.edu/Supplements/CompCancer/
Implementation
Github repo available here: https://github.com/dcyoung/MultinomialRegression
First the datasets were checked for missing or non-numeric values. Only 1 dataset contained such values and they were replaced by zeros. Experimentation with further pre-processing (dimensionality reduction via PCA) did not yield better performance, but actually decreased performance. Because of this, the number of predictors was left intact and fed to the regression where the lasso would effectively prune unnecessary predictors.
For each dataset, a multinomial regression model with the lasso was constructed and used to predict the relevant classes. At first the loss for cross-validation was left at its default type.measure of “deviance” which uses the squared-error for Gaussian models. While this was producing 100% training accuracy, the cv.glment cross validation error for the best model was not optimal. Changing the loss type.measure to “class” forced the package to use the misclassification error (only suitable for binomial or multinomial regression). Using the misclassification error for loss drastically improved the cross validation error for the best model. There were many cases where the error dropped from around 30% to less than 10%.
Sometimes, the best model yielded a misclassification error that was only slightly better than a model which used far fewer coefficients. Depending on the application, model complexity may be more important than absolute lowest misclassification error. If that is the case, one might wish to examine a plot of misclassification rate vs. regularization constant (representing # of included coefficients) and choose by hand which model strikes the right balance between complexity and performance. For these data sets however, complexity was not greatly affecting runtime, so the model which yielded the lowest misclassification error was chosen in all cases. More specifically, the model was chosen by selecting the regularization constant which yielded the lowest misclassification error. The number of genes used in the model is represented by the number of coefficients included in the model at that regularization constant.
Results
The results for each data set are outlined in the following table. I’ve included the chosen regularization constant, the training error and a ratio of the # of included genes/ # of classes. Lastly I included an average at the bottom. Some of the average values are relatively meaningless given the diversity of the dataset parameters, but some are noteworthy. The average CV misclassification error was around 12%, which is impressive given how wide the datasets were. This is a really powerful technique! The average training error was around 2-3% which just confirms the model was building correctly, while the misclassification error is a better representation of its performance. The average ratio of #genes/#classes is a little over 4. Meaning on average a model required 4x as many non-zero coefficients (or predictive genes) as it had classes from which to classify. This must be taken with a large grain of salt however… as described previously, for some data sets the best model performed only slightly better than a much simpler model. Choosing a good balance of complexity to performance would yield a more representative average of # of genes/# of classes. But I thought I would include the statistic anyways, as it is interesting.
Data Base File Name: | # of Classes | Regularization Constant (Lambda) for best model | Coefficients (# included genes) | Cross Validation Misclassification Error | Training Error | Ratio #Genes/#Classes |
---|---|---|---|---|---|---|
alizadeh-2000-v1_database.txt | 2 | 0.01916398 | 24 | 0.09523810 | 0.00000000 | 12.00 |
alizadeh-2000-v2_database.txt | 3 | 0.12037204 | 5 | 0.01612903 | 0.00000000 | 1.67 |
alizadeh-2000-v3_database.txt | 4 | 0.06157294 | 11 | 0.08064516 | 0.00000000 | 2.75 |
armstrong-2002-v1_database.txt | 2 | 0.08594995 | 9 | 0.05555556 | 0.01388889 | 4.50 |
armstrong-2002-v2_database.txt | 3 | 0.01195216 | 14 | 0.08333333 | 0.00000000 | 4.67 |
bhattacharjee-2001_database.txt | 5 | 0.00702128 | 10 | 0.07389163 | 0.00000000 | 2.00 |
bittner-2000_database.txt | 2 | 0.00549008 | 20 | 0.10526316 | 0.00000000 | 10.00 |
bredel-2005_database.txt | 3 | 0.02097127 | 11 | 0.14000000 | 0.00000000 | 3.67 |
chen-2002_database.txt | 2 | 0.02845673 | 18 | 0.03910615 | 0.01675978 | 9.00 |
chowdary-2006_database.txt | 2 | 0.07596433 | 6 | 0.01923077 | 0.01923077 | 3.00 |
dyrskjot-2003_database.txt | 3 | 0.05072785 | 9 | 0.22500000 | 0.00000000 | 3.00 |
garber-2001_database.txt | 4 | 0.11605594 | 4 | 0.15151515 | 0.12121212 | 1.00 |
golub-1999-v1_database.txt | 2 | 0.02017088 | 19 | 0.05555556 | 0.00000000 | 9.50 |
golub-1999-v2_database.txt | 3 | 0.12477965 | 5 | 0.05555556 | 0.04166667 | 1.67 |
gordon-2002_database.txt | 2 | 0.02947920 | 16 | 0.02209945 | 0.00000000 | 8.00 |
khan-2001_database.txt | 4 | 0.13232613 | 5 | 0.01204819 | 0.00000000 | 1.25 |
laiho-2007_database.txt | 2 | 0.11268743 | 13 | 0.13513514 | 0.02702703 | 6.50 |
lapointe-2004-v1_database.txt | 3 | 0.08378991 | 9 | 0.19117647 | 0.04411765 | 3.00 |
lapointe-2004-v2_database.txt | 4 | 0.02441359 | 19 | 0.16513762 | 0.00000000 | 4.75 |
liang-2005_database.txt | 3 | 0.12153683 | 1 | 0.02702703 | 0.00000000 | 0.33 |
nutt-2003-v1_database.txt | 4 | 0.00630569 | 14 | 0.34000000 | 0.00000000 | 3.50 |
nutt-2003-v2_database.txt | 2 | 0.00471073 | 17 | 0.17857143 | 0.00000000 | 8.50 |
nutt-2003-v3_database.txt | 2 | 0.35731114 | 3 | 0.27272727 | 0.31818182 | 1.50 |
pomeroy-2002-v1_database.txt | 2 | 0.02637890 | 14 | 0.05882353 | 0.00000000 | 7.00 |
pomeroy-2002-v2_database.txt | 5 | 0.07076972 | 7 | 0.23809524 | 0.00000000 | 1.40 |
ramaswamy-2001_database.txt | 14 | 0.02225367 | 8 | 0.30000000 | 0.07368421 | 0.57 |
risinger-2003_database.txt | 4 | 0.06281940 | 6 | 0.14285714 | 0.00000000 | 1.50 |
shipp-2002-v1_database.txt | 2 | 0.10239171 | 15 | 0.16883117 | 0.15584416 | 7.50 |
singh-2002_database.txt | 2 | 0.09746390 | 7 | 0.05882353 | 0.06862745 | 3.50 |
su-2001_database.txt | 10 | 0.02999744 | 7 | 0.08620690 | 0.00574713 | 0.70 |
tomlins-2006_database.txt | 5 | 0.03332642 | 18 | 0.18269231 | 0.00961539 | 3.60 |
tomlins-2006-v2_database.txt | 4 | 0.04036825 | 12 | 0.14130435 | 0.00000000 | 3.00 |
west-2001_database.txt | 2 | 0.06085212 | 14 | 0.12244898 | 0.00000000 | 7.00 |
yeoh-2002-v1_database.txt | 2 | 0.00587707 | 16 | 0.00806452 | 0.00000000 | 8.00 |
yeoh-2002-v2_database.txt | 6 | 0.02372582 | 12 | 0.12903226 | 0.01209677 | 2.00 |
Average: | 3.54 | 0.06278383 | 11.37 | 0.1193 | 0.0265 | 4.33 |
Comments