Department of Mathematical Sciences
|DATE:||Thursday, October 8, 2015|
|TIME:||1:15pm to 2:40pm|
|SPEAKER:||Qiyi Lu, Binghamton University|
|TITLE:||Learning Partially Labeled Data in the High-dimensional, Low-sample Size Setting|
The High-Dimensional, Low-Sample Size (HDLSS) data setting is very challenging for statistical learning, and it occurs in many applied areas such as gene expression microarray analysis, facial recognition, medical image analysis and text classification. In many real applications, it is costly to label observations manually; hence, often only a small portion of the data is labeled while a large number of observations are left without a label. It is a great challenge to obtain good statistical learning performance from the labeled data alone, especially when the dimension is greater than the number of labeled observations. We are interested in learning from this type of data in two areas: classification and significance analysis.
Classification is an important tool with many useful applications. Among the many classification methods, Fisher's Linear Discriminant Analysis (LDA) is a traditional model-based approach which makes use of the covariance information. However, in the HDLSS setting, LDA cannot be directly applied because the sample covariance matrix is not invertible. While there are modern methods designed to deal with high-dimensional data, it is hard to obtain good classification performance from the labeled data alone when the data are partially labeled. To overcome these issues, we propose a semi-supervised sparse LDA classifier that takes advantage of the seemingly useless unlabeled data. The unlabeled observations provide additional information, which helps to boost the classification performance in some situations.
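The non-invertibility mentioned above follows from a short rank argument, sketched here for illustration. The notation is an assumption and does not come from the talk: x_1, ..., x_n are the labeled observations in R^d, S is their sample covariance matrix, and the two class means are written with subscripts 1 and 2. LDA normally uses the pooled within-class covariance, which only lowers the rank bound further (to n - 2 for two classes), so the conclusion is unchanged.

```latex
% Assumed notation (not from the abstract): x_1, ..., x_n in R^d are labeled observations.
% Requires amsmath for \operatorname.
\[
S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]
% The centered vectors sum to zero, so they span at most n - 1 dimensions.
\[
\sum_{i=1}^{n} (x_i - \bar{x}) = 0
\;\Longrightarrow\;
\operatorname{rank}(S) \le n - 1 < d
\quad \text{whenever } d \ge n .
\]
% S is therefore singular, and the classical LDA direction is undefined.
\[
w = S^{-1}(\bar{x}_1 - \bar{x}_2)
\quad \text{does not exist when } d \ge n .
\]
```

This is why high-dimensional variants of LDA typically replace the inverse of S with a regularized or sparse estimate.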
Before applying a classification or clustering method, a natural question is whether the predefined classes are really different from one another, or whether the clusters are really there. Although these are challenging questions in the HDLSS setting, there have been some recent developments on both. We propose a significance analysis approach for a different type of data, namely partially labeled data. Our method makes use of the whole data set and tests the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method achieves greater power while maintaining the size of the test.
Both proposed methods are designed for partially labeled data in the HDLSS setting. Their theoretical properties are studied, and simulated and real data examples illustrate their usefulness.