seminars:stat:10082015

Differences

This shows you the differences between two versions of the page.

seminars:stat:10082015 [2015/08/31 11:50]
qiao created
seminars:stat:10082015 [2015/09/05 22:29] (current)
qiao
Line 4: Line 4:
 <WRAP 70% center>
 ^  **DATE:**|Thursday, October 8, 2015 |
-^  **TIME:**|1:15pm to 2:40pm |
+^  **TIME:**|**1:00pm to 2:30pm** (note the different time and longer duration) |
 ^  **LOCATION:**|WH 100E |
 ^  **SPEAKER:**|Qiyi Lu, Binghamton University |
Line 13: Line 13:
 <WRAP center box 80%>
 <WRAP centeralign>**Abstract**</WRAP>
-The High-Dimensional, Low-Sample Size (HDLSS) data setting is very challenging for statistical learning and it occurs in many applied areas such as gene expression micro-array analysis, facial recognition, medical image analysis and text classification. In many real applications, it is costly to manually place labels on observations; hence it is often that only a small portion of labeled data is available while a large number of observations are left without a label. It is a great challenge to obtain good statistical learning performance through the labeled data alone, especially when the dimension is greater than the size of the labeled data. We are interested in learning this type of data in two areas: classification and significance analysis.
+High-Dimensional, Low-Sample Size (HDLSS) is a very challenging data setting in statistics and statistical learning, including regression, classification, clustering, etc. The HDLSS data appear in many applied areas such as gene expression micro-array analysis, facial recognition, medical image analysis and text classification. In the context of classification, in many real applications, it is costly to manually place the class labels on observations; as a consequence, often only a small portion of labeled data is available while a large number of observations are left without labels. Such partially labeled data are very difficult to analyze in the HDLSS setting. In this dissertation, we study the HDLSS partially labeled data in two aspects. We push forward the frontier of knowledge by creating a new classification method and a significance analysis tool for the partially labeled data.
-
+
-Classification is an important tool with many useful applications. Among the many classification methods, Fisher's Linear Discriminant Analysis (LDA) is a traditional model-based approach which makes use of the covariance information. However, in the HDLSS setting, LDA cannot be directly deployed because the sample covariance is not invertible. While there are modern methods designed to deal with high-dimensional data, it is hard to obtain good classification performance on the labeled data alone when the data are partially labeled. In order to overcome these issues, we propose a semi-supervised sparse LDA classifier to take advantage of the seemingly useless unlabeled data. They provide additional information which helps to boost the classification performance in some situations.
+
-
+
-Before applying a classification/clustering method, a natural question is whether predefined classes are really different from one another, or whether clusters are really there. Although they are challenging questions in the HDLSS setting, there has been some recent development for both. We propose a significance analysis approach for a different type of data, namely partially labeled data. Our method makes use of the whole data and tries to test the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method provides a greater power, meanwhile, maintaining the size.
+
-
+
-Both of these two proposed methods are designed for partially labeled data in the HDLSS setting. Theoretical properties are studied. Our simulated and real data examples help to understand and illustrate the usefulness of the proposed methods.
+
-
+
  
+Classification is an important tool with many useful applications. Among the many existing classification methods, Fisher's Linear Discriminant Analysis (LDA) is a traditional model-based approach which makes use of distributional information such as the covariance of the features. However, in the HDLSS setting, LDA cannot be directly deployed because the sample covariance is not invertible. While there are modern methods designed to deal with the high dimensionality, it is difficult to obtain good performance for the partially labeled data when the analysis is based on the labeled data alone, due to the scarcity of the data. In order to overcome the difficulty, and to make full use of the seemingly useless unlabeled data, we propose a semi-supervised sparse LDA classifier in this dissertation. Our method combines LDA, a model-based approach, with some machine-learning-oriented components. The extra components help to extract useful information from the unlabeled data, which can boost the classification performance in some situations.
  
+Before learning from a data set, a natural question to ask is whether the predefined classes are really different from one another (in the context of classification), or whether clusters are really there (in the context of clustering). Such a question may be answered by significance tests. Even in the challenging HDLSS setting, there have been some recent developments. However, a significance analysis tool for partially labeled data has not been developed in the HDLSS setting. In this dissertation, we propose a significance analysis approach for HDLSS partially labeled data. Our method makes use of the whole data set and tries to test the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method provides greater power while maintaining the size.
  
+In studying both aspects of the partially labeled data, we provide theoretical justifications for the proposed methods. In particular, our theoretical study emphasizes the HDLSS setting, shedding light on the usefulness of the proposed methods. Lastly, comprehensive simulations and data examples illustrate the effectiveness of the methods.
 </WRAP>
  
seminars/stat/10082015.1441036216.txt · Last modified: 2015/08/31 11:50 by qiao