Optimally Combining Censored and Uncensored Datasets
Economists and other social scientists often face situations where they have access to two datasets, one of which suffers from censoring or truncation. If the censored sample is much bigger than the uncensored sample, it is common for researchers to use the censored sample alone and attempt to deal with the problem of partial observation in some manner. Alternatively, they use only the uncensored sample and ignore the censored one so as to avoid biases. It is rarely the case that researchers use both datasets together, mainly because they lack guidance about how to combine them. In this paper, we develop a tractable semiparametric framework for combining the censored and uncensored datasets so that the resulting estimators are consistent, asymptotically normal, and use all information optimally. When the censored sample, which we refer to as the master sample, is much bigger than the uncensored sample (which we call the refreshment sample), the latter can be thought of as providing identification where it is otherwise absent. In contrast, when the refreshment sample is large and could typically be used alone, our methodology can be interpreted as using information from the censored sample to increase efficiency. To illustrate our results in an empirical setting, we show how to estimate the effect of changes in compulsory schooling laws on age at first marriage, a variable that is censored for younger individuals. We also demonstrate how refreshment samples for this application can be created by matching cohort information across census datasets.