A Four-Step Process for Identifying Missing Data
The first step in any examination of missing data is to determine the type of missing data involved. Here the researcher is concerned whether the missing data are part of the research design and under the control of the researcher or whether the “causes” and impacts are truly unknown. Let’s start with the missing data that are part of the research design and can be handled directly by the researcher.
- Ignorable Missing Data: The justification for designating missing data as ignorable is that the missing data process is operating at random (i.e., the observed values are a random sample of the total set of values, observed and missing) or explicitly accommodated in the technique used. There are three instances in which a researcher most often encounters ignorable missing data.
- The first example encountered in almost all surveys and most other data sets is the ignorable missing data process resulting from taking a sample of the population rather than gathering data from the entire population. In these instances, the missing data are those observations in a population that are not included when taking a sample. The purpose of multivariate techniques is to generalize from the sample observations to the entire population, which is really an attempt to overcome the missing data of observations not in the sample. The researcher makes these missing data ignorable by using probability sampling to select respondents. Probability sampling enables the researcher to specify that the missing data process leading to the omitted observations is random and that the missing data can be accounted for as sampling error in the statistical procedures. Thus, the missing data of the nonsampled observations are ignorable.
- A second instance of ignorable missing data is due to the specific design of the data collection process. Certain nonprobability sampling plans are designed for specific types of analysis that accommodate the nonrandom nature of the sample. Much more common are missing data due to the design of the data collection instrument, such as through skip patterns where respondents skip sections of questions that are not applicable.
- A third type of ignorable missing data occurs when the data are censored. Censored data are observations not complete because of their stage in the missing data process. A typical example is an analysis of the causes of death. Respondents who are still living cannot provide complete information (i.e., cause or time of death) and are thus censored.
The primary issue in this step of the process is to determine whether the extent or amount of missing data is low enough to not affect the results, even if it operates in a nonrandom manner. If it is sufficiently low, then any of the approaches for remedying missing data may be applied. If the missing data level is not low enough, then we must first determine the randomness of the missing data process before selecting a remedy (step 3).
How Much Missing Data Is Too Much?
The most direct means of assessing the extent of missing data is by tabulating (1) the percentage of variables with missing data for each case and (2) the number of cases with missing data for each variable. This simple process identifies not only the extent of missing data, but any exceptionally high levels of missing data that occur for individual cases or observations. The researcher should look for any nonrandom patterns in the data, such as concentration of missing data in a specific set of questions, attrition in not completing the questionnaire, and so on. Finally, the researcher should determine the number of cases with no missing data on any of the variables, which will provide the sample size available for analysis if remedies are not applied.
If it is determined that the extent is acceptably low and no specific nonrandom patterns appear, then the researcher can employ any of the imputation techniques (step 4) without biasing the results in any appreciable manner. If the level of missing data is too high, then the researcher must consider specific approaches to diagnosing the randomness of the missing data processes (step 3) before proceeding to apply a remedy.
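The tabulation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the data set, the variable names, and the helper function names are all hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical data set: each dict is one case, None marks a missing value.
cases = [
    {"age": 34, "income": 52000, "satisfaction": 4},
    {"age": None, "income": 61000, "satisfaction": 5},
    {"age": 45, "income": None, "satisfaction": None},
    {"age": 29, "income": 48000, "satisfaction": 3},
    {"age": 51, "income": None, "satisfaction": 2},
]
variables = ["age", "income", "satisfaction"]


def pct_missing_per_case(rows, names):
    """Percentage of variables that are missing for each case."""
    return [100.0 * sum(row[v] is None for v in names) / len(names) for row in rows]


def n_missing_per_variable(rows, names):
    """Number of cases with a missing value for each variable."""
    return {v: sum(row[v] is None for row in rows) for v in names}


def n_complete_cases(rows, names):
    """Cases with no missing data on any variable (sample size if no remedy is applied)."""
    return sum(all(row[v] is not None for v in names) for row in rows)


per_case = pct_missing_per_case(cases, variables)
per_variable = n_missing_per_variable(cases, variables)
complete = n_complete_cases(cases, variables)

print(per_case)
print(per_variable)
print(complete)
```

In this made-up example the third case is missing two of its three variables and `income` is missing for two of the five cases, which is the kind of concentration the tabulation is meant to expose.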
Deletions Based on Missing Data
Imputation of Missing Data
Levels of Randomness of the Missing Data Process
- Missing At Random, or MAR
Missing data are termed missing at random (MAR) if the missing values of Y depend on X, but not on Y. In other words, the observed Y values represent a random sample of the actual Y values for each value of X, but the observed data for Y do not necessarily represent a truly random sample of all Y values. Even though the missing data process is random in the sample, its values are not generalizable to the population. Most often, the data are missing randomly within subgroups, but differ in levels between subgroups. The researcher must determine the factors determining the subgroups and the varying levels between groups.
- Missing Completely At Random, or MCAR
A higher level of randomness is termed missing completely at random (MCAR). In these instances the observed values of Y are truly a random sample of all Y values, with no underlying process that lends bias to the observed data. In simple terms, the cases with missing data are indistinguishable from cases with complete data.
Only MCAR allows for the use of any remedy desired. The distinction between these two levels is in the generalizability to the population.
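A small simulation may help make the difference between the two levels concrete. It is only a sketch under assumed values: the population, the group effect on Y, and the missingness rates (30% for MCAR; 10% and 50% by group for MAR) are invented for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: X is a group indicator and Y is higher in group 1.
n = 20000
x = [random.randint(0, 1) for _ in range(n)]
y = [random.gauss(50 + 10 * xi, 5) for xi in x]
full = list(zip(x, y))

# MCAR: every Y value has the same 30% chance of being missing.
mcar_obs = [(xi, yi) for xi, yi in full if random.random() > 0.30]

# MAR: the chance that Y is missing depends on X only (10% vs. 50%), not on Y itself.
p_miss = {0: 0.10, 1: 0.50}
mar_obs = [(xi, yi) for xi, yi in full if random.random() > p_miss[xi]]


def mean_y(pairs, group=None):
    """Mean of Y over all pairs, or over one X group if `group` is given."""
    return statistics.mean(yi for xi, yi in pairs if group is None or xi == group)


true_mean = mean_y(full)
mcar_mean = mean_y(mcar_obs)
mar_mean = mean_y(mar_obs)
true_mean_group1 = mean_y(full, group=1)
mar_mean_group1 = mean_y(mar_obs, group=1)

print(round(true_mean, 2), round(mcar_mean, 2), round(mar_mean, 2))
print(round(true_mean_group1, 2), round(mar_mean_group1, 2))
```

Under these assumptions the MCAR mean should stay close to the full-data mean, while the overall MAR mean is pulled toward the group that is observed more often. Within each value of X, however, the observed Y values still behave like a random sample, which is what the MAR definition states.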
Diagnostic Tests for Levels of Randomness
- The first diagnostic assesses the missing data process of a single variable Y by forming two groups: observations with missing data for Y and those with valid values of Y. Statistical tests are then performed to determine whether significant differences exist between the two groups on other variables of interest. Significant differences indicate the possibility of a nonrandom missing data process.
- A second approach is an overall test of randomness that determines whether the missing data can be classified as MCAR. This test analyzes the pattern of missing data on all variables and compares it with the pattern expected for a random missing data process. If no significant differences are found, the missing data can be classified as MCAR. If significant differences are found, however, the researcher must use the approaches described previously to identify the specific missing data processes that are nonrandom.
As a result of these tests, the missing data process is classified as either MAR or MCAR, which then determines the appropriate types of potential remedies. Even though achieving the level of MCAR requires a completely random pattern in the missing data, it is the preferred type because it allows for the widest range of potential remedies.
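The first diagnostic (splitting cases by whether Y is missing and comparing the groups on another variable) could be sketched as follows. The data are hypothetical, and Welch's t statistic is used here as one possible choice of test; the text does not prescribe a specific test. The overall test of randomness is not sketched.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two groups."""
    se_a = statistics.variance(group_a) / len(group_a)
    se_b = statistics.variance(group_b) / len(group_b)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se_a + se_b)


# Hypothetical cases as (x, y) pairs; y is None when it is missing.
data = [(23, 4.1), (25, 3.9), (31, None), (35, 4.4), (41, None),
        (44, None), (28, 4.0), (47, None), (26, 3.7), (39, None),
        (24, 4.2), (45, None), (30, 3.8), (27, 4.3)]

# Form the two groups: cases with missing Y and cases with valid Y.
x_when_y_missing = [x for x, y in data if y is None]
x_when_y_valid = [x for x, y in data if y is not None]

# Compare the groups on the other variable X.
t_stat = welch_t(x_when_y_missing, x_when_y_valid)
print(round(t_stat, 2))
```

A t statistic that is large in absolute value suggests that the cases missing Y differ on X from the cases with valid Y, which points to a possibly nonrandom missing data process. In practice the statistic would be compared against a t distribution, and the comparison would be repeated for each other variable of interest.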
Imputation is the process of estimating the missing value based on valid values of other variables and/or cases in the sample. The objective is to employ known relationships that can be identified in the valid values of the sample to assist in estimating the missing values. However, the researcher should carefully consider the use of imputation in each instance because of its potential impact on the analysis.
Comparison of Imputation Techniques for Missing Data
All of the imputation methods discussed in this section are used primarily with metric variables; nonmetric variables are left as missing unless a specific modeling approach is employed. Nonmetric variables are not amenable to imputation because even though estimates of the missing data for metric variables can be made with such values as a mean of all valid values, no comparable measures are available for nonmetric variables. As such, nonmetric variables require an estimate of a specific value rather than an estimate on a continuous scale. It is different to estimate a missing value for a metric variable, such as an attitude or perception—even income—than it is to estimate the respondent’s gender when missing.
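As an illustration of the metric versus nonmetric distinction, the sketch below applies mean substitution to a metric variable and leaves a nonmetric variable as missing. The records, the variable names, and the `mean_substitute` helper are hypothetical, and mean substitution is shown only because the text mentions the mean of all valid values; it is one of several imputation methods.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"income": 40000, "gender": "F"},
    {"income": None, "gender": "M"},
    {"income": 60000, "gender": None},
    {"income": 50000, "gender": "F"},
]
metric_vars = ["income"]  # assumed classification of variables as metric


def mean_substitute(rows, metric_names):
    """Replace missing metric values with the mean of the valid values.

    Variables not listed in metric_names are copied unchanged, so missing
    nonmetric values stay missing. The input rows are not modified.
    """
    means = {
        v: statistics.mean(r[v] for r in rows if r[v] is not None)
        for v in metric_names
    }
    return [
        {k: (means[k] if k in means and val is None else val) for k, val in r.items()}
        for r in rows
    ]


imputed = mean_substitute(records, metric_vars)
for row in imputed:
    print(row)
```

Because every missing value receives the same estimate, mean substitution reduces the variance of the imputed variable, which is one reason the use of imputation deserves careful consideration in each instance.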
- Source: Hair, J. F. (2009). Multivariate data analysis.