Saturday, June 1, 2024

Sample Size

The sample size used in multiple regression is perhaps the single most influential element under the researcher's control in designing the analysis. The effects of sample size are seen most directly in the statistical power of the significance tests and in the generalizability of the results. Both issues are addressed in the following sections.

(Source: https://real-statistics.com/wp-content/uploads/2012/11/statistical-power-chart.png)


STATISTICAL POWER AND SAMPLE SIZE

  • The size of the sample has a direct impact on the appropriateness and the statistical power of multiple regression. 
  • Small samples, usually characterized as having fewer than 30 observations (?), are appropriate for analysis only by simple regression with a single independent variable. Even in these situations, only strong relationships can be detected with any degree of certainty. Likewise, large samples of 1,000 observations or more make the statistical significance tests overly sensitive, often indicating that almost any relationship is statistically significant. With such large samples the researcher must ensure that the criterion of practical significance is met along with statistical significance.
  • The researcher can also consider the role of sample size in significance testing before collecting data. If weaker relationships are expected, the researcher can make informed judgments as to the necessary sample size to reasonably detect the relationships, if they exist.
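As a rough illustration of planning sample size before collecting data, the sketch below uses the Fisher z approximation for a single correlation, which corresponds to simple regression with one independent variable. This approximation is not taken from Hair et al.; the function names and the defaults of alpha = 0.05 and power = 0.80 are assumptions chosen for illustration only.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def required_n(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation of size r
    with a two-sided test, using the Fisher z approximation."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(abs(r))) ** 2 + 3)


def approx_power(r, n, alpha=0.05):
    """Approximate power to detect a correlation of size r with n observations."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(math.atanh(abs(r)) * math.sqrt(n - 3) - z_alpha)


def smallest_significant_r(n, alpha=0.05):
    """Smallest sample correlation that reaches significance with n observations."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return math.tanh(z_alpha / math.sqrt(n - 3))


# Weaker expected relationships call for much larger samples.
for r in (0.1, 0.3, 0.5):
    print(f"r = {r}: about {required_n(r)} observations for 80% power")

# With a very large sample, even a tiny correlation is statistically significant.
print(f"n = 1000: r >= {smallest_significant_r(1000):.3f} is already significant")
```

Under this approximation, a weak relationship (r = 0.1) needs several hundred observations, while at n = 1000 a correlation of roughly 0.06, which explains well under 1% of the variance, already reaches significance. That is why practical significance has to be judged separately from statistical significance.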

GENERALIZABILITY AND SAMPLE SIZE

  • In addition to its role in determining statistical power, sample size also affects the generalizability of the results through the ratio of observations to independent variables.
  • A general rule is that the ratio should never fall below 5:1, meaning that five observations are made for each independent variable in the variate (Why?). Although the minimum ratio is 5:1, the desired level is between 15 and 20 observations for each independent variable. When this level is reached, the results should be generalizable if the sample is representative. However, if a stepwise procedure is employed, the recommended level increases to 50:1 because this technique selects only the strongest relationships within the data set and suffers from a greater tendency to become sample-specific. In cases for which the available sample does not meet these criteria, the researcher should be certain to validate the generalizability of the results.
  • As this ratio falls below 5:1, the researcher encounters the risk of overfitting the variate to the sample, making the results too specific to the sample and thus lacking generalizability. 
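The ratio guidelines above can be written as a small check. This is a minimal sketch: the function name `assess_ratio` and the wording of the returned labels are hypothetical, and only the thresholds (5:1 minimum, 15:1 as the lower end of the desired range, 50:1 for stepwise procedures) come from the text.

```python
def assess_ratio(n_obs, n_independent, stepwise=False):
    """Classify the ratio of observations to independent variables
    against the rule-of-thumb thresholds quoted in the text."""
    if n_independent < 1:
        raise ValueError("at least one independent variable is required")

    ratio = n_obs / n_independent

    # Below 5:1 the variate is likely to be overfitted to the sample.
    if ratio < 5:
        return "below 5:1 minimum: high risk of overfitting"

    # Desired level: 15:1 (lower end of 15 to 20), or 50:1 for stepwise estimation.
    desired = 50 if stepwise else 15
    if ratio < desired:
        return "meets minimum, below desired level: validate the results"

    return "meets desired level"


print(assess_ratio(40, 10))                   # ratio 4:1
print(assess_ratio(100, 10))                  # ratio 10:1
print(assess_ratio(200, 10))                  # ratio 20:1
print(assess_ratio(200, 10, stepwise=True))   # ratio 20:1, stepwise
```

The same 200 observations with 10 independent variables pass the check for simultaneous estimation but not for a stepwise procedure, which reflects the stricter 50:1 recommendation.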

Degrees of Freedom as a Measure of Generalizability.

  • We can perfectly predict one observation with a single variable, but what about all the other observations? Thus, the researcher is searching for the best regression model, one with the highest predictive accuracy for the largest (most generalizable) sample. The degree of generalizability is represented by the degrees of freedom, calculated as:
Degrees of freedom (df) = Sample size - Number of estimated parameters
  • The larger the degrees of freedom, the more generalizable are the results. Degrees of freedom increase for a given sample by reducing the number of independent variables. Thus, the objective is to achieve the highest predictive accuracy with the most degrees of freedom. 
  • No specific guidelines determine how large the degrees of freedom must be, only that they are indicative of the generalizability of the results and give an idea of the degree of overfitting for any regression model.
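The degrees-of-freedom formula can be sketched as follows. The function name is hypothetical, and counting the intercept as one estimated parameter alongside one coefficient per independent variable is an assumption about how the parameters are tallied.

```python
def degrees_of_freedom(n_obs, n_independent, intercept=True):
    """df = sample size - number of estimated parameters.

    Assumes one coefficient per independent variable, plus one
    parameter for the intercept when it is estimated.
    """
    n_params = n_independent + (1 if intercept else 0)
    df = n_obs - n_params
    if df < 0:
        raise ValueError("more estimated parameters than observations")
    return df


# For a fixed sample, each added independent variable costs one degree of freedom.
n_obs = 30
for k in (1, 5, 10, 28):
    print(f"{k} independent variable(s): df = {degrees_of_freedom(n_obs, k)}")
```

With 30 observations, a model with 28 independent variables leaves a single degree of freedom, so it can fit the sample almost perfectly while saying very little about other observations.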

Original source:

  • Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2013). Multivariate data analysis (8th ed.). Boston: Cengage.
