We can estimate the parameters of this line, such as its slope and intercept, from the GLM. From high school
algebra, recall that straight lines can be represented using the mathematical equation y =
mx + c, where m is the slope of the straight line (how much y changes for a unit change in x)
and c is the intercept term (the value of y when x is zero). In GLM, this equation is
represented formally as:
y = β0 + β1 x + ε
where β0 is the intercept term, β1 is the slope, and ε is the error term. ε represents the deviation
of actual observations from their estimated values, since most observations are close to the line
but do not fall exactly on the line (i.e., the GLM is not perfect). Note that a linear model can have
more than two predictors. To visualize a linear model with two predictors, imagine a three-dimensional
cube, with the outcome (y) along the vertical axis and the two predictors (say, x1
and x2) along the two horizontal axes at the base of the cube. A line that describes the
relationship between two or more variables is called a regression line, β0 and β1 (and other beta
values) are called regression coefficients, and the process of estimating regression coefficients is
called regression analysis. The GLM for regression analysis with n predictor variables is:
y = β0 + β1 x1 + β2 x2 + β3 x3 + … + βn xn + ε
In the above equation, predictor variables xi may represent independent variables or
covariates (control variables). Covariates are variables that are not of theoretical interest but
may have some impact on the dependent variable y and should be controlled, so that the
residual effects of the independent variables of interest are detected more precisely. Covariates
capture systematic errors in a regression equation while the error term (ε) captures random
errors. Though most variables in the GLM tend to be interval or ratio-scaled, this does not have
to be the case. Some predictor variables may even be nominal variables (e.g., gender: male or
female), which are coded as dummy variables. These are variables that can assume one of only
two possible values: 0 or 1 (in the gender example, “male” may be designated as 0 and “female”
as 1 or vice versa). A nominal variable with n levels can be represented using n–1 dummy variables. For
instance, industry sector, consisting of the agriculture, manufacturing, and service sectors, may
be represented using a combination of two dummy variables (x1, x2), with (0, 0) for agriculture,
(0, 1) for manufacturing, and (1, 0) for service. It does not matter which level of a nominal
variable is coded as 0 and which level as 1, because 0 and 1 values are treated as two distinct
groups (such as treatment and control groups in an experimental design), rather than as
numeric quantities, and the statistical parameters of each group are estimated separately.
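To make the estimation concrete, here is a minimal Python sketch that fits such a GLM by ordinary least squares, including a nominal predictor coded as two dummy variables. The data and variable names are hypothetical, and numpy's general least-squares solver is used here in place of a dedicated statistics package such as SPSS or SAS.

import numpy as np

# Hypothetical data: outcome y, one ratio-scaled predictor x1, and a nominal
# predictor (industry sector) coded as two dummy variables x2 and x3.
y  = np.array([4.0, 5.5, 6.1, 7.0, 8.2, 9.1])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 0.0])   # 1 = manufacturing
x3 = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # 1 = service; (0, 0) = agriculture

# Design matrix with a leading column of ones, so beta[0] is the intercept (beta_0).
X = np.column_stack([np.ones_like(y), x1, x2, x3])

# Ordinary least-squares estimates of the regression coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta          # sample estimates of the error term (epsilon)

print("intercept (beta_0):", round(beta[0], 3))
print("slopes (beta_1..beta_3):", np.round(beta[1:], 3))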
The GLM is a very powerful statistical tool because it is not one single statistical method,
but rather a family of methods that can be used to conduct sophisticated analysis with different
types and quantities of predictor and outcome variables. If we have a dummy predictor
variable, and we are comparing the effects of the two levels (0 and 1) of this dummy variable on
the outcome variable, we are doing an analysis of variance (ANOVA). If we are doing ANOVA
while controlling for the effects of one or more covariates, we have an analysis of covariance
(ANCOVA). We can also have multiple outcome variables (e.g., y1, y2, … yn), which are
represented using a “system of equations” consisting of a different equation for each outcome
variable (each with its own unique set of regression coefficients). If multiple outcome variables
are modeled as being predicted by the same set of predictor variables, the resulting analysis is
called multivariate regression. If we are doing ANOVA or ANCOVA analysis with multiple
outcome variables, the resulting analysis is a multivariate ANOVA (MANOVA) or multivariate
ANCOVA (MANCOVA) respectively. If we model the outcome in one regression equation as a
predictor in another equation in an interrelated system of regression equations, then we have a
very sophisticated type of analysis called structural equation modeling. The most important
problem in GLM is model specification, i.e., how to specify a regression equation (or a system of
equations) to best represent the phenomenon of interest. Model specification should be based
on theoretical considerations about the phenomenon being studied, rather than what fits the
observed data best. The role of data is in validating the model, and not in its specification.
Two-Group Comparison
One of the simplest inferential analyses is comparing the post-test outcomes of
treatment and control group subjects in a randomized post-test only control group design, such
as whether students enrolled in a special program in mathematics perform better than those in
a traditional math curriculum. In this case, the predictor variable is a dummy variable
(1=treatment group, 0=control group), and the outcome variable, performance, is ratio scaled
(e.g., score of a math test following the special program). The analytic technique for this simple
design is a one-way ANOVA (one-way because it involves only one predictor variable), and the
statistical test used is called a Student’s t-test (or t-test, in short).
The t-test was introduced in 1908 by William Sealy Gosset, a chemist working for the
Guinness Brewery in Dublin, Ireland, to monitor the quality of stout, a dark beer popular with
nineteenth-century porters in London. Because his employer did not want to reveal the fact that it was
using statistics for quality control, Gosset published the test in Biometrika under his pen name
“Student.” The letter t was adopted later by Sir Ronald Fisher as the notation for the test statistic.
Hence the name Student’s t-test, although Student’s identity was known to fellow statisticians.
The t-test examines whether the means of two groups are statistically different from
each other (non-directional or two-tailed test), or whether one group has a statistically larger
(or smaller) mean than the other (directional or one-tailed test). In our example, if we wish to
examine whether students in the special math curriculum perform better than those in the
traditional curriculum, we have a one-tailed test. This hypothesis can be stated as:
H0: μ1 ≤ μ2 (null hypothesis)
H1: μ1 > μ2 (alternative hypothesis)
where μ1 represents the mean population performance of students exposed to the special
curriculum (treatment group) and μ2 is the mean population performance of students with
traditional curriculum (control group). Note that the null hypothesis is always the one with the
“equal” sign, and the goal of all statistical significance tests is to reject the null hypothesis.
How can we infer about the difference in population means using data from samples
drawn from each population? From the hypothetical frequency distributions of the treatment
and control group scores in Figure 15.2, the control group appears to have a bell-shaped
(normal) distribution with a mean score of 45 (on a 0-100 scale), while the treatment group
appears to have a mean score of 65. These means look different, but they are really sample
means (x̄), which may differ from their corresponding population means (μ) due to sampling
error. Sample means are probabilistic estimates of population means within a certain
confidence interval (the 95% CI is the sample mean ± two standard errors, where the standard error is the
standard deviation of the distribution of sample means across infinite samples drawn from the
population). Hence, the statistical significance of the difference between population means depends not only on the sample
mean scores, but also on the standard error or the degree of spread in the frequency
distribution of the sample means. If the spread is large (i.e., the two bell-shaped curves have a
lot of overlap), then the 95% CIs of the two means may also overlap, and we cannot
conclude with high probability (p<0.05) that their corresponding population means are
significantly different. However, if the curves have narrower spreads (i.e., they are less
overlapping), then the CIs of the two means may not overlap, and we reject the null hypothesis and
say that the population means of the two groups are significantly different at p<0.05.
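As an illustration of this logic, the following Python sketch runs the one-tailed two-sample t-test on hypothetical math scores constructed so that the sample means match the 45 and 65 described for Figure 15.2; the data are invented for illustration, and the `alternative` argument of scipy.stats.ttest_ind assumes SciPy 1.6 or newer.

import numpy as np
from scipy import stats

# Hypothetical math scores (0-100 scale) with sample means of 65 and 45.
treatment = np.array([65, 63, 67, 64, 66, 62, 68, 65, 66, 64])  # special program
control   = np.array([45, 43, 47, 44, 46, 42, 48, 45, 46, 44])  # traditional curriculum

# Directional (one-tailed) two-sample t-test of H1: mean(treatment) > mean(control).
t_stat, p_value = stats.ttest_ind(treatment, control, alternative="greater")

print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the treatment group mean is significantly larger.")
else:
    print("Fail to reject H0.")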
Quantitative Analysis:
Inferential Statistics
Inferential statistics are the statistical procedures that are used to reach conclusions
about associations between variables. They differ from descriptive statistics in that they are
explicitly designed to test hypotheses. Numerous statistical procedures fall in this category,
most of which are supported by modern statistical software such as SPSS and SAS. This chapter
provides a short primer on only the most basic and frequent procedures; readers are advised to
consult a formal text on statistics or take a course on statistics for more advanced procedures.
Basic Concepts
British philosopher Karl Popper said that theories can never be proven, only disproven.
As an example, how can we prove that the sun will rise tomorrow? Popper said that just
because the sun has risen every single day that we can remember does not necessarily mean
that it will rise tomorrow, because inductively derived theories are only conjectures that may or
may not be predictive of future phenomena. Instead, he suggested that we may assume a
theory that the sun will rise every day without necessarily proving it, and if the sun does not
rise on a certain day, the theory is falsified and rejected. Likewise, we can only reject
hypotheses based on contrary evidence but can never truly accept them because presence of
evidence does not mean that we may not observe contrary evidence later. Because we cannot
truly accept a hypothesis of interest (alternative hypothesis), we formulate a null hypothesis as
the opposite of the alternative hypothesis, and then use empirical evidence to reject the null
hypothesis to demonstrate indirect, probabilistic support for our alternative hypothesis.
A second problem with testing hypothesized relationships in social science research is
that the dependent variable may be influenced by an infinite number of extraneous variables
and it is not feasible to measure and control for all of these extraneous effects. Hence, even if
two variables may seem to be related in an observed sample, they may not be truly related in
the population, and therefore inferential statistics are never certain or deterministic, but always
probabilistic.
How do we know whether a relationship between two variables in an observed sample
is significant, and not a matter of chance? Sir Ronald A. Fisher, one of the most prominent
statisticians in history, established the basic guidelines for significance testing. He said that a
statistical result may be considered significant if it can be shown that the probability of it
occurring by chance alone is 5% or less. In inferential statistics, this probability is called the p-value,
5% is called the significance level (α), and the desired relationship between the p-value
and α is denoted as p≤0.05. The significance level is the maximum level of risk that we are
willing to accept as the price of our inference from the sample to the population. If the p-value
is less than 0.05 (or 5%), it means that there is less than a 5% probability that the observed result
is due to chance alone, and hence at most a 5% risk of incorrectly rejecting the null hypothesis
(a Type I error). If p>0.05, we do not have enough evidence to reject
the null hypothesis or accept the alternative hypothesis.
We must also understand three related statistical concepts: sampling distribution,
standard error, and confidence interval. A sampling distribution is the theoretical
distribution of a statistic (such as the sample mean) computed from an infinite number of samples
drawn from the population of interest in your study. However, because a sample is never identical
to the population, every sample estimate has some inherent level of error, called the standard error,
which is the standard deviation of the sampling distribution. If this standard error is small, then statistical
estimates derived from the sample (such as the sample mean) are reasonably good estimates of the
population. The precision of our sample estimates is expressed in terms of a confidence interval
(CI). A 95% CI is a range of approximately plus or minus two standard errors around the sample
estimate. Hence, when we say that our observed sample estimate has a 95% CI, what we mean is
that we are confident that, 95% of the time, the population parameter lies within about two standard
errors of our observed sample estimate. Jointly, the p-value and the CI give us a good idea of the
probability of our result and how close it is to the corresponding population parameter.
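As a rough illustration, the following Python sketch computes the standard error and an approximate 95% CI for the mean of a small hypothetical sample; the data and the use of the numpy and scipy libraries are assumptions for illustration only.

import numpy as np
from scipy import stats

# Hypothetical sample of 25 scores drawn from a population with mean 50.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=25)

mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

# 95% CI: roughly the mean plus or minus two standard errors; stats.t.ppf gives
# the exact multiplier (about 2.06 for df = 24) instead of the rule of thumb.
t_mult = stats.t.ppf(0.975, df=len(sample) - 1)
ci_low, ci_high = mean - t_mult * std_err, mean + t_mult * std_err

print(f"mean = {mean:.2f}, SE = {std_err:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")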
General Linear Model
Most inferential statistical procedures in social science research are derived from a
general family of statistical models called the general linear model (GLM). A model is an
estimated mathematical equation that can be used to represent a set of data, and linear refers to
a straight line. Hence, a GLM is a system of equations that can be used to represent linear
patterns of relationships in observed data.
Figure 15.1. Two-variable linear model
The simplest type of GLM is a two-variable linear model that examines the relationship
between one independent variable (the cause or predictor) and one dependent variable (the
effect or outcome). Let us assume that these two variables are age and self-esteem respectively.
The bivariate scatterplot for this relationship is shown in Figure 15.1, with age (predictor)
along the horizontal or x-axis and self-esteem (outcome) along the vertical or y-axis. From the
scatterplot, it appears that individual observations representing combinations of age and self-esteem
generally seem to be scattered around an imaginary upward sloping straight line.
The easiest way to test for the above hypothesis is to look up critical values of r from
statistical tables available in any standard text book on statistics or on the Internet (most
software programs also perform significance testing). The critical value of r depends on our
desired significance level (α = 0.05), the degrees of freedom (df), and whether the desired test is
a one-tailed or two-tailed test. The degrees of freedom is the number of values that can vary
freely in the calculation of a statistic. In the case of correlation, the df simply equals n – 2; for the
data in Table 14.1, df is 20 – 2 = 18. There are two different statistical tables for one-tailed and
two-tailed tests. In the two-tailed table, the critical value of r for α = 0.05 and df = 18 is 0.44. For
our computed correlation of 0.79 to be significant, it must be larger than the critical value of
0.44 or less than -0.44. Since our computed value of 0.79 is greater than 0.44, we conclude that
there is a significant correlation between age and self-esteem in our data set, or in other words,
the odds are less than 5% that this correlation is a chance occurrence. Therefore, we can reject
the null hypothesis that r ≤ 0, which is an indirect way of saying that the alternative hypothesis
r > 0 is probably correct.
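The same test can be carried out in software rather than with a printed table. The sketch below uses hypothetical age and self-esteem values as stand-ins for the actual data of Table 14.1, computes the correlation and its p-value with scipy.stats.pearsonr, and recovers the two-tailed critical value of r (about 0.44 for df = 18) from the t distribution.

import numpy as np
from scipy import stats

# Hypothetical age and self-esteem values for n = 20 respondents.
age = np.array([21, 25, 28, 30, 33, 35, 38, 40, 42, 45,
                47, 50, 52, 55, 57, 60, 62, 65, 68, 70], dtype=float)
self_esteem = np.array([3.1, 3.3, 3.2, 3.8, 3.6, 4.0, 4.1, 4.3, 4.2, 4.6,
                        4.4, 4.8, 4.7, 5.0, 5.1, 5.3, 5.2, 5.6, 5.5, 5.8])

r, p_two_tailed = stats.pearsonr(age, self_esteem)

# Critical value of r for alpha = 0.05 and df = n - 2 (two-tailed), recovered
# from the t distribution instead of a printed statistical table.
n = len(age)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
r_crit = t_crit / np.sqrt(t_crit**2 + n - 2)

print(f"r = {r:.2f}, two-tailed p = {p_two_tailed:.4f}, critical r = {r_crit:.2f}")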
Most research studies involve more than two variables. If there are n variables, then we
will have a total of n*(n-1)/2 possible correlations between these n variables. Such correlations
are easily computed using a software program like SPSS, rather than manually using the
formula for correlation (as we did in Table 14.1), and represented using a correlation matrix, as
shown in Table 14.2. A correlation matrix is a matrix that lists the variable names along the
first row and the first column, and depicts bivariate correlations between pairs of variables in
the appropriate cell in the matrix. The values along the principal diagonal (from the top left to
the bottom right corner) of this matrix are always 1, because any variable is always perfectly
correlated with itself. Further, since correlations are non-directional, the correlation between
variables V1 and V2 is the same as that between V2 and V1. Hence, the lower triangular matrix
(values below the principal diagonal) is a mirror reflection of the upper triangular matrix
(values above the principal diagonal), and therefore, we often list only the lower triangular
matrix for simplicity. If the correlations involve variables measured using interval scales, then
this specific type of correlation is called the Pearson product moment correlation.
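In practice, such a matrix is a one-line operation in most statistical packages. The following sketch builds a Pearson correlation matrix with the pandas library for three hypothetical interval-scaled variables; the data are randomly generated for illustration only.

import numpy as np
import pandas as pd

# Three hypothetical interval-scaled variables with 50 random observations each.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "V1": rng.normal(size=50),
    "V2": rng.normal(size=50),
    "V3": rng.normal(size=50),
})

# Pearson product moment correlation matrix; the principal diagonal is always 1,
# and the matrix is symmetric because correlations are non-directional.
print(df.corr(method="pearson").round(2))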
Another useful way of presenting bivariate data is cross-tabulation (often abbreviated
to cross-tab, and sometimes referred to more formally as a contingency table). A cross-tab is a table
that describes the frequency (or percentage) of all combinations of two or more nominal or
categorical variables. As an example, let us assume that we have the following observations of
gender and grade for a sample of 20 students, as shown in Table 14.3. Gender is a nominal
variable (male/female or M/F), and grade is a categorical variable with three levels (A, B, and
C). A simple cross-tabulation of the data may display the joint distribution of gender and grades
(i.e., how many students of each gender are in each grade category, as a raw frequency count or
as a percentage) in a 2 x 3 matrix. This matrix will help us see if A, B, and C grades are equally
distributed across male and female students. The cross-tab data in Table 14.3 shows that the
distribution of A grades is biased heavily toward female students: in a sample of 10 male and 10
female students, five female students received the A grade compared to only one male student.
In contrast, the distribution of C grades is biased toward male students: three male students
received a C grade, compared to only one female student. However, the distribution of B grades
was somewhat uniform, with six male students and five female students. The last row and the
last column of this table are called marginal totals because they indicate the totals across each
category and are displayed along the margins of the table.
Table 14.2. A hypothetical correlation matrix for eight variables
Table 14.3. Example of cross-tab analysis
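A cross-tab of this kind can be produced directly from raw observations. The sketch below uses pandas with hypothetical gender and grade values chosen to approximate the counts described above (the exact cell counts of Table 14.3 are not reproduced here), and adds the marginal totals.

import pandas as pd

# Hypothetical gender and grade observations for 20 students; the counts are
# chosen to approximate (not reproduce) Table 14.3, with each gender totaling 10.
gender = ["M"] * 10 + ["F"] * 10
grade = (["A"] * 1 + ["B"] * 6 + ["C"] * 3 +   # male students
         ["A"] * 5 + ["B"] * 4 + ["C"] * 1)    # female students

# 2 x 3 cross-tabulation with marginal totals along the last row and column.
crosstab = pd.crosstab(pd.Series(gender, name="Gender"),
                       pd.Series(grade, name="Grade"),
                       margins=True)
print(crosstab)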
Although we can see a distinct pattern of grade distribution between male and female
students in Table 14.3, is this pattern real or “statistically significant”? In other words, do the
above frequency counts differ from what may be expected from pure chance? To answer
this question, we should compute the expected count of observations in each cell of the 2 x 3
cross-tab matrix. This is done by multiplying the marginal column total and the marginal row
total for each cell and dividing it by the total number of observations. For example, for the
male/A grade cell, expected count = 5 * 10 / 20 = 2.5. In other words, we were expecting 2.5
male students to receive an A grade, but in reality, only one student received the A grade.
Whether this difference between expected and actual count is significant can be tested using a
chi-square test. The chi-square statistic is computed as the sum, across all cells, of the squared
difference between the observed and expected counts divided by the expected count; this value is
then compared against the critical value of the chi-square distribution for the appropriate degrees
of freedom to determine whether the observed pattern is statistically significant.
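A minimal sketch of this test, again using the approximate counts from the cross-tab sketch above, is shown below; scipy.stats.chi2_contingency returns the chi-square statistic, its p-value, the degrees of freedom, and the table of expected counts.

import numpy as np
from scipy import stats

# Observed counts from the hypothetical 2 x 3 cross-tab above
# (rows: male, female; columns: A, B, C grades).
observed = np.array([[1, 6, 3],
                     [5, 4, 1]])

# chi2_contingency computes expected counts as row total * column total / grand
# total for every cell and tests the observed counts against them.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("expected counts:\n", expected.round(2))
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")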
Bivariate Analysis
Bivariate analysis examines how two variables are related to each other. The most
common bivariate statistic is the bivariate correlation (often, simply called “correlation”),
which is a number between -1 and +1 denoting the strength of the relationship between two
variables. Let’s say that we wish to study how age is related to self-esteem in a sample of 20
respondents, i.e., as age increases, does self-esteem increase, decrease, or remain unchanged?
If self-esteem increases, then we have a positive correlation between the two variables; if self-esteem
decreases, we have a negative correlation; and if it remains the same, we have a zero
correlation. To calculate the value of this correlation, consider the hypothetical dataset shown
in Table 14.1.
Figure 14.2. Normal distribution
Table 14.1. Hypothetical data on age and self-esteem
The two variables in this dataset are age (x) and self-esteem (y). Age is a ratio-scale
variable, while self-esteem is an average score computed from a multi-item self-esteem scale
measured using a 7-point Likert scale, ranging from “strongly disagree” to “strongly agree.” The
histogram of each variable is shown on the left side of Figure 14.3. The formula for calculating
bivariate correlation is:
rxy = Σ (xi – x̄)(yi – ȳ) / [(n – 1) sx sy]
where rxy is the correlation, x̄ and ȳ are the sample means of x and y, and sx and sy are
the standard deviations of x and y. The manually computed value of correlation between age
and self-esteem, using the above formula as shown in Table 14.1, is 0.79. This figure indicates
that age has a strong positive correlation with self-esteem, i.e., self-esteem tends to increase
with increasing age, and decrease with decreasing age. Such a pattern can also be seen by
visually comparing the age and self-esteem histograms shown in Figure 14.3, where it appears
that the tops of the two histograms generally follow each other. Note here that the vertical axes
in Figure 14.3 represent actual observation values, and not the frequency of observations (as
was the case in Figure 14.1), and hence, these are not frequency distributions but rather histograms.
The bivariate scatter plot in the right panel of Figure 14.3 is essentially a plot of self-esteem on
the vertical axis against age on the horizontal axis. This plot roughly resembles an upward
sloping line (i.e., positive slope), which is also indicative of a positive correlation. If the two
variables were negatively correlated, the scatter plot would slope down (negative slope),
implying that an increase in age would be related to a decrease in self-esteem and vice versa. If
the two variables were uncorrelated, the scatter plot would approximate a horizontal line (zero
slope), implying that an increase in age would have no systematic bearing on self-esteem.
Figure 14.3. Histogram and correlation plot of age and self-esteem
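The formula above can be verified numerically. The following sketch applies it to hypothetical age and self-esteem values (not the actual data of Table 14.1) and checks the result against numpy's built-in correlation function.

import numpy as np

# Hypothetical age (x) and self-esteem (y) values.
x = np.array([21.0, 25, 30, 35, 40, 45, 50, 55, 60, 65])
y = np.array([3.2, 3.1, 3.8, 3.9, 4.2, 4.5, 4.4, 5.0, 5.2, 5.5])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)        # sample standard deviations

# r_xy = sum((x_i - x_bar) * (y_i - y_bar)) / ((n - 1) * s_x * s_y)
r_manual = np.sum((x - x_bar) * (y - y_bar)) / ((n - 1) * s_x * s_y)
r_builtin = np.corrcoef(x, y)[0, 1]            # should match r_manual

print(round(r_manual, 3), round(r_builtin, 3))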
After computing bivariate correlation, researchers are often interested in knowing
whether the correlation is significant (i.e., a real one) or caused by mere chance. Answering
such a question would require testing the following hypothesis:
H0: r = 0
H1: r ≠ 0
H0 is called the null hypothesis, and H1 is called the alternative hypothesis (sometimes
also represented as Ha). Although they may seem like two hypotheses, H0 and H1 actually
represent a single hypothesis, since they are direct opposites of each other. We are interested in
testing H1 rather than H0. Also note that H1 is a non-directional hypothesis, since it does not
specify whether r is greater than or less than zero. A directional hypothesis would be specified as
H0: r ≤ 0; H1: r > 0 (if we are testing for a positive correlation). Significance testing of a directional
hypothesis is done using a one-tailed t-test, while that of a non-directional hypothesis is done
using a two-tailed t-test.
In statistical testing, the alternative hypothesis cannot be tested directly. Rather, it is
tested indirectly by rejecting the null hypotheses with a certain level of probability. Statistical
testing is always probabilistic, because we are never sure if our inferences, based on sample
data, apply to the population, since our sample never equals the population. The probability
that a statistical inference is caused by pure chance is called the p-value. The p-value is compared
with the significance level (α), which represents the maximum level of risk that we are willing
to take that our inference is incorrect. For most statistical analyses, α is set to 0.05. A p-value
less than α=0.05 indicates that we have enough statistical evidence to reject the null hypothesis,
and thereby indirectly accept the alternative hypothesis. If p>0.05, then we do not have
adequate statistical evidence to reject the null hypothesis or accept the alternative hypothesis.
Univariate Analysis
Univariate analysis, or analysis of a single variable, refers to a set of statistical
techniques that can describe the general properties of one variable. Univariate statistics
include: (1) frequency distribution, (2) central tendency, and (3) dispersion. The frequency
distribution of a variable is a summary of the frequency (or percentages) of individual values
or ranges of values for that variable. For instance, we can measure how often respondents in a
sample attend religious services (as a measure of their “religiosity”) using a categorical
scale: never, once per year, several times per year, about once a month, several times per
month, several times per week, and an optional category for “did not answer.” If we count the
number (or percentage) of observations within each category (except “did not answer” which is
really a missing value rather than a category), and display it in the form of a table as shown in
Figure 14.1, what we have is a frequency distribution. This distribution can also be depicted in
the form of a bar chart, as shown on the right panel of Figure 14.1, with the horizontal axis
representing each category of that variable and the vertical axis representing the frequency or
percentage of observations within each category.
Figure 14.1. Frequency distribution of religiosity
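A frequency distribution of this kind is easy to produce in software. The sketch below tabulates counts and percentages for a small set of hypothetical religiosity responses using pandas; the responses are invented for illustration.

import pandas as pd

# Hypothetical responses to the religiosity question (missing answers excluded).
responses = pd.Series([
    "never", "once per year", "several times per year", "about once a month",
    "several times per month", "never", "once per year", "several times per week",
    "about once a month", "several times per year", "never", "once per year",
])

counts = responses.value_counts()                        # frequency of each category
percentages = responses.value_counts(normalize=True) * 100

print(pd.DataFrame({"count": counts, "percent": percentages.round(1)}))
# counts.plot(kind="bar") would draw the bar chart described in Figure 14.1.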
With very large samples where observations are independent and random, the
frequency distribution tends to follow a plot that looks like a bell-shaped curve (a smoothed
bar chart of the frequency distribution) similar to that shown in Figure 14.2, where most
observations are clustered toward the center of the range of values, and fewer and fewer
observations toward the extreme ends of the range. Such a curve is called a normal distribution.
Central tendency is an estimate of the center of a distribution of values. There are
three major estimates of central tendency: mean, median, and mode. The arithmetic mean
(often simply called the “mean”) is the simple average of all values in a given distribution.
Consider a set of eight test scores: 15, 22, 21, 18, 36, 15, 25, 15. The arithmetic mean of these
values is (15 + 22 + 21 + 18 + 36 + 15 + 25 + 15)/8 = 20.875. Other types of means include the
geometric mean (the nth root of the product of n numbers in a distribution) and the harmonic mean (the
reciprocal of the arithmetic mean of the reciprocals of the values in a distribution), but these
means are not very popular for statistical analysis of social research data.
The second measure of central tendency, the median, is the middle value within a range
of values in a distribution. This is computed by sorting all values in a distribution in increasing
order and selecting the middle value. In case there are two middle values (if there is an even
number of values in a distribution), the average of the two middle values represents the median.
In the above example, the sorted values are: 15, 15, 15, 18, 21, 22, 25, 36. The two middle
values are 18 and 21, and hence the median is (18 + 21)/2 = 19.5.
Lastly, the mode is the most frequently occurring value in a distribution of values. In
the previous example, the most frequently occurring value is 15, which is the mode of the above
set of test scores. Note that any value that is estimated from a sample, such as the mean, median,
mode, or any of the later estimates, is called a statistic.
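The central tendency figures above can be reproduced with Python's standard statistics module, as in the following sketch (geometric_mean assumes Python 3.8 or newer; the scores are the eight test scores used above).

import statistics

scores = [15, 22, 21, 18, 36, 15, 25, 15]

print("mean   =", statistics.mean(scores))     # 20.875
print("median =", statistics.median(scores))   # 19.5
print("mode   =", statistics.mode(scores))     # 15

# The other means mentioned above (geometric_mean requires Python 3.8+).
print("geometric mean =", round(statistics.geometric_mean(scores), 3))
print("harmonic mean  =", round(statistics.harmonic_mean(scores), 3))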
Dispersion refers to the way values are spread around the central tendency, for
example, how tightly or how widely are the values clustered around the mean. Two common
measures of dispersion are the range and standard deviation. The range is the difference
between the highest and lowest values in a distribution. The range in our previous example is
36-15 = 21.
The range is particularly sensitive to the presence of outliers. For instance, if the
highest value in the above distribution was 85 and the other values remained the same, the range
would be 85 – 15 = 70. Standard deviation, the second measure of dispersion, does not depend solely
on the two extreme values; its formula takes into account how close or how far each value is from the
distribution mean:
σ = √[ Σ (xi – µ)² / n ]
where σ is the standard deviation, xi is the ith observation (or value), µ is the arithmetic mean, n
is the total number of observations, and Σ means summation across all observations. The
square of the standard deviation is called the variance of a distribution. In a normally
distributed frequency distribution, about 68% of the observations lie within one standard deviation of
the mean, and about 95% lie within two standard deviations of the mean.
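The range, standard deviation, and variance of the eight test scores above can be computed as in the following sketch, which applies the population formula (dividing by n) given above.

import numpy as np

scores = np.array([15, 22, 21, 18, 36, 15, 25, 15])

value_range = scores.max() - scores.min()                  # 36 - 15 = 21
sigma = np.sqrt(np.mean((scores - scores.mean()) ** 2))    # population formula above

print("range              =", value_range)
print("standard deviation =", round(float(sigma), 3))      # same as np.std(scores)
print("variance           =", round(float(sigma) ** 2, 3)) # square of the std. dev.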