Study Notes: AP Statistics

Review notes for AP Statistics, based on The Princeton Review: Cracking the AP Statistics Exam, 2019 Edition.

Chapter 1 Exploring Data

Collecting Data

The methods used to organize and summarize collected data are known as descriptive methods.

Types of Variables

1. Categorical (Sex, Eye color)
2. Quantitative (Weight, Age)

Types of Data

1. Univariate Data (One measurement of each object)
2. Bivariate Data (Two measurements of each object)

Types of Descriptive Methods

1. Tabular method
2. Graphical method
3. Numerical method

Tabular method

Notations

1. n denotes the number of observations in a data set.
2. f denotes the frequency of a value: the number of times that value occurs in the data set.
3. rf denotes the relative frequency of a value: the ratio of its frequency to the total number of observations. ($\small rf=\frac{f}{n}$)
4. cf denotes the cumulative frequency: the number of observations less than or equal to a specified value.
5. A frequency distribution table lists all possible values of a variable together with their frequencies.

Categorical variables have no sensible ordering of values, so cumulative frequency is meaningless for them.
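The notation above can be illustrated with a short Python sketch. The data set (number of siblings reported by ten students) and the function name `frequency_table` are made up for illustration; a quantitative variable is used so that cumulative frequency is meaningful.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, f, rf, cf), sorted by value."""
    n = len(values)              # n: number of observations
    counts = Counter(values)     # f for each distinct value
    rows = []
    cf = 0
    for value in sorted(counts):
        f = counts[value]
        cf += f                  # cf: observations less than or equal to value
        rows.append((value, f, f / n, cf))
    return rows

# Hypothetical data: number of siblings reported by 10 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
for value, f, rf, cf in frequency_table(siblings):
    print(value, f, rf, cf)
```

The relative frequencies always sum to 1, and the last cumulative frequency always equals n.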

Graphical method for categorical data

1. Bar Chart
2. Pie Chart

Graphical method for quantitative data

Notations
1. Center: Roughly the same number of data points lie to the left and to the right of the center. For roughly symmetric distributions, the center is close to both the median and the mean.
2. Spread: How far the data points lie from the center.
3. Shape: The shape of a distribution tells us where most of the data lie.
• Symmetric distribution: The left side is similar to the right side.
• Right-skewed: Most of the data are on the left, with a long tail extending to the right.
• Left-skewed: Most of the data are on the right, with a long tail extending to the left.

Patterns and Deviations from Patterns

1. Cluster
2. Gap
3. Outlier: An observation which is significantly different from the rest of the data.

Summarizing Distribution

Population: The entire group of individuals or things that we are interested in.
Sample: The part of the population that is actually studied.

Numerical methods for continuous variables

1. Measures of central tendency
2. Measures of variation
3. Measures of position

Measure of central tendency

Mean

The population mean ($\small \mu$): $\small \mu=\frac{\sum_{i=1}^{N}X_{i}}{N}$
The sample mean ($\small \bar{X}$): $\small \bar{X}=\frac{\sum_{i=1}^{n}X_{i}}{n}$

Median

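The observation in the middle of the ordered data set. When the number of observations is even, the median is the mean of the two middle observations.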

Variation

Range: largest measurement – smallest measurement
Interquartile range: The range of the middle 50 percent of the data. ($\small IQR= Q_{3}-Q_{1}$)
Population standard deviation: $\small \sigma=\sqrt{\frac{\sum_{i=1}^{N}(X_{i}-\mu)^{2}}{N}}$
Sample standard deviation: $\small s=\sqrt{\frac{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}{n-1}}$
Variance: The square of the standard deviation.
When there are outliers, the IQR is a better measure of spread than the standard deviation, because it is not affected by extreme values.
A large standard deviation indicates a larger spread among the measurements.
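The measures above can be compared on a small made-up sample. This is only a sketch: the data are hypothetical, and the helper `quartiles` uses one common textbook convention (medians of the lower and upper halves), while software packages may compute quartiles slightly differently.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    One common convention, assumed here for illustration.
    Needs at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # values below the median position
    upper = ordered[len(ordered) - half:]    # values above the median position
    return statistics.median(lower), statistics.median(upper)

# Hypothetical sample; 40 is far from the rest and acts as an outlier.
data = [2, 4, 4, 5, 7, 9, 10, 40]
trimmed = data[:-1]                          # same sample without the outlier

for label, sample in (("with outlier", data), ("without outlier", trimmed)):
    q1, q3 = quartiles(sample)
    print(label,
          "| mean:", statistics.mean(sample),
          "| median:", statistics.median(sample),
          "| range:", max(sample) - min(sample),
          "| IQR:", q3 - q1,
          "| sigma (divide by N):", round(statistics.pstdev(sample), 2),
          "| s (divide by n - 1):", round(statistics.stdev(sample), 2))
```

In this sample, removing the outlier barely changes the IQR (5.5 versus 5), while the standard deviation shrinks several-fold.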

Measure of Position

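Percentiles: $\small P_{k}$ is the kth percentile, the value at or below which k percent of the observations fall.
Quartiles: $\small Q_{1}$ is the 25th percentile, $\small Q_{2}$ (the median) is the 50th percentile, and $\small Q_{3}$ is the 75th percentile.
Standardized scores or z-scores: $\small \text{z-score}=\frac{\text{measurement}-\text{mean}}{\text{standard deviation}}$
The z-score gives the distance between an observation and the mean, measured in standard deviations. A positive z-score means the observation is above the mean; a negative one means it is below.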
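As a quick illustration of the z-score formula, here is a minimal sketch with made-up test scores. It treats the five scores as the whole population, so it uses the population standard deviation; that choice is an assumption of the example.

```python
import statistics

def z_score(measurement, mean, standard_deviation):
    """Distance from the mean, in units of standard deviations."""
    return (measurement - mean) / standard_deviation

# Hypothetical test scores, treated as the entire population.
scores = [60, 70, 80, 90, 100]
mean = statistics.mean(scores)
sd = statistics.pstdev(scores)

for score in scores:
    print(score, round(z_score(score, mean, sd), 2))
```

A score equal to the mean has a z-score of 0, and scores equally far above and below the mean have z-scores of equal size and opposite sign.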

Boxplots

1. Draw a rectangular box from $\small Q_{1}$ to $\small Q_{3}$.
2. Draw a vertical line inside the box at the median.
3. Each whisker is at most 1.5 IQR long, bounded by the fences L and U:
4. $\small L=Q_{1} - 1.5 \text{IQR}$
5. $\small U=Q_{3} + 1.5 \text{IQR}$
6. Draw each whisker from the box to the most extreme observation that still lies between L and U.
7. Plot any observations below L or above U as individual points (outliers).
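The steps above can be sketched numerically without drawing anything. The data and the function name `boxplot_summary` are hypothetical, and the quartiles are again taken as the medians of the lower and upper halves, which is one convention among several.

```python
import statistics

def boxplot_summary(values):
    """Compute the numbers needed to draw a boxplot with 1.5 IQR fences."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr             # L
    upper_fence = q3 + 1.5 * iqr             # U
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        "fences": (lower_fence, upper_fence),
        "whisker_ends": (inside[0], inside[-1]),  # most extreme non-outliers
        "outliers": outliers,
    }

# Hypothetical sample with one unusually large value.
data = [2, 4, 4, 5, 7, 9, 10, 40]
print(boxplot_summary(data))
```

Here the whiskers stop at 2 and 10, the most extreme observations inside the fences, and 40 is plotted separately as an outlier.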

Exploring Bivariate Data

Scatterplot

Shape: A scatterplot tells us whether the nature of the relation between the two variables is linear or nonlinear.
Direction: An increasing (upward) trend is a positive direction; a decreasing (downward) trend is a negative direction.
Strength of relationship: If the trend of the data can be described with a line or a curve and the data points lie close to it, the plot indicates a strong relationship. If the points are widely scattered around it, the relationship is weak.

Correlation Coefficient

The correlation coefficient is denoted by r: $\small -1 \leq r \leq 1$.
If r is positive, the direction of the plot is positive; if r is negative, the direction is negative.
If r is equal to 1 or -1, there is a perfect linear correlation between the two variables.
The farther away the correlation gets from 0, the stronger the relationship between the two variables.
The correlation coefficients are usually computed using a calculator.
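Although a calculator is the usual tool, r can also be computed by hand as the sum of the products of the paired z-scores, divided by n - 1. The notes do not state this formula; it is the standard definition, sketched here with made-up data (hours studied versus exam score).

```python
import statistics

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample SDs
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical paired data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = correlation(hours, scores)
print(round(r, 3))
```

For these data r is close to 1, which matches the strong upward trend; points lying exactly on an increasing line would give r = 1.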

Least-Squares regression line

Regression line: $\small Y=\beta_{0}+\beta_{1}X+\varepsilon$
Y is the dependent variable or response variable.
X is the independent variable or explanatory variable.
$\small \beta_{0}$ is the y-intercept, which is the value of Y for X = 0.
$\small \beta_{1}$ is the slope of the line, which gives the amount of change in Y for every unit change in X.
$\small \varepsilon$ is the random error, which is the difference between the observation and the predicted value.
$\small \hat{Y}$ is the predicted value of Y for a given value of X.

The least-squares regression line is the line that minimizes the sum of the squares of the residuals.
The line of best fit always passes through the point of means $\small (\bar{X},\bar{Y})$ and always has the slope $\small \beta_{1}=r\frac{s_{y}}{s_{x}}$, where $\small s_{x}$ and $\small s_{y}$ are the sample standard deviations of X and Y.
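A minimal sketch of fitting the line from these two facts: the slope comes from r and the two sample standard deviations, and the intercept follows from requiring the line to pass through the point of means. The data and the function name `least_squares_line` are hypothetical.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar        # line passes through the means
    return intercept, slope

# Hypothetical paired data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
intercept, slope = least_squares_line(hours, scores)

predicted = [intercept + slope * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]
print(round(intercept, 2), round(slope, 2), [round(e, 2) for e in residuals])
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a handy check on a hand calculation.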

Transformations to Achieve Linearity

After drawing the scatterplot, you may observe that a linear model won’t fit the plot. In that case, you can either use a non-linear model, which is not tested on the AP exam, or transform the data to achieve linearity.

1. The log transformation: $\small Z=\ln(Y)$
2. The square root transformation: $\small Z=\sqrt{Y}=Y^{\frac{1}{2}}$
3. The reciprocal transformation: $\small Z=\frac{1}{Y}$
4. The square transformation: $\small Z=Y^{2}$
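5. The power transformation: for the model $\small Y=aX^{b}$, taking logs of both sides gives $\small \ln(Y)=\ln(a)+b\ln(X)$, which is linear in $\small \ln(X)$.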
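As a sketch of how a transformation is used in practice, the example below fits a straight line to (ln X, ln Y) for data following a power model; the slope then estimates b and the intercept estimates ln(a). The data are generated exactly from Y = 3X^2, a made-up model chosen so the result is easy to check, and `fit_line` is a hypothetical helper.

```python
import math

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data generated exactly from Y = 3 * X^2.
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Transform both variables, then fit a straight line to the transformed data.
log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]
intercept, b = fit_line(log_x, log_y)
a = math.exp(intercept)                      # undo the log to recover a

print(round(a, 4), round(b, 4))
```

With real data the points will not fall exactly on a line after transforming, so it is worth re-drawing the scatterplot of the transformed data to confirm the pattern looks linear.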

Conditional Relative Frequencies and Association

A table of data classified by r categories of classification criterion 1 and c categories of classification criterion 2 is known as an $\small r \times c$ contingency table.
The conditional relative frequency is the relative frequency of one category given that the other category has occurred.
If there is no association between the two criteria, the expected number of measurements in a given cell of the contingency table is equal to $\small \frac{(\text{row total})(\text{column total})}{\text{total number of measurements}}$
We can compare the expected frequencies with the observed frequencies to determine whether there is an association between the two variables.
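The two calculations above can be sketched for a small table. The counts below are invented, the function names are hypothetical, and the conditional relative frequencies are computed within each row (conditioning on the row category), which is one of the two possible directions.

```python
def expected_counts(table):
    """Expected cell counts: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def row_conditional_frequencies(table):
    """Relative frequency of each column category, given the row category."""
    return [[cell / sum(row) for cell in row] for row in table]

# Hypothetical 2 x 2 table of observed counts:
# rows = group A, group B; columns = answered yes, answered no.
observed = [[30, 20],
            [20, 30]]

print(expected_counts(observed))
print(row_conditional_frequencies(observed))
```

In this made-up table every expected count is 25, while the observed counts are 30 and 20, and the conditional relative frequency of "yes" differs between the rows (0.6 versus 0.4). Differences like these suggest an association, although judging whether they are larger than chance alone would produce requires a formal test.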