Day By Day
Notes for PBIS 189
Spring 2006
Activity: Go over syllabus. Take roll. Overview examples: Gilbert trial, election polls, spam filters.
Goals: Review course objectives: collect data, summarize information, make inferences.
Reading: To The Student, pages xx-xxv
Activity: Video 1 – Overview of Statistics. Discussion of variables and graphs.
Goals: Get a feel for what questions we answer with statistics. Begin graphical summaries (describing data with pictures).
Skills:
•
Identify types of
variables. To choose the proper graphical displays, it is
important to be able to differentiate between Categorical and Quantitative (or
Numerical) variables.
•
Be familiar with
types of graphs. To graph categorical variables we use bar graphs or
pie graphs. To graph numerical
variables, we use histograms, stemplots, or CUMPLOT (TI-83 program). In
practice, most of our variables will be numerical but it is still important to
choose the right display.
Reading: Chapter 1 (Skip Time Plots)
Activity: Use the monarchs dataset to create the histograms, stemplots, and cumplots for
the variable "years reigned" separately for the Saxon Rulers (829 to
1066), the rulers from William I to Henry VI (1066 to 1471), the rulers from
Edward IV to Charles I (1461 to 1649), and the rulers from Charles II to
present (1660 to 1998). Compare and interpret the graphs. Identify shape, center, and
spread.
Useful commands for the calculator:
STAT EDIT (use one of the lists to enter data, L1 for example; the other L's can be used too)
2nd STATPLOT 1 On (Use this screen to designate the plot settings. You can have up to three plots on the screen at once. For now we will only use one at a time.)
ZOOM 9 This command centers the window around your data.
CUMPLOT This program I wrote plots the sorted data and "stacks" them up.
Goals: Be able
to use the calculator to make a histogram or a cumplot. Be able to make a stemplot by hand.
Skills:
•
Summarize data into a
frequency table. The easiest way to make a frequency table is to TRACE the boxes in a histogram and record the classes and
counts. You can control the size and
number of the classes with Xscl and Xmin in the WINDOW menu. The decision as to
how many classes to create is arbitrary; there isn't a "right"
answer. One popular suggestion is to try the square root of the number of data values. For example, if there are 25 data points, use 5
intervals. If there are 50 data points, try 7 intervals. This is a
rough rule; you should experiment with it. The TI-83 has its own rule for choosing classes; I do not know what that rule is. You should
experiment by changing the interval width and see what happens to the diagram.
•
Use the TI-83 to
create an appropriate histogram or cumplot. STAT PLOT is our main tool for
viewing distributions of data.
Histograms are common displays, but have flaws; the choice of class
width is troubling as it is not unique.
The cumplot is more reliable, but less common. For interpretation purposes, remember that in a histogram
tall boxes represent places with lots of data, while in a cumulative plot those
same high-density data places are steep.
•
Create a stemplot by
hand. The stemplot is a convenient manual display; it is
most useful for small datasets, but not all datasets make good stemplots. Choosing the "stem" and
"leaves" to make reasonable displays will require some practice. Some notes for proper choice of stems:
if you have many empty rows, you have too many stems. Move one column to the left and try again. If you have too few rows (all the data
is on just one or two stems) you have too few stems. Move to the right one digit and try again. Some datasets will not give good
pictures for any choice of stem, and some benefit from splitting or rounding
(see the example in the text).
•
Describe shape,
center, and spread. From each of our graphs, you should be able to make
general statements about the shape, center, and spread of the distribution of
the variable being explored.
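The square-root suggestion and the TRACE-based frequency table can also be sketched off the calculator. The Python below is only an illustration: the data values are made up (they are not the monarchs dataset), the function names are hypothetical, and it assumes a value sitting exactly on a class boundary is counted in the class to its right.

```python
import math

def suggested_classes(n):
    """Square-root rule of thumb: roughly sqrt(n) classes for n data values."""
    return round(math.sqrt(n))

def frequency_table(data, xmin, width):
    """Count the values in each class [lower, lower + width), keyed by lower edge."""
    counts = {}
    for x in data:
        k = math.floor((x - xmin) / width)   # index of the class containing x
        lower = xmin + k * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical data, for illustration only.
years = [2, 5, 9, 13, 13, 17, 21, 22, 24, 35, 35, 38, 50, 56, 63]

print(suggested_classes(25), suggested_classes(50))   # 5 and 7, as in the text
print(frequency_table(years, xmin=0, width=10))
```

Here xmin and width play the roles of Xmin and Xscl in the WINDOW menu. Empty classes (40 to 50 in this made-up list) simply do not appear in the table.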
Reading: Chapter 1 (Skip Time Plots)
Activity: Video 2 – Lightning
Research. Dance Fever example.
To calculate our summary statistics, we will use 1-Var Stats (which uses List 1 by default) or, for example, 1-Var Stats L2 for List 2. There are two screens of output; we will be mostly concerned
with the mean x-bar, the standard deviation Sx, and the five-number summary on screen two.
Goals: Observe
the creation and interpretation of graphical displays in practice. Compare numerical measures of center.
Skills:
•
Understand the effect
of outliers on the mean. The mean (or average) is unduly influenced by outlying
(unusual) observations. Therefore,
knowing when your distribution is skewed is helpful.
•
Understand the effect
of outliers on the median. The median is almost completely
unaffected by outliers. For
technical reasons, though, the median is not as common in scientific
applications as the mean.
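A small numerical check makes the contrast between the two measures concrete. The sketch below uses a made-up list and then adds one unusual value; it is an illustration only, not data from the course.

```python
from statistics import mean, median

values = [10, 12, 13, 15, 16]      # hypothetical data
with_outlier = values + [90]       # the same list plus one unusual observation

print(mean(values), median(values))              # 13.2 and 13
print(mean(with_outlier), median(with_outlier))  # 26 and 14.0
```

The single outlier roughly doubles the mean, while the median moves by only one unit.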
Reading: Chapter 2
Activity: Use the monarchs dataset to calculate the mean, the standard deviation, the
5-number summary, and the associated boxplot
for the variable "years reigned" separately for the Saxon Rulers (829
to 1066), the rulers from William I to Henry VI (1066 to 1471), the rulers from
Edward IV to Charles I (1461 to 1649), and the rulers from Charles II to
present (1660 to 1998).
Compare these measures with the corresponding
histogram and cumulative plot.
Note the similarities (where the data values are dense, and where they
are sparse) but especially note the differences. The boxplots and numerical measures cannot describe
shape. The histograms are hard to
use to compare two lists. The stem
and leaf is difficult to modify.
Answer these
questions:
1) Has the variable "years
reigned" changed over time?
2) How does a single case affect the
calculator's routines?
3) What information does the boxplot
disguise?
Goals: Summarize
data with numerical measures and boxplots. Compare these new measures with the histograms, stemplots,
and cumplots you made on Day 3.
Skills:
•
Use the TI-83 to
calculate summary statistics. Calculating may be as simple as entering numbers into
your calculator and pressing a button.
Or, if you are doing some things by hand, you may have to organize
information the correct way, such as listing the numbers from low to high. On the TI-83, the numerical measures
are accessed through the 1-Var Stats function in the STAT CALC menu.
Please get used to using the statistical features of your calculator to
produce the mean. While I know you
can calculate the mean by simply adding up all the numbers and dividing by the
sample size, you will not be in the habit of using the full features of your
machine, and later on you will be 'missing the boat'.
•
Compare several lists
of numbers using boxplots. For two lists, the best simple approach is the
back-to-back stemplot. For more
than two lists, I suggest trying boxplots, side-by-side, or stacked. At a glance, then, you can assess which
lists have typically larger values or more spread out values, etc.
•
Understand
boxplots. You should know that the boxplots for some lists don't
tell the interesting part of those lists.
For example, boxplots do not
describe shape; you can only see where the quartiles are. On the other hand, you should know that the
boxplot can be a very good first quick look.
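For anyone organizing the numbers by hand, the five-number summary behind a boxplot can be sketched as follows. This is an illustration under an assumption: quartiles are taken as the medians of the lower and upper halves of the sorted list, leaving out the overall median when the count is odd. Other quartile conventions exist, so software output may differ slightly. The data are made up.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # values above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data, for illustration only.
reigns = [2, 5, 9, 13, 13, 17, 21, 22, 24, 35, 35]

low, q1, med, q3, high = five_number_summary(reigns)
print(low, q1, med, q3, high)   # 2 9 17 24 35
print("IQR =", q3 - q1)         # IQR = 15
```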
Reading: Chapter 2
Activity: Create
the following lists:
1) A list of 10 numbers that has
only one number below the mean.
2) A list of 10 numbers that has the
standard deviation greater than the mean.
3) A list of 10 numbers that has a
standard deviation of zero.
For your fourth list start with any 21 numbers. Find a number N
such that 14 of the numbers in your list are within N of the average.
For example, pick a number N
(say 4), calculate the average plus 4, the average minus 4, and count how many
numbers in your list are between those two values. If the count is less than 14, try a larger number for N (bigger than 4). If the count is more than 14, try a smaller number for N (smaller than 4).
Finally, compare the standard deviation to the
Interquartile Range (IQR = Q3 - Q1).
Goals: Interpret
standard deviation as a measure of spread.
Skills:
•
Understand standard
deviation. At first, standard deviation will seem foreign to you,
but I believe that it will make more sense the more you become familiar with
it. In its simplest terms, the
standard deviation is a non-negative number that measures how "wide" a
dataset is. One common
interpretation is that the range of a dataset is roughly 4 standard deviations. Another interpretation is that the
standard deviation is roughly ¾ times the IQR. Eventually we will use the standard deviation in our
calculations for statistical inference; until then, this measure is just
another summary statistic, and getting used to this number is your goal. The normal curve of Chapter 3 will
further help us understand standard deviation.
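The trial-and-error search for N in the fourth list can be sketched in a few lines. The 21 numbers below are made up and the function names are hypothetical; the point is only to compare the N that captures 14 of the 21 values with the standard deviation. For roughly bell-shaped lists the two tend to be similar, but that is not guaranteed for every list.

```python
from statistics import mean, stdev

def count_within(data, width):
    """How many values lie within `width` of the mean (inclusive)."""
    centre = mean(data)
    return sum(1 for x in data if abs(x - centre) <= width)

def width_capturing(data, k):
    """Smallest N such that at least k values lie within N of the mean."""
    centre = mean(data)
    distances = sorted(abs(x - centre) for x in data)
    return distances[k - 1]

# 21 hypothetical numbers, for illustration only.
data = [3, 7, 8, 12, 13, 14, 18, 19, 21, 22, 25,
        26, 27, 29, 30, 33, 35, 38, 41, 44, 52]

n_width = width_capturing(data, 14)
print("N =", round(n_width, 2), "captures", count_within(data, n_width), "values")
print("standard deviation =", round(stdev(data), 2))
```

The function stdev divides by n - 1, which corresponds to Sx on the 1-Var Stats screen.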
Reading: Chapter 3
Activity: Review Homework 1. Video 3 – Boston Beanstalks. Introduce the TI-83's normal
calculations.
Goals: Introduce
normal curve. Use TI-83 in place
of the standard normal table in the text.
Skills:
•
Using the TI-83 to
find areas under the normal curve.
When we have a distribution that
can be approximated with the bell-shaped normal curve, we can make accurate
statements about frequencies and percentages by knowing just the mean and the
standard deviation of the data.
Our TI-83 has two functions, normalcdf( and invNorm(, which allow us to
calculate these percentages more easily and more accurately than the table in
the text. We use normalcdf( when we want the percentage as an answer, and we use invNorm( when we already know the percentage but not the value
that gives that percentage.
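For readers checking their calculator work on a computer, the two directions of the normal calculation can be mimicked with Python's standard library. This is a sketch, not the calculator's implementation; the wrapper names normalcdf and inv_norm are chosen here only to mirror the TI-83 commands and their argument order.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mu=0.0, sigma=1.0):
    """Value that has the given area to its left."""
    return NormalDist(mu, sigma).inv_cdf(area)

print(round(normalcdf(-1, 1), 4))   # 0.6827: area within one sd of the mean
print(round(inv_norm(0.975), 2))    # 1.96: value with 97.5% of the area below it
```

The first call answers "what percentage?" and the second answers "what value?", which is the same division of labor described above.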
Reading: Chapter 3
Activity: Practice normal calculations.
1) Suppose SAT scores are distributed normally with mean 800 and standard deviation (sd) 100. Estimate the chance that a randomly chosen score will be above 720. Estimate the chance that a randomly chosen score will be between 800 and 900. The top 20% of scores are above what number? (This is called the 80th percentile.)
2) Find the Interquartile Range (IQR) for the standard normal (mean 0, sd 1). Compare this to the standard deviation of 1.
3) Women aged 20 to 29 have normally distributed heights with mean 64 and sd 2.7. Men have mean 69.3 with sd 2.8. What percent of women are taller than the average man, and what percent of men are taller than the average woman?
4) Pretend we are manufacturing fruit snacks, and that the average weight in a package is 0.92 ounces with sd 0.05. What should we label the net weight on the package so that only 5% of packages are "underweight"?
5) Suppose that your average commute time to work is 20 minutes, with an sd of 2 minutes. What time should you leave home to arrive at work on time at 8:00? (You may have to decide a reasonable value for the chance of being late.)
Goals: Master normal calculations. Realize that summarizing using the normal curve is the
ultimate reduction in complexity, but only applies to data whose distribution
is actually bell-shaped.
Skills:
•
Memorize 68-95-99.7
rule. While we do rely on our technology to calculate areas
under normal curves, it is convenient to have some of the values committed to
memory. These values can be used
as rough guidelines; if precision is required, you should use the TI-83
instead. I will assume you know these
numbers by heart when we encounter the normal numbers again in chapters 10 and
13 through 19.
•
Understand that
summarizing with just the mean and standard deviation is a special case. We
have progressed from pictures like histograms to summary statistics like
medians, means, etc. to finally summarizing an entire list with just the mean
and the standard deviation.
However, this last step in our summarization only applies to lists whose distribution resembles the
bell-shaped normal curves. If the
data's distribution is skewed, or has any other shape, this level of
summarization is incomplete. Also,
it is important to realize that these calculations are only approximations.
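The 68-95-99.7 numbers are rounded areas under the standard normal curve. A short sketch using Python's standard library (not the TI-83) shows where they come from:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, sd 1

# Area within 1, 2, and 3 standard deviations of the mean.
areas = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]

for k, area in zip((1, 2, 3), areas):
    print(f"within {k} sd: {100 * area:.1f}%")
```

The printed values are 68.3%, 95.4%, and 99.7%, which is why the memorized rule is a rough guideline rather than an exact statement.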
Reading: Chapters 1 through 3
Activity: Presentations. Graphical (Chapter 1) and Numerical
(Chapter 2) Summaries
Collect or find some data; the quality of the data is not important for this
project. Use 3 to 5 lists of data;
make sure you have enough data so that your summaries are meaningful, say at
least 20 cases. Summarize your
data using both graphical and numerical summaries. Also, make sure you have at least one categorical variable
and at least one numerical variable.
Reading: Chapters 1 through 3
Activity: Exam 1. This first exam will cover graphical summaries (pictures), numerical
summaries (summary calculations) and normal curve calculations (areas under the
bell curve). Some of the questions
will be multiple choice. Others
will require you to show your worked out solution. Chapter reviews are an excellent source for studying for the
exams. Don't forget to review your
class notes and recall what we saw in the videos.
Activity: 1) Using the monarchs data, plot "years reigned" versus "death age". Then guess what the correlation coefficient might be before using your calculator. Use the sample diagrams on page 92 to guide you. Finally, using your calculator, calculate the actual value for the correlation coefficient and compare it to your guess.
2) Outlier effects. With the dataset I give you in class, add an eighth point in three different places and observe how the correlation coefficient changes.
Goals: Display
two variables and measure (and interpret) linear association using the
correlation coefficient.
Skills:
•
Plot data with a
scatterplot. This will be as simple as entering two lists of
numbers into your TI-83 and pressing a few buttons, just as for histograms or
boxplots. Or, if you are doing
plots by hand you will have to first choose an appropriate axis scale and then
plot the points. You should also
be able to describe overall patterns in scatter diagrams and suggest tentative
models that summarize the main features of the relationship, if any.
•
Use the TI-83 to
calculate the correlation coefficient.
We will have to use the
regression function STAT CALC LinReg(ax+b) to
calculate correlation, r. First, you must have run the DiagnosticOn command. Access
this command through the CATALOG (2nd 0). If you
press ENTER after the LinReg(ax+b) command, the calculator assumes your lists are in columns
L1 and L2; otherwise
you will type where they are, for example LinReg(ax+b) L2, L3.
•
Interpret the
correlation coefficient. You should know the range of the correlation
coefficient (-1 to +1) and what a 'typical' diagram looks like for various
values of the correlation coefficient.
Again, page 92 is your guide.
You should recognize some of the things the correlation coefficient does
not measure, such as the strength
of a non-linear pattern. You should also recognize how outliers
influence the magnitude of the correlation coefficient. One simple way to observe the effects
of outliers is to calculate the correlation coefficient with and without the
outlier in the dataset and compare the two values. If the values vary greatly (this is a judgment call) then
you would say the outlier is "influential".
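The with-and-without comparison can be sketched directly from the definition of the correlation coefficient. The seven points below are made up (they are not the in-class dataset), and the eighth point is placed deliberately far from the pattern.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Seven hypothetical points with a strong upward pattern.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2, 3, 5, 4, 6, 7, 8]

r_without = correlation(xs, ys)
r_with = correlation(xs + [8], ys + [0])   # add an eighth point far below the pattern

print(round(r_without, 3), round(r_with, 3))
```

With these made-up numbers, r drops from about 0.96 to about 0.21, so the added point would be called influential.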
Reading: Chapter 4
Activity: Video 4 – Manatees. Correlation summary.
1) The variables can be entered in
any order; correlation is a fact about a pair of variables.
This will be different when we get to regression; there, the order the
variables are presented matters.
2) We must have numerical variables to calculate correlation. For categorical variables, we will use
contingency tables, in Chapter 6.
3) High correlation does not
necessarily mean a straight line scatterplot. US population growth is an example.
4) Correlation is not resistant;
the dataset from Day 11 showed that the placement of a single point in the
scatterplot can greatly influence the value of the correlation.
Goals: See scatterplots and correlation in practice. Understand correlation's limitations and features.
Skills:
•
Recognize the proper
use of correlation, and know how it is abused. Correlation
measures straight line
relationships. Any departures from
that model make the correlation coefficient less reliable as a summary measure.
Just as for the standard deviation and the mean, the correlation coefficient is
affected by outliers. Therefore,
it is extremely important to be aware of data that is unusual. Some 2-dimensional outliers are hard to
detect with summary statistics; scatterplots are a must then.
Reading: Chapter 5
Activity: 1) Using the Olympic data, fit a regression line to predict the
2004 and 2008 race results.
2) Revisit outliers dataset,
adding regression lines.
Goals: Practice
using regression with the TI-83.
We want the regression equation, the regression line superimposed on the
plot, the correlation coefficient, and we want to be able to use the line to
predict new values.
Skills:
•
Fit a line to data. This
may be as simple as 'eyeballing' a straight line to a scatter plot. However, to be more precise, we will
use least squares, STAT CALC LinReg(ax+b) on the
TI-83, to calculate the coefficients, and VARS Statistics EQ RegEQ to type the equation in the Y= menu.
You should also be able to sketch a line onto a scatter plot (by hand)
by knowing the regression coefficients.
•
Interpret regression
coefficients. Usually, we want to only interpret slope, and slope is
best understood by examining the units involved, such as inches per year or
miles per gallon, etc. Because
slope can be thought of as "rise" over "run", we are
looking for the ratio of the units involved in our two variables. More precisely, the slope tells us the
change in the response variable for a unit change in the explanatory
variable. We don't typically
bother interpreting the intercept, as zero is often outside of the range of
experimentation.
•
Estimate/predict new
observations using the regression line.
Once we have calculated a regression
equation, we can use it to predict new responses. The easiest way to use the TI-83 for this is to TRACE on the regression line. You may need to use up and down arrows to toggle back and
forth from the plot to the line.
You may also just use the equation itself by multiplying the new x-value by the slope and adding the intercept. (This is exactly what TRACE is doing.)
•
Understand the
limitations and strengths of linear regression. Quite simply,
linear regression should only be used with scatterplots that are roughly linear
in nature. That seems
obvious. However, there is nothing
that prevents us from calculating
the numbers for any data set we can input into our TI-83's. We have to realize what our data looks
like before we calculate the regression;
therefore a scatter plot is essential. In the presence of outliers and
non-linear patterns, we should avoid drawing conclusions from the fitted
regression line.
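The least-squares coefficients and a prediction can be sketched as follows. The variable names a (slope) and b (intercept) follow the LinReg(ax+b) form; the data are made up and are not the Olympic results.

```python
def least_squares(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx          # change in y per unit change in x
    b = my - a * mx        # the line passes through (mean x, mean y)
    return a, b

def predict(a, b, x_new):
    """Multiply the new x-value by the slope and add the intercept."""
    return a * x_new + b

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))        # slope and intercept
print(round(predict(a, b, 6), 2))      # prediction at x = 6, just beyond the data
```

Predicting at x = 6 is a mild extrapolation; the farther a new x-value lies from the observed data, the less the line should be trusted.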
Reading: Chapter 5
Activity: Try to summarize and predict the
population growth in the US. Using
the census data, see if any of the other regression functions in the STAT CALC menu are good models.
Goals: Explore
non-linear regressions on the TI-83.
Skills:
•
Effectively model
using non-linear regression functions.
When we have a relationship
that is non-linear, we try other models.
Because straight lines are easy for us to understand (we are accustomed
to them), the coefficients have meaning.
Some of the other functions available to you are also interpretable,
with some familiarity (which I am not expecting from you) but others have
coefficients that are uninterpretable.
Our main use of these alternate functions is to see the fitted model on
the scatterplot. (We add them to
the scatterplot in the same way as for linear regression: VARS Statistics
EQ RegEQ from the Y= menu.)
•
Understand that a
high value of r² is not
necessarily a good fit. We have seen that when r² = 1, we have a perfect fit. So, you might assume that values very close to 1 are
indicators of very good fits, but this is not necessarily the case. The population data should show us some
high values of r² that are
poor predictive models. Again, we
need the scatterplot along with the equation to make proper conclusions.
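A constructed example shows how a straight line can earn a high r² while still missing the pattern. The data below are curved by construction (y = x²); they are not the census data, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, and r-squared for a straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]          # curved relationship, by construction

slope, intercept, r2 = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(round(r2, 3))                       # about 0.95
print([round(e, 1) for e in residuals])   # positive, then negative, then positive
```

The residuals run positive, negative, then positive again. That systematic curve is exactly what a scatterplot would reveal and what r² alone hides.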
Reading: Chapter 6
Activity: Video 5 – Smoking. Introduce tables of categorical data.
Goals: Introduce
association for categorical variables.
Explore Simpson's paradox.
Skills:
•
Understand that cause
and effect is difficult to establish.
The slogan is
"Association is not the same as Causation." We will encounter this many times throughout the rest of the
course. In the next set of
material (Chapters 7 and 8) we will discuss ways to produce data from which we can draw conclusions about causation.
•
Recognize Simpson's
paradox. Sometimes when data is summarized over several
sub-categories, an association can be reversed. It seems contrary to good common sense, but it is actually
the effects of a lurking variable, and the phenomenon is known as Simpson's
paradox. You should be able to
recognize situations where this paradox might occur. Not all tables of categorical variables will exhibit this
paradox; the tables must be comparing rates over several groups.
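A small set of invented counts shows how the reversal can happen. The numbers below are hypothetical and chosen only to make the arithmetic easy: treatment A has the higher success rate within each group, yet the lower rate once the groups are combined, because A was used mostly on the hard cases.

```python
def rate(successes, total):
    """Success rate as a fraction."""
    return successes / total

# Hypothetical counts: (successes, total) for each treatment within each group.
easy = {"A": (9, 10), "B": (80, 100)}
hard = {"A": (30, 100), "B": (2, 10)}

overall = {}
for t in ("A", "B"):
    successes = easy[t][0] + hard[t][0]
    total = easy[t][1] + hard[t][1]
    overall[t] = rate(successes, total)

print("easy cases:", rate(*easy["A"]), rate(*easy["B"]))   # 0.9 vs 0.8
print("hard cases:", rate(*hard["A"]), rate(*hard["B"]))   # 0.3 vs 0.2
print("combined:  ", round(overall["A"], 3), round(overall["B"], 3))
```

Case difficulty is the lurking variable here: it is related both to which treatment was used and to the chance of success.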
Reading: Chapter 6
Activity: Expected Tables.
Goals: Develop intuition
for when the observed and expected tables are too different.
Skills:
•
Create the table of
expected counts. The primary method of analyzing categorical tables is
comparing the observed data to a table of expected counts. (This material comes from Chapter 20,
but I will not expect you to master Chapter 20.)
•
Recognize when an
association is present. When two categorical variables are associated (much
like when two numerical variables are correlated) we detect this with the χ² test. I will show you a way to decide if the
differences in the tables are too great (STAT TESTS χ²-Test. You must
have the observed table in a matrix.
The expected table will be stored in another matrix. If p
< .05, we conclude the two tables are quite different.)
Reading: Chapters 4 through 6
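The table of expected counts from the previous day follows one rule: each expected count is (row total × column total) / grand total. The sketch below applies that rule to a made-up 2×2 table and computes the χ² statistic. The p-value line is an assumption-laden shortcut that is valid only for a 2×2 table (1 degree of freedom); larger tables need a χ² distribution function, which the calculator supplies.

```python
import math

def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

# Hypothetical 2x2 table of observed counts.
observed = [[30, 20],
            [20, 30]]

stat = chi_square(observed)
p_value = math.erfc(math.sqrt(stat / 2))   # 2x2 tables only (1 degree of freedom)

print(expected_counts(observed))           # every cell is 25.0
print(stat, round(p_value, 4))             # 4.0 and about 0.0455
```

With these made-up counts p is just under .05, so the observed and expected tables would be judged different by the rule stated above.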
Activity: Presentations. Regression/Correlation (Chapters 4 and
5)
Pick one of the 50 states. Predict
the population in the year 2010 using a regression function (not necessarily
linear though). Describe how you
decided upon your model, and explain how good you think your prediction is.
Reading: Chapters 4 through 6
Activity: Exam 2. This second exam covers scatterplots, correlation,
regression, and associations in categorical data. Some of the questions will be multiple choice. Others will require you to show your
work. Chapter reviews are an
excellent source for studying for the exams. Don't forget to review your class notes and recall what we
saw in the videos.
Activity: Video 6 – Frito Lay. History of polls.
Goals: Introduce
sampling. Identify biases. Explore why non-random samples are not trustworthy.
Skills:
•
Understand the issues
of bias. We seek representative samples. The "easy" ways of sampling,
samples of convenience and voluntary response samples, may or may not produce
good samples, and because we don't know the chances of subjects being in such
samples, they are poor sampling methods.
Even when probability methods are used, biases can spoil the
results. Avoiding bias is our
chief concern in designing surveys.
•
Huge samples are not
necessary. One popular misconception about sampling is that if
the population is large, then we need a proportionately large sample. This is just not so. My favorite counter-example is our
method of tasting soup. To find
out if soup tastes acceptable, we mix it up, then sample from it with a
spoon. It doesn't matter to us
whether it is a small bowl of soup, or a huge vat, we still use only a
spoonful. The situation is the
same for statistical sampling; we use a small "spoon", or
sample. The fundamental
requirement though is that the "soup" (our population) is "well
mixed" (as in a simple random sample – see Day 20).
Reading: Chapter 7
Activity: Creating random samples. We will use three methods of sampling
today: dice, Table B in our book, and our calculator. To make the problem feasible, we will only use a population
of size 6. (I know this is
unrealistic in practice, but the point today is to see how randomness works,
and hopefully trust that the results extend to larger problems.) Pretend that the items in our
population (perhaps they are people) are labeled 1 through 6. For each of our methods, you will have
to decide in your group what to do with "ties". Keep in mind the goal of simple random
sampling: at each stage, each remaining item has an equal chance to be the next
item selected.
Using dice, generate a sample of three people. Repeat 20 times.
Using Table B, starting at any haphazard location, select three people. Repeat 20 times.
Using your TI-83, select three people.
The command randInt(2,4,5) will produce 5
numbers between 2 and 4, inclusive, for example.
Your group should have drawn 60 samples at the end. Keep careful track of which samples you selected; record
your results in order, as 125 or 256, for example. (125 would mean persons 1, 2, and 5 were selected.) We will pool the results of everyone's
work together on the board.
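The calculator method can be imitated in Python to see what the pooled results should look like. This is a sketch under one particular choice for handling "ties": a repeated number is ignored and another is drawn. Your group may decide differently. The function name is hypothetical, and random.randint plays the role of randInt(1,6).

```python
import random
from collections import Counter
from itertools import combinations

def draw_srs(population_size=6, sample_size=3):
    """Draw labels one at a time, ignoring repeats, until the sample is full."""
    chosen = []
    while len(chosen) < sample_size:
        pick = random.randint(1, population_size)   # inclusive at both ends
        if pick not in chosen:                      # a repeat is ignored
            chosen.append(pick)
    return "".join(str(p) for p in sorted(chosen))  # recorded in order, e.g. "125"

# Every possible sample of 3 from a population labeled 1 through 6.
all_samples = ["".join(map(str, c)) for c in combinations(range(1, 7), 3)]
print(len(all_samples))            # 20 possible samples

# Pool many draws, as on the board, and tally how often each sample appears.
tally = Counter(draw_srs() for _ in range(6000))
for sample in all_samples:
    print(sample, tally[sample])
```

With 6000 draws each of the 20 samples should appear roughly 300 times; the counts vary from run to run because the draws are random.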
Goals: Gain
practice taking random samples.
Understand what a simple random sample is. Become familiar
with randInt(. Accept that the calculator is random.
Skills:
•
Know the definition
of a Simple Random Sample (SRS). Simple Random Samples can be defined in two ways:
1) An SRS is a sample where, at
each stage, each item has an equal chance to be the next item selected.
2) Any scheme in which every possible sample has an equal chance of being the one selected produces an SRS.
• Select an SRS from a list of items. The TI-83 command randInt( will select numbered items from a list randomly. If a number selected is already in the list, ignore that nu