Day By Day
Notes for MATH 201
Fall 2006
Activity: Go over syllabus. Take roll. Overview examples: Gilbert trial, election polls, spam
filters
Goals: Review
course objectives: collect data, summarize information, make
inferences.
Reading: To The Student, pages
xxxi-xxxiv.
Activity: Discussion of variables and
graphs. From a list of numbers,
communicate the important information to the person next to you. (Work in pairs or groups.) For your list of numbers, make a
frequency table and a histogram.
Useful commands for the calculator:
STAT
EDIT (use one of the lists to enter data,
L1 for example; the other L's can be used too.)
2nd
STATPLOT 1 On (Use this screen to
designate the plot settings. You
can have up to three plots on the screen at once. For now we will only use one at a time.)
ZOOM
9 This command centers the window around
your data.
In your description to your
neighbor, keep in mind these terms:
symmetry, skew, center, spread, mode, outlier. Also make sure that you try different window settings for
your histogram.
Goals: Begin
graphical summaries (describing data with pictures). Be able to use the calculator to make a
histogram.
Skills:
…
Identify types of
variables. To choose the proper graphical displays, it
is important
to be able to differentiate between Categorical (or Qualitative) and
Quantitative (or Numerical) variables.
…
Be familiar with
types of graphs. To graph categorical variables we use bar graphs or
pie graphs. To graph numerical
variables, we use histograms, stem plots, or QUANTILE (a TI-83 program we will explore on Day 3). In practice, most of our variables will
be numerical but it is still important to choose the right
display.
…
Summarize data into a
frequency table. The easiest way to make a frequency table is to first
use the TI-83 to make a histogram and then to TRACE over the boxes and record the classes and
counts. You can control the size
and number of the classes with Xscl
and Xmin
in the WINDOW menu. The decision as to
how many classes to create is arbitrary; there isn't a "right"
answer, or rather all choices of Xscl
and Xmin are "right"
answers. One popular suggestion is
to try the square root of the number of data values. For example, if there are 25 data points, use 5
intervals. If there are 50 data points, try 7
intervals. This is a rough rule;
you should experiment with it. The
TI-83 has a rule for doing this; I do not know what their rule is. You should develop your intuitions by
changing the interval width Xscl and
starting point Xmin and see what happens to
the display.
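The square-root suggestion and the roles of Xmin and Xscl can be sketched in Python. This is only a rough illustration, not a reproduction of what the TI-83 does. The function names suggested_classes and frequency_table and the sample list are made up for this sketch, and each class is assumed to include its left endpoint but not its right endpoint.

```python
import math


def suggested_classes(n):
    """Square-root rule of thumb: about sqrt(n) classes, rounded to the nearest integer."""
    return round(math.sqrt(n))


def frequency_table(data, xmin, xscl):
    """Count the values falling in each class [xmin + k*xscl, xmin + (k+1)*xscl)."""
    counts = {}
    for x in data:
        k = math.floor((x - xmin) / xscl)  # index of the class containing x
        counts[k] = counts.get(k, 0) + 1
    # Report only the classes that contain data, from lowest to highest.
    return [(xmin + k * xscl, xmin + (k + 1) * xscl, counts[k]) for k in sorted(counts)]


data = [61, 64, 67, 70, 72, 75, 75, 78, 81, 88]  # made-up list for illustration
for low, high, count in frequency_table(data, xmin=60, xscl=10):
    print(f"{low} to {high}: {count}")
```

Changing xmin or xscl in the call regroups the same data, which is the experiment suggested above.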
…
Know how to create
and interpret graphs for categorical variables. The two main
graphs for categorical variables are pie graphs and bar charts. Pie graphs are difficult to make by
hand, but popular in computer programs like Excel. Bar charts are also common in spreadsheets. Data represented by pie graphs and bar
charts usually are expressed as percents of the whole; thus they add to
100%. The ordering of categories
is arbitrary; therefore concepts such as skew and center make no sense.
Reading: Section 1.1. (Skip Time Plots and
Time Series.)
Activity: Use the "Arizona Temps" dataset to practice creating
the histograms, stem plots, and quantile plots for several lists. Compare and interpret the graphs. Identify shape, center, and spread.
QUANTILE is a program I wrote that plots the sorted data in a
list and "stacks" them up.
This is also known as a quantile plot. Basically we are graphing the data value versus its rank, or
percentile, in the dataset. The
syntax is PRGM EXEC QUANTILE ENTER.
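The idea of graphing each data value against its rank can be sketched in Python. This is not the QUANTILE program itself; the function name quantile_points is made up, and the convention that the i-th smallest of n values gets the fraction i/n is an assumption chosen for simplicity.

```python
def quantile_points(data):
    """Pair each sorted value with the fraction of the list at or below it."""
    values = sorted(data)
    n = len(values)
    # Assumed convention: the i-th smallest value (counting from 1) gets fraction i/n.
    return [((i + 1) / n, v) for i, v in enumerate(values)]


for fraction, value in quantile_points([72, 65, 80, 68]):  # made-up list
    print(f"{fraction:.2f}  {value}")
```

Plotting these pairs gives the "stacked" picture: stretches where many values sit close together rise quickly.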
Answer these questions:
1) Do any of the lists have
outliers?
2) What information does the stem
plot show that the histogram hides?
3) What information does the
quantile plot show that the stem plot hides?
Goals: Be able
to use the calculator to make (and be able to interpret) a quantile plot, using
the program QUANTILE. Be able to make a stem plot
by hand.
Skills:
…
Use the
TI-83 to create
an appropriate histogram or quantile plot. STAT PLOT and QUANTILE are our two main tools for viewing distributions of
data. Histograms are common
displays, but have flaws; the choice of class width is troubling as it is not
unique. The quantile plot is more
reliable, but less common. For
interpretation purposes, remember that in a histogram tall boxes represent
places with lots of data, while in a quantile plot those same high-density data
places are represented by steepness.
…
Create a stem plot by
hand. The stem plot is a convenient manual display; it is
most useful for small datasets, but not all datasets make good stem plots. Choosing the "stem" and
"leaves" to make reasonable displays will require some practice. Some notes for proper choice of stems:
if you have many empty rows, you have too many stems. Move one column to the left and try again. If you have too few rows (all the data
is on just one or two stems) you have too few stems. Move to the right one digit and try again. Some datasets will not give good
pictures for any choice of stem, and some benefit from splitting or rounding
(see the example on page 13).
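The effect of the choice of stem can be explored with a small Python sketch. The function name stem_plot and the parameter leaf_unit are made up for this illustration; the sketch assumes non-negative data, rounds each value to the leaf unit, and does not split stems.

```python
from collections import defaultdict


def stem_plot(data, leaf_unit=1):
    """Return the rows of a stem plot; the last digit (in leaf units) is the leaf."""
    rows = defaultdict(list)
    for x in sorted(data):
        scaled = int(round(x / leaf_unit))  # round the value to the leaf unit
        stem, leaf = divmod(scaled, 10)     # split off the final digit as the leaf
        rows[stem].append(leaf)
    # Include empty stems between the smallest and largest so gaps stay visible.
    return [
        f"{stem:>3} | {''.join(str(leaf) for leaf in rows.get(stem, []))}"
        for stem in range(min(rows), max(rows) + 1)
    ]


for row in stem_plot([12, 15, 21, 34, 38, 39]):  # made-up list
    print(row)
```

Multiplying leaf_unit by 10 corresponds to moving the stem one column to the left, and dividing it by 10 corresponds to moving one digit to the right.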
…
Describe shape,
center, and spread.
From each of our graphs, you should be able to make
general statements about the shape, center, and spread of the distribution of
the variable being explored. One
of the main conclusions we want to make about lists of data when we are doing
inference (Chapters 6 to 8) is whether the data is close to symmetric; many
times "close enough" is, well, close enough! We will discuss this in more detail
when we see the Central Limit Theorem in Chapter 5.
Reading: Section 1.2.
Activity: Dance Fever example. Use the "Arizona Temps" dataset to calculate the mean,
the standard deviation, the 5-number summary, and the associated box plot for
any of the variables.
Compare these measures with the corresponding histograms and quantile plots you
did on Day 3. Note the
similarities (where the data values are dense, and where they are sparse) but
especially note the differences.
The box plots and numerical measures cannot describe shape very
well. The histograms are hard to
use to compare two lists. The stem
plot is difficult to modify.
Answer these questions:
1) Are high and low temperatures
distributed the same way, other than the obvious fact that highs are higher
than lows?
2) How does a single case affect the
calculator's routines? (What if we
had had an outlier?)
3) What information does the box
plot disguise?
To calculate our summary statistics, we will use 1-Var Stats (which uses List 1) or, for example, 1-Var Stats L2 for List 2. There are two screens of output; we will be mostly concerned
with the mean x̄, the standard deviation Sx, and the five-number summary on screen two.
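For checking calculator output away from the TI-83, the same summaries can be sketched in Python. The function name one_var_stats is made up. Sx is the sample standard deviation (dividing by n - 1). The quartiles here are taken as the medians of the lower and upper halves of the sorted list, leaving out the middle value when the count is odd; that convention is an assumption of this sketch, and other software may use a different one.

```python
import statistics


def one_var_stats(data):
    """Mean, sample standard deviation, and five-number summary of a list."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]        # lower half (middle value excluded when n is odd)
    upper = values[n - half:]    # upper half
    return {
        "mean": statistics.mean(values),
        "Sx": statistics.stdev(values),  # sample standard deviation, divides by n - 1
        "min": values[0],
        "Q1": statistics.median(lower),
        "median": statistics.median(values),
        "Q3": statistics.median(upper),
        "max": values[-1],
    }


print(one_var_stats([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # made-up list
```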
Goals: Compare
numerical measures of center.
Summarize data with numerical measures and box plots. Compare these new measures with the
histograms, stem plots, and quantile plots you made on Day
3.
Skills:
…
Understand the effect
of outliers on the mean.
The mean (or average) is unduly influenced by outlying
(unusual) observations. Therefore,
knowing when your distribution is skewed is
helpful.
…
Understand the effect
of outliers on the median. The median is almost completely
unaffected by outliers. For
technical reasons, though, the median is not as common in scientific
applications as the mean.
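A quick numerical check of these two skills, as a sketch with a made-up list: add one unusual value and see how far each measure of center moves. The variable names are hypothetical.

```python
import statistics

base = [70, 72, 73, 75, 76]   # made-up list with no outlier
with_outlier = base + [140]   # the same list with one unusual value added

mean_shift = statistics.mean(with_outlier) - statistics.mean(base)
median_shift = statistics.median(with_outlier) - statistics.median(base)

print("change in mean:  ", round(mean_shift, 2))
print("change in median:", round(median_shift, 2))
```

The single added value pulls the mean up by more than 11 units, while the median moves by only 1.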
…
Use the TI-83 to
calculate summary statistics.
Calculating may be as simple as entering numbers into
your calculator and pressing a button.
Or, if you are doing some things by hand, you may have to organize
information the correct way, such as listing the numbers from low to high. On the TI-83, the numerical measures
are calculated using STAT CALC 1-Var
Stats L#. Please
get used to using the statistical features of your calculator to produce the
mean. While I know you can
calculate the mean by simply adding up all the numbers and dividing by the
sample size, doing it that way will not build the habit of using the full features of your
machine, and later on you will be missing out.
…
Compare several lists
of numbers using box plots.
For two lists, the best simple approach is the
back-to-back stem plot. For more
than two lists, I suggest trying box plots, side-by-side, or stacked. At a glance, then, you can assess which
lists have typically larger values or more spread out values,
etc.
…
Understand box
plots. You should know that the box plots for some lists
don't tell the interesting part of those lists. For example, box plots do not describe shape very well; you can only see where the
quartiles are. On the other hand, you
should know that the box plot can
be a very good first quick look.
Reading: Section 1.2.
Activity: Create the following lists:
1) A list of 10 numbers that has
only one number below the mean.
2) A list of 10 numbers that has the
standard deviation greater than the mean.
3) A list of 10 numbers that has a
standard deviation of zero.
For your fourth list start with any 21 numbers. Find a number N
such that 14 of the numbers in your list are within N of the average.
For example, pick a number N
(say 4), calculate the average plus 4, the average minus 4, and count how many
numbers in your list are between those two values. If the count is less than 14, try a larger number
for N (bigger than 4). If the count is more than 14, try a smaller number
for N (smaller than 4).
Finally, compare the standard deviation to the Inter Quartile Range (IQR = Q3 -
Q1).
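The counting step for the fourth list can be sketched in Python. The function name count_within is made up, and "between those two values" is assumed to include the endpoints.

```python
import statistics


def count_within(data, n_dist):
    """Count the values lying within n_dist of the average (endpoints included)."""
    avg = statistics.mean(data)
    return sum(1 for x in data if avg - n_dist <= x <= avg + n_dist)


data = list(range(1, 22))  # made-up list of 21 numbers: 1, 2, ..., 21
for n_dist in (4, 5, 6, 7):
    print(n_dist, count_within(data, n_dist))
print("standard deviation:", round(statistics.stdev(data), 2))
```

Trying several values of N in the loop mirrors the trial-and-error search described above; the last line prints the standard deviation for comparison.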
(You may use any extra time today to discuss Presentation 1 in your groups.)
Goals: Interpret
standard deviation as a measure of spread.
Skills:
…
Understand standard
deviation. At first, standard deviation will seem foreign to you,
but I believe that it will make more sense the more you become familiar with
it. In its simplest terms, the
standard deviation is a non-negative number that measures how "wide" a
dataset is. One common
interpretation is that the range of a dataset is roughly 4 standard deviations. Another interpretation is that the
standard deviation is roughly ¾ times the IQR. Eventually we will use the standard deviation in our
calculations for statistical inference; until then, this measure is just
another summary statistic, and getting used to this number is your goal. The normal curve of the next section
will further help us understand standard deviation.
Reading: Section 1.3.
Activity: Introduce the TI-83's normal
calculations.
Homework 1 due.
DISTR
normalcdf( lower, upper ) calculates the
area under a normal curve between lower
and upper. If you specify just 2 values, mean 0
and standard deviation 1 are assumed.
If you want a different mean or standard deviation, add a third and
fourth parameter.
Example: DISTR normalcdf(
-10, 20, 5, 10 ) finds the area between
-10 and +20 on a normal curve with mean 5 and standard deviation 10
while DISTR normalcdf(
-2, 2 ) finds the area on the standard normal curve between -2 and
+2.
DISTR
invNorm( works backwards: you give it an area, and it
returns the value upper that has that area below it.
This value is also referred to as a percentile. The 90th percentile is that point at which 90 %
of the observations are below. The
syntax is DISTR invNorm( .90 ) or DISTR invNorm(
.90, 5, 10 ) ; the first example assumes
the standard normal curve and reports the 90th percentile. The second example uses a mean of 5 and
a standard deviation of 10 and also reports the 90th percentile.
Note that if the desired area is above a certain number, you will have to use subtraction or
symmetry, as DISTR
invNorm( only
reports values below, or to the left.
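For readers who want to check these calculations without the calculator, here is a sketch of the same two operations using Python's standard library. The function names normalcdf and inv_norm are chosen to echo the TI-83 commands; they are not the calculator's code.

```python
from statistics import NormalDist


def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)


def inv_norm(area, mean=0.0, sd=1.0):
    """Value that has the given area below it (to the left)."""
    return NormalDist(mean, sd).inv_cdf(area)


print(normalcdf(-10, 20, 5, 10))  # area between -10 and 20, mean 5, sd 10
print(normalcdf(-2, 2))           # standard normal curve
print(inv_norm(0.90))             # 90th percentile, standard normal
print(inv_norm(0.90, 5, 10))      # 90th percentile, mean 5, sd 10
# For an area ABOVE a value, subtract first: the cutoff for the top 10 %
# is the value with 1 - 0.10 = 0.90 of the area below it.
print(inv_norm(1 - 0.10))
```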
Goals: Introduce
normal curve. Use TI-83 in place
of the standard normal table in the text.
Skills:
…
Know what a
z-score is (standardization).
Sometimes, instead of knowing a variable's actual value, we
are only interested
in how far above or below average it is.
This information is contained in the z-score.
Negative values indicate a below average observation, while positive
values are above average. If the
list follows a normal distribution (the familiar "bell-shaped" curve)
then it will be relatively rare to have values below -2 or above +2 (only about
5 % of cases). Even if the list is
not normal, surprisingly the z-score
still tends to have few values beyond ±2, although this is not
guaranteed.
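Standardizing a list can be sketched in Python as follows. The function name z_scores is made up, and the sketch assumes the sample standard deviation (the calculator's Sx) is the one used.

```python
import statistics


def z_scores(data):
    """Express each value as a number of standard deviations above or below the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (Sx)
    return [(x - mean) / sd for x in data]


print(z_scores([2, 4, 6]))  # made-up list: one below, one at, one above average
```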
…
Using the TI-83 to
find areas under the normal curve.
When we have a distribution
that can be approximated with the bell-shaped normal curve, we can make
accurate statements about frequencies and percentages by knowing just the mean
and the standard deviation of the data.
Our TI-83 has 2 functions, DISTR normalcdf( and DISTR invNorm(
which allow us to calculate these percentages more easily and more accurately
than the table in the text. We use
DISTR
normalcdf( when we want the percentage as
an answer and we use DISTR invNorm( when we already
know the percentage but not the value that gives that percentage.
Reading: Section 1.3.
Activity: Practice normal calculations.
1) Suppose SAT scores are
distributed normally with mean 800 and standard deviation 100. Estimate the chance that a randomly
chosen score will be above 720.
Estimate the chance that a randomly chosen score will be between 800 and
900. The top 20% of scores are
above what number? (This is called
the 80th percentile.)
2) Find the Inter Quartile Range
(IQR) for the standard normal (mean 0, standard deviation 1). Compare this to the standard deviation
of 1.
3) Women aged 20 to 29 have
normally distributed heights with mean 64 and standard deviation 2.7. Men have mean 69.3 with standard
deviation 2.8. What percent of
women are taller than the average man, and what percentage of men are taller
than the average woman?
4) Pretend we are manufacturing
fruit snacks, and that the average weight in a package is 0.92 ounces with
standard deviation 0.05. What
should we label the net weight on the package so that only 5 % of packages are
"underweight"?
5) Suppose that your average
commute time to work is 20 minutes, with standard deviation of 2 minutes. What time should you leave home to
arrive to work on time at 8:00?
(You may have to decide a reasonable value for the chance of being
late.)
Goals: Master
normal calculations. Realize that
summarizing using the normal curve is the ultimate reduction in complexity, but
only applies to data whose distribution is actually
bell-shaped.
Skills:
…
Memorize 68-95-99.7
rule. While we do rely on our technology to calculate areas
under normal curves, it is convenient to have some of the values committed to
memory. These values can be used
as rough guidelines; if precision is required, you should use the TI-83
instead. I will assume you know
these numbers by heart when we encounter the normal numbers again in chapters 5
through 8.
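The three memorized values can be compared against more precise areas with a short Python sketch. The variable names are made up; the areas are for the standard normal curve within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```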
…
Understand that
summarizing with just the mean and standard deviation is a special case. We
have progressed from pictures like histograms and quantile plots to summary
statistics like medians, means, and standard deviations to finally summarizing
an entire list with just two numbers: the mean and the standard deviation. However, this last step in our
summarization only applies to
lists whose distribution resembles the bell-shaped normal curve. If the data's distribution is skewed,
or has any other shape, this level of summarization is insufficient. Also, it is important to realize that
these calculations are only approximations.
…
Interpret a normal
quantile plot. We often want to know if a list of data can be
approximated with a normal curve.
While we might try histograms and quantile plots to see if
they "look
normal", it is a difficult task, because we have to match the shape to the
very special shape of the normal curve.
One simple alternative graphical method is the normal
quantile plot. This
plot is nearly identical to a quantile plot, but instead of graphing the
percentiles, we graph the z-scores. Our TI-83 does this for us; it is the sixth
icon under Type in the STAT PLOT menu. Be cautious though; the graph, as
usual, is unlabeled. However, we
only care if the graph is nearly a straight line or not.
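The construction behind a normal quantile plot can be sketched in Python. The function name normal_quantile_points is made up, and the rule that the i-th smallest of n values is matched with the fraction (i - 0.5)/n is an assumption of this sketch; the TI-83 may use a slightly different rule.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Pair each sorted value with the z-score a normal curve gives its percentile."""
    values = sorted(data)
    n = len(values)
    std_normal = NormalDist()
    # Assumed convention: the i-th smallest value (counting from 1) sits at
    # fraction (i - 0.5) / n, which keeps every fraction strictly between 0 and 1.
    return [(v, std_normal.inv_cdf((i + 0.5) / n)) for i, v in enumerate(values)]


for value, z in normal_quantile_points([61, 64, 66, 67, 70, 75]):  # made-up list
    print(value, round(z, 2))
```

If the list is close to normal, plotting these pairs gives points that fall near a straight line.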
Reading: Sections 2.1 and 2.2.
Activity: Using the "Arizona Temps" data, plot "Flagstaff
High" versus "Phoenix High".
Then guess what the correlation coefficient might be without using your calculator. Use the sample diagrams on page 126 to guide you.
Finally, using your calculator, calculate the actual value for the correlation
coefficient and compare it to your guess.
Repeat for the variables "Flagstaff High" and "Flagstaff
Low".
Goals: Display
two variables and measure (and interpret) linear association using the
correlation coefficient.
Skills:
…
Plot data with a
scatter plot. This will be as simple as entering two lists of
numbers into your TI-83 and pressing a few buttons, just as for histograms or
box plots. Or, if you are doing
plots by hand you will have to first choose an appropriate axis scale and then
plot the points. You should also
be able to describe overall patterns in scatter diagrams and suggest tentative
models that summarize the main features of the relationship, if
any.
…
Use the TI-83 to
calculate the correlation coefficient.
We will have to use the
regression function STAT CALC LinReg(ax+b) to
calculate correlation, r. First, you will have to have run DiagnosticOn. Access
this command through the CATALOG (2nd 0). If you
press ENTER after the STAT CALC
LinReg(ax+b) command, the calculator
assumes your lists are in columns L1 and
L2; otherwise you will type where they are,
for example STAT CALC
LinReg(ax+b) L2, L3.
…
Interpret the
correlation coefficient.
You should know the range of the correlation
coefficient (-1 to +1) and what a "typical" diagram looks like for
various values of the correlation coefficient. Again, page 126 is your guide. You should recognize some of the things the correlation
coefficient does not measure, such
as the strength of a non-linear
pattern.
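The correlation coefficient itself can be sketched in Python, which also makes the last point concrete. The function name correlation and both small datasets are made up for illustration.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r for paired lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# A perfectly straight-line pattern.
print(correlation([1, 2, 3], [2, 4, 6]))
# A perfect but curved pattern (y = x squared); r is 0, since r measures
# only the strength of a straight-line relationship.
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```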
Reading: Section 2.2.
Activity: Outlier effects on
correlation. The dataset we will
explore today has 7 data points.
Plot them and calculate the correlation coefficient.
Add an eighth point in three different places and for each new dataset,
recalculate the correlation coefficient.
Summarize the effect of outliers in a paragraph.
(You may use any extra time today to discuss Presentation 1 in your
groups.) Homework 2
due.
Goals:
Understand
the impact of outliers on correlation.
Skills:
…
Interpret the
correlation coefficient. You should recognize how outliers
influence the magnitude of the correlation coefficient. One simple way to observe the
effects of
outliers is to calculate the correlation coefficient with and without the
outlier in the dataset and compare the two values. If the values vary greatly (this is a judgment call) then
you would say the outlier is "influential".
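The with-and-without comparison can be sketched in Python. The seven points and the added eighth point below are made up; they are not the dataset used in class, and the helper name correlation is hypothetical.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r for paired lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5, 6, 7]  # made-up data, roughly linear
ys = [2, 3, 5, 4, 6, 8, 7]

r_without = correlation(xs, ys)
r_with = correlation(xs + [8], ys + [0])  # add one point far below the pattern

print("without the extra point:", round(r_without, 3))
print("with the extra point:   ", round(r_with, 3))
```

Moving the eighth point to other locations (for example, far out along the existing pattern) changes r in a different way, which is the comparison the activity asks for.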
Reading: Section 2.3.
Activity: Using the Olympic data, fit a
regression line to predict the 2004 and 2008 race
results.
Goals: Practice
using regression with the TI-83.
We want the regression equation, the regression line superimposed on the
plot, the correlation coefficient, and we want to be able to use the line to
predict new values.
Skills:
…
Fit a line to
data. This may be as simple as 'eyeballing' a straight line
to a scatter plot. However, to be
more precise, we will use least squares, STAT CALC LinReg(ax+b) on the TI-83, to calculate the
coefficients, and VARS Statistics
EQ RegEQ to type the equation
in the Y= menu.
You should also be able to sketch a line onto a scatter plot (by hand)
by knowing the regression coefficients.
…
Interpret regression
coefficients. Usually, we want to interpret only the slope, and slope is
best understood by examining the units involved, such as inches per year or
miles per gallon, etc. Because
slope can be thought of as "rise" over "run", we are
looking for the ratio of the units involved in our two variables. More precisely, the slope tells us the
change in the response variable for a unit change in the explanatory
variable. We don't
typically bother
interpreting the intercept, as zero is often outside of the range of
experimentation.
…
Estimate/predict new
observations using the regression line.
Once we have calculated a
regression equation, we can use it to predict new responses. The easiest way to use the TI-83 for
this is to TRACE on the regression
line. You may need to use up and
down arrows to toggle back and forth from the plot to the line. You may also just use the equation
itself by multiplying the new x-value
by the slope and adding the intercept.
(This is exactly what TRACE is
doing.)
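The fit-then-predict steps can be sketched in Python with the usual least squares formulas, using the same y = ax + b form as LinReg(ax+b). The function names lin_reg and predict and the data are made up; they are not the Olympic dataset.

```python
def lin_reg(xs, ys):
    """Least squares slope a and intercept b for the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope: change in y per unit change in x
    b = mean_y - a * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b


def predict(a, b, x_new):
    """Multiply the new x-value by the slope and add the intercept."""
    return a * x_new + b


a, b = lin_reg([1, 2, 3, 4], [3, 5, 7, 9])  # made-up data lying exactly on a line
print("slope:", a, "intercept:", b)
print("prediction at x = 5:", predict(a, b, 5))
```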
Reading: Section 2.3.
Activity: Revisit outliers dataset, adding
regression lines. Plot the data
again and calculate the regression line.
Add an eighth point in three different places and for each new dataset,
recalculate the regression line.
Summarize the effect of outliers in a paragraph.
Goals: Practice
using regression with the TI-83.
We want the regression equation, the regression line superimposed on the
plot, the correlation coefficient, and we want to be able to use the line to
predict new values.
Skills:
…
Understand
the limitations
and strengths of linear regression.
Quite simply, linear
regression should only be used with scatter plots that are roughly linear in
nature. That seems obvious. However, there is nothing that prevents
us from calculating the numbers
for any data set we can input into our TI-83's. We have to realize what our data looks like
before we calculate the regression; therefore a scatter plot
is essential. In the presence of outliers and
non-linear patterns, we should avoid drawing conclusions from the fitted
regression line.
Reading: Sections 2.4 and 2.5.
Activity: Correlation/Regression
summary. U. S. population
example. Alternate regression
models. Homework 3
due.
1)