Day By Day Notes for MATH 201

Fall 2006

 

Day 1

Activity: Go over syllabus.  Take roll.  Overview examples: Gilbert trial, election polls, spam filters

Goals:     Review course objectives: collect data, summarize information, make inferences.

Reading: To The Student, pages xxxi-xxxiv.

Day 2

Activity: Discussion of variables and graphs.  From a list of numbers, communicate the important information to the person next to you.  (Work in pairs or groups.)  For your list of numbers, make a frequency table and a histogram.


Useful commands for the calculator:
STAT EDIT (use one of the lists to enter data, L1 for example; the other L's can be used too.)
2nd STATPLOT 1 On (Use this screen to designate the plot settings.  You can have up to three plots on the screen at once.  For now we will only use one at a time.)
ZOOM 9 This command centers the window around your data.

 In your description to your neighbor, keep in mind these terms:  symmetry, skew, center, spread, mode, outlier.  Also make sure that you try different window settings for your histogram.

Goals:     Begin graphical summaries (describing data with pictures).  Be able to use the calculator to make a histogram.

Skills:

                        Identify types of variables.  To choose the proper graphical displays, it is important to be able to differentiate between Categorical (or Qualitative) and Quantitative (or Numerical) variables.

                        Be familiar with types of graphs.  To graph categorical variables we use bar graphs or pie graphs.  To graph numerical variables, we use histograms, stem plots, or QUANTILE (a TI-83 program we will explore on Day 3).  In practice, most of our variables will be numerical but it is still important to choose the right display.

                        Summarize data into a frequency table.  The easiest way to make a frequency table is to first use the TI-83 to make a histogram and then to TRACE over the boxes and record the classes and counts.  You can control the size and number of the classes with Xscl and Xmin in the WINDOW menu.  The decision as to how many classes to create is arbitrary; there isn't a "right" answer, or rather all choices of Xscl and Xmin are "right" answers.  One popular suggestion is to try the square root of the number of data values.  For example, if there are 25 data points, use 5 intervals.  If there are 50 data points, try 7 intervals.  This is a rough rule; you should experiment with it.  The TI-83 has its own rule for choosing classes; I do not know what that rule is.  You should develop your intuition by changing the interval width Xscl and the starting point Xmin and seeing what happens to the display.

                        Know how to create and interpret graphs for categorical variables.  The two main graphs for categorical variables are pie graphs and bar charts.  Pie graphs are difficult to make by hand, but popular on computer programs like Excel.  Bar charts are also common on spreadsheets.  Data represented by pie graphs and bar charts usually are expressed as percents of the whole; thus they add to 100%.  The ordering of categories is arbitrary; therefore concepts such as skew and center make no sense.
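For anyone who wants to check a frequency table away from the calculator, here is a minimal Python sketch of the square-root rule described in the skills above.  The name frequency_table and the choice to start the first class at the smallest data value are assumptions made for illustration; the TI-83 uses whatever Xmin and Xscl you set, so its classes may differ.

```python
import math

def frequency_table(data, num_classes=None):
    """Return a list of (lower, upper, count) classes of equal width."""
    if num_classes is None:
        # Square-root rule of thumb: 25 values -> 5 classes, 50 values -> 7.
        num_classes = max(1, round(math.sqrt(len(data))))
    lo, hi = min(data), max(data)
    # Fall back to a width of 1 when every value is identical.
    width = (hi - lo) / num_classes or 1
    counts = [0] * num_classes
    for x in data:
        # The largest value is placed in the last class, not a new one.
        k = min(int((x - lo) / width), num_classes - 1)
        counts[k] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Made-up data: the whole numbers 1 through 25.
table = frequency_table(list(range(1, 26)))
for lower, upper, count in table:
    print(f"{lower:6.1f} to {upper:6.1f}: {count}")
```

Passing a different num_classes plays the same role as changing Xscl on the calculator: the data stay the same, but the picture changes.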

Reading: Section 1.1. (Skip Time Plots and Time Series.)

Day 3

Activity: Use the "Arizona Temps" dataset to practice creating the histograms, stem plots, and quantile plots for several lists.  Compare and interpret the graphs.  Identify shape, center, and spread.
    

              QUANTILE is a program I wrote that plots the sorted data in a list and "stacks" them up.  This is also known as a quantile plot.  Basically we are graphing the data value versus its rank, or percentile, in the dataset.  The syntax is PRGM EXEC QUANTILE ENTER.

Answer these questions:

1)  Do any of the lists have outliers?

2)  What information does the stem plot show that the histogram hides?

3)  What information does the quantile plot show that the stem plot hides?

Goals:     Be able to use the calculator to make (and be able to interpret) a quantile plot, using the program QUANTILE.  Be able to make a stem plot by hand.

Skills:

                        Use the TI-83 to create an appropriate histogram or quantile plot.  STAT PLOT and QUANTILE are our two main tools for viewing distributions of data.  Histograms are common displays, but have flaws; the choice of class width is troubling as it is not unique.  The quantile plot is more reliable, but less common.  For interpretation purposes, remember that in a histogram tall boxes represent places with lots of data, while in a quantile plot those same high-density data places are represented by steepness.

                        Create a stem plot by hand.  The stem plot is a convenient manual display; it is most useful for small datasets, but not all datasets make good stem plots.  Choosing the "stem" and "leaves" to make reasonable displays will require some practice.  Some notes for proper choice of stems: if you have many empty rows, you have too many stems.  Move one column to the left and try again.  If you have too few rows (all the data is on just one or two stems) you have too few stems.  Move to the right one digit and try again.  Some datasets will not give good pictures for any choice of stem, and some benefit from splitting or rounding (see the example on page 13).

                        Describe shape, center, and spread.  From each of our graphs, you should be able to make general statements about the shape, center, and spread of the distribution of the variable being explored.  One of the main conclusions we want to make about lists of data when we are doing inference (Chapters 6 to 8) is whether the data is close to symmetric; many times "close enough" is, well, close enough!  We will discuss this in more detail when we see the Central Limit Theorem in Chapter 5.
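The idea behind a quantile plot (data value versus rank) can be sketched in a few lines of Python.  This is not the QUANTILE program itself; the function name, the use of rank divided by n as the percentile, and the choice to put the data value first in each pair are all assumptions made for illustration.

```python
def quantile_points(data):
    """Pair each sorted value with its rank expressed as a fraction of n."""
    ordered = sorted(data)
    n = len(ordered)
    # Value on the horizontal axis, fraction of the data at or below it
    # on the vertical axis.
    return [(value, rank / n) for rank, value in enumerate(ordered, start=1)]

# Made-up list, not taken from the "Arizona Temps" dataset.
for value, fraction in quantile_points([30, 10, 20, 40]):
    print(value, fraction)
```

With the value on the horizontal axis, a stretch where many data values sit close together makes the fraction climb quickly, which is the steepness described above.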

Reading: Section 1.2.

Day 4

Activity: Dance Fever example.  Use the "Arizona Temps" dataset to calculate the mean, the standard deviation, the 5-number summary, and the associated box plot for any of the variables.

Compare these measures with the corresponding histograms and quantile plots you did on Day 3.  Note the similarities (where the data values are dense, and where they are sparse) but especially note the differences.  The box plots and numerical measures cannot describe shape very well.  The histograms are hard to use to compare two lists.  The stem plot is difficult to modify.

Answer these questions:

1)  Are high and low temperatures distributed the same way, other than the obvious fact that highs are higher than lows?

2)  How does a single case affect the calculator's routines?  (What if we had had an outlier?)

3)  What information does the box plot disguise?

To calculate our summary statistics, we will use STAT CALC 1-Var Stats (which uses List 1 by default) or, for example, STAT CALC 1-Var Stats L2 for List 2.  There are two screens of output; we will be mostly concerned with the mean x̄ and the standard deviation Sx on screen one, and the five-number summary on screen two.

Goals:     Compare numerical measures of center.  Summarize data with numerical measures and box plots.  Compare these new measures with the histograms, stem plots, and quantile plots you made on Day 3.

Skills:

                        Understand the effect of outliers on the mean.  The mean (or average) is unduly influenced by outlying (unusual) observations.  Therefore, knowing when your distribution is skewed is helpful.

                        Understand the effect of outliers on the median.  The median is almost completely unaffected by outliers.  For technical reasons, though, the median is not as common in scientific applications as the mean.

                        Use the TI-83 to calculate summary statistics.  Calculating may be as simple as entering numbers into your calculator and pressing a button.  Or, if you are doing some things by hand, you may have to organize information the correct way, such as listing the numbers from low to high.  On the TI-83, the numerical measures are calculated using STAT CALC 1-Var Stats L#.  Please get used to using the statistical features of your calculator to produce the mean.  While I know you can calculate the mean by simply adding up all the numbers and dividing by the sample size, you will not be in the habit of using the full features of your machine, and later on you will be missing out.

                        Compare several lists of numbers using box plots.  For two lists, the best simple approach is the back-to-back stem plot.  For more than two lists, I suggest trying box plots, side-by-side, or stacked.  At a glance, then, you can assess which lists have typically larger values or more spread out values, etc.

                        Understand box plots.  You should know that the box plots for some lists don't tell the interesting part of those lists.  For example, box plots do not describe shape very well; you can only see where the quartiles are.  Alternatively, you should know that the box plot can be a very good first quick look.
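To see the effect of an outlier on the mean and the median without a calculator, here is a small Python sketch.  The temperatures are made up (they are not the "Arizona Temps" data), and the quartile rule used (the median of each half of the sorted list, leaving out the middle value when the count is odd) is my understanding of what 1-Var Stats reports, so treat it as an assumption.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when the number of observations is odd.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

temps = [61, 64, 66, 67, 70, 72, 75]   # made-up values
with_outlier = temps[:-1] + [120]      # replace the largest value by an outlier

for label, values in (("original", temps), ("with outlier", with_outlier)):
    print(label,
          "mean:", round(statistics.mean(values), 2),
          "median:", statistics.median(values),
          "Sx:", round(statistics.stdev(values), 2),
          "five numbers:", five_number_summary(values))
```

The median is the same for both lists, while the mean and the standard deviation both move toward the outlier.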

Reading: Section 1.2.

Day 5

Activity: Create the following lists:

1)  A list of 10 numbers that has only one number below the mean.

2)  A list of 10 numbers that has the standard deviation greater than the mean.

3)  A list of 10 numbers that has a standard deviation of zero.

4)  For your fourth list, start with any 21 numbers.  Find a number N such that 14 of the numbers in your list are within N of the average.  For example, pick a number N (say 4), calculate the average plus 4 and the average minus 4, and count how many numbers in your list are between those two values.  If the count is less than 14, try a larger number for N (bigger than 4).  If the count is more than 14, try a smaller number for N (smaller than 4).

Finally, compare the standard deviation to the Inter Quartile Range (IQR = Q3 - Q1).

(You may use any extra time today to discuss Presentation 1 in your groups.)

Goals:     Interpret standard deviation as a measure of spread.

Skills:

                        Understand standard deviation.  At first, standard deviation will seem foreign to you, but I believe that it will make more sense the more familiar you become with it.  In its simplest terms, the standard deviation is a non-negative number that measures how "wide" a dataset is.  One common rule of thumb is that the range of a dataset is about 4 standard deviations.  Another is that the standard deviation is roughly ¾ times the IQR.  Both are rough guides that work best for bell-shaped data.  Eventually we will use the standard deviation in our calculations for statistical inference; until then, this measure is just another summary statistic, and getting used to this number is your goal.  The normal curve of the next section will further help us understand standard deviation.
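The trial-and-error search for N in the fourth list can also be sketched in Python.  The function names and the short example list are hypothetical; the second function simply sorts the distances from the average and reads off the k-th smallest, which is one way (not the only way) to find N.

```python
import statistics

def count_within(data, n_width):
    """Count the values that lie within n_width of the mean (inclusive)."""
    avg = statistics.mean(data)
    return sum(1 for x in data if abs(x - avg) <= n_width)

def smallest_width(data, k):
    """Smallest N for which at least k values lie within N of the mean."""
    avg = statistics.mean(data)
    distances = sorted(abs(x - avg) for x in data)
    return distances[k - 1]

data = [1, 2, 3, 4, 10]            # made-up list with mean 4
n_width = smallest_width(data, 3)  # ask for 3 of the 5 values
print(n_width, count_within(data, n_width), statistics.stdev(data))
```

For the activity you would call smallest_width with your 21 numbers and k = 14, then compare the result with the standard deviation.  Fourteen of 21 is two thirds of the list, which is close to the 68 % figure that appears with the normal curve on Day 7.  Ties in the distances can make the count larger than k.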

Reading: Section 1.3.

Day 6

Activity: Introduce the TI-83's normal calculations.  Homework 1 due.

DISTR normalcdf( lower, upper ) calculates the area under a normal curve between lower and upper.  If you specify just 2 values, mean 0 and standard deviation 1 are assumed.  If you want a different mean or standard deviation, add a third and fourth parameter.  Example: DISTR normalcdf( -10, 20, 5, 10 ) finds the area between -10 and +20 on a normal curve with mean 5 and standard deviation 10 while DISTR normalcdf( -2, 2 )  finds the area on the standard normal curve between -2 and +2.

DISTR invNorm( works backwards: you supply an area to the left, and it returns the value (the upper of DISTR normalcdf( ) that has that area below it.  This value is also referred to as a percentile.  The 90th percentile is the point below which 90 % of the observations fall.  The syntax is DISTR invNorm( .90 ) or DISTR invNorm( .90, 5, 10 ) ; the first example assumes the standard normal curve and reports the 90th percentile.  The second example uses a mean of 5 and a standard deviation of 10 and also reports the 90th percentile.

Note that if the desired area is above a certain number, you will have to use subtraction or symmetry, as DISTR invNorm( only works with areas below, or to the left of, a value.

Goals:     Introduce normal curve.  Use TI-83 in place of the standard normal table in the text.

Skills:

                        Know what a z-score is (standardization).  Sometimes, instead of knowing a variable's actual value, we are only interested in how far above or below average it is.  This information is contained in the z-score.  Negative values indicate a below average observation, while positive values are above average.  If the list follows a normal distribution (the familiar "bell-shaped" curve) then it will be relatively rare to have values below -2 or above +2 (only about 5 % of cases).  Even if the list is not normal, surprisingly the z-score still tends to have few values beyond ±2, although this is not guaranteed.

                        Using the TI-83 to find areas under the normal curve.  When we have a distribution that can be approximated with the bell-shaped normal curve, we can make accurate statements about frequencies and percentages by knowing just the mean and the standard deviation of the data.  Our TI-83 has 2 functions, DISTR normalcdf( and DISTR invNorm( which allow us to calculate these percentages more easily and more accurately than the table in the text.  We use DISTR normalcdf( when we want the percentage as an answer and we use DISTR invNorm( when we already know the percentage but not the value that gives that percentage.
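If you want to check your calculator work on a computer, the two TI-83 functions can be imitated with Python's standard statistics module.  This is only a sketch: the lowercase names normalcdf and invnorm are my own wrappers, not part of Python, and they follow the same argument order as the calculator.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def invnorm(area, mean=0.0, sd=1.0):
    """Value that has the given area to its left (a percentile)."""
    return NormalDist(mean, sd).inv_cdf(area)

print(round(normalcdf(-2, 2), 4))           # standard normal, between -2 and +2
print(round(normalcdf(-10, 20, 5, 10), 4))  # mean 5, standard deviation 10
print(round(invnorm(0.90), 4))              # 90th percentile, standard normal
print(round(invnorm(0.90, 5, 10), 4))       # 90th percentile, mean 5, sd 10

# An area ABOVE a value needs subtraction: the value with 10 % above it
# is the value with 90 % below it.
print(round(invnorm(1 - 0.10), 4))
```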

Reading: Section 1.3.

Day 7

Activity: Practice normal calculations.

1)  Suppose SAT scores are distributed normally with mean 800 and standard deviation 100.  Estimate the chance that a randomly chosen score will be above 720.  Estimate the chance that a randomly chosen score will be between 800 and 900.  The top 20% of scores are above what number?  (This is called the 80th percentile.)

2)  Find the Inter Quartile Range (IQR) for the standard normal (mean 0, standard deviation 1).  Compare this to the standard deviation of 1.

3)  Women aged 20 to 29 have normally distributed heights with mean 64 and standard deviation 2.7.  Men have mean 69.3 with standard deviation 2.8.  What percent of women are taller than the average man, and what percentage of men are taller than the average woman?

4)  Pretend we are manufacturing fruit snacks, and that the average weight in a package is .92 ounces with standard deviation 0.05.  What should we label the net weight on the package so that only 5 % of packages are "underweight"?

5)  Suppose that your average commute time to work is 20 minutes, with standard deviation of 2 minutes.  What time should you leave home to arrive to work on time at 8:00?  (You may have to decide a reasonable value for the chance of being late.)

Goals:     Master normal calculations.  Realize that summarizing using the normal curve is the ultimate reduction in complexity, but only applies to data whose distribution is actually bell-shaped.

Skills:

                        Memorize the 68-95-99.7 rule.  While we do rely on our technology to calculate areas under normal curves, it is convenient to have some of the values committed to memory.  These values can be used as rough guidelines; if precision is required, you should use the TI-83 instead.  I will assume you know these numbers by heart when we encounter the normal numbers again in chapters 5 through 8.

                        Understand that summarizing with just the mean and standard deviation is a special case.  We have progressed from pictures like histograms and quantile plots to summary statistics like medians, means, and standard deviations to finally summarizing an entire list with just two numbers: the mean and the standard deviation.  However, this last step in our summarization only applies to lists whose distribution resembles the bell-shaped normal curves.  If the data's distribution is skewed, or has any other shape, this level of summarization is insufficient.  Also, it is important to realize that these calculations are only approximations.

                        Interpret a normal quantile plot.  We often want to know if a list of data can be approximated with a normal curve.  While we might try histograms and quantile plots to see if they "look normal", it is a difficult task, because we have to match the shape to the very special shape of the normal curve.  One simple alternative graphical method is the normal quantile plot.  This plot is nearly identical to a quantile plot, but instead of graphing the percentiles, we graph the z-scores.  Our TI-83 does this for us; it is the sixth icon under Type in the STAT PLOT menu.  Be cautious though; the graph, as usual, is unlabeled.  However, we only care whether the graph is nearly a straight line or not.
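The points of a normal quantile plot can be computed by hand or with a short Python sketch like the one below.  The plotting position (rank minus one half, divided by n) is a common convention that I am assuming here; I am not certain it is exactly what the TI-83 uses, so small differences from the calculator's picture are possible.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each sorted value with the z-score of its plotting position."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [(x, standard_normal.inv_cdf((rank - 0.5) / n))
            for rank, x in enumerate(ordered, start=1)]

# Made-up list; if the points fall near a straight line, a normal
# curve is a reasonable approximation for the data.
for value, z in normal_quantile_points([64, 61, 70, 66, 68]):
    print(value, round(z, 3))
```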

Reading: Sections 2.1 and 2.2.

Day 8

Activity: Using the "Arizona Temps" data, plot "Flagstaff High" versus "Phoenix High".

Then guess what the correlation coefficient might be without using your calculator.  Use the sample diagrams on page 126 to guide you.

Finally, using your calculator, calculate the actual value for the correlation coefficient and compare it to your guess.

Repeat for the variables "Flagstaff High" and "Flagstaff Low".

Goals:     Display two variables and measure (and interpret) linear association using the correlation coefficient.

Skills:

                        Plot data with a scatter plot.  This will be as simple as entering two lists of numbers into your TI-83 and pressing a few buttons, just as for histograms or box plots.  Or, if you are doing plots by hand you will have to first choose an appropriate axis scale and then plot the points.  You should also be able to describe overall patterns in scatter diagrams and suggest tentative models that summarize the main features of the relationship, if any.

                        Use the TI-83 to calculate the correlation coefficient.  We will have to use the regression function STAT CALC LinReg(ax+b) to calculate the correlation, r.  First, you will have to run the DiagnosticOn command, which you access through the CATALOG (2nd 0); otherwise r is not displayed.  If you press ENTER right after the STAT CALC LinReg(ax+b) command, the calculator assumes your lists are in columns L1 and L2; otherwise you will type where they are, for example STAT CALC LinReg(ax+b) L2, L3.

                        Interpret the correlation coefficient.  You should know the range of the correlation coefficient (-1 to +1) and what a "typical" diagram looks like for various values of the correlation coefficient.  Again, page 126 is your guide.  You should recognize some of the things the correlation coefficient does not measure, such as the strength of a non-linear pattern.
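For reference, the correlation coefficient that LinReg(ax+b) reports can be computed directly from its definition.  The sketch below is a plain Python version with made-up lists; the function name correlation is my own.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # points exactly on a rising line
print(correlation([1, 2, 3], [3, 2, 1]))        # points exactly on a falling line

# A perfect but non-linear (parabolic) pattern: y is the square of x.
xs = [-2, -1, 0, 1, 2]
print(correlation(xs, [x * x for x in xs]))
```

The last example shows what the correlation coefficient does not measure: the points follow a curve perfectly, yet r is zero because there is no linear trend.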

Reading: Section 2.2.

Day 9

Activity: Outlier effects on correlation.  The dataset we will explore today has 7 data points.  Plot them and calculate the correlation coefficient.

Add an eighth point in three different places and for each new dataset, recalculate the correlation coefficient.

Summarize the effect of outliers in a paragraph.

(You may use any extra time today to discuss Presentation 1 in your groups.)  Homework 2 due.


Goals:     Understand the impact of outliers on correlation.

Skills:

                        Interpret the correlation coefficient.  You should recognize how outliers influence the magnitude of the correlation coefficient.  One simple way to observe the effects of outliers is to calculate the correlation coefficient with and without the outlier in the dataset and compare the two values.  If the values vary greatly (this is a judgment call) then you would say the outlier is "influential".
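The with-and-without comparison can be sketched in Python as follows.  The seven base points and the two candidate eighth points are made up for illustration; they are not the dataset used in class.

```python
import math

def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

base = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (6, 7), (7, 8)]  # made-up data
r_base = correlation(base)
r_off_trend = correlation(base + [(20, 0)])   # far away and off the trend
r_on_trend = correlation(base + [(20, 21)])   # far away but on the trend

print(round(r_base, 3), round(r_off_trend, 3), round(r_on_trend, 3))
```

A single far-away point that breaks the pattern can change the sign of r, while a far-away point that follows the pattern pushes r closer to 1.  Whether a change counts as "influential" remains a judgment call.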

Reading: Section 2.3.

Day 10

Activity: Using the Olympic data, fit a regression line to predict the 2004 and 2008 race results.

Goals:     Practice using regression with the TI-83.  We want the regression equation, the regression line superimposed on the plot, the correlation coefficient, and we want to be able to use the line to predict new values.

Skills:

                        Fit a line to data.  This may be as simple as 'eyeballing' a straight line to a scatter plot.  However, to be more precise, we will use least squares, STAT CALC LinReg(ax+b) on the TI-83, to calculate the coefficients, and VARS Statistics EQ RegEQ to type the equation in the Y= menu.  You should also be able to sketch a line onto a scatter plot (by hand) by knowing the regression coefficients.

                        Interpret regression coefficients.  Usually, we want to only interpret slope, and slope is best understood by examining the units involved, such as inches per year or miles per gallon, etc.  Because slope can be thought of as "rise" over "run", we are looking for the ratio of the units involved in our two variables.  More precisely, the slope tells us the change in the response variable for a unit change in the explanatory variable.  We don't typically bother interpreting the intercept, as zero is often outside of the range of experimentation.

                        Estimate/predict new observations using the regression line.  Once we have calculated a regression equation, we can use it to predict new responses.  The easiest way to use the TI-83 for this is to TRACE on the regression line.  You may need to use up and down arrows to toggle back and forth from the plot to the line.  You may also just use the equation itself by multiplying the new x-value by the slope and adding the intercept.  (This is exactly what TRACE is doing.)
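Here is a minimal Python sketch of what LinReg(ax+b) and TRACE are doing, using the usual least squares formulas.  The data are made up (they are not the Olympic results), and the names fit_line and predict are my own; a is the slope and b is the intercept, matching the calculator's labels.

```python
def fit_line(xs, ys):
    """Least squares slope a and intercept b for the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x  # the line passes through (mean_x, mean_y)
    return a, b

def predict(x_new, a, b):
    """Multiply the new x-value by the slope and add the intercept."""
    return a * x_new + b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # made-up data on the line y = 2x + 1
print(a, b, predict(5, a, b))
```

The slope a is read in units of y per unit of x.  Predictions for x-values far outside the range of the data deserve extra caution, since nothing in the data tells us the line continues that far.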

Reading: Section 2.3.

Day 11

Activity: Revisit outliers dataset, adding regression lines.  Plot the data again and calculate the regression line.

Add an eighth point in three different places and for each new dataset, recalculate the regression line.

Summarize the effect of outliers in a paragraph.

Goals:     Practice using regression with the TI-83.  We want the regression equation, the regression line superimposed on the plot, the correlation coefficient, and we want to be able to use the line to predict new values.

Skills:

                        Understand the limitations and strengths of linear regression.  Quite simply, linear regression should only be used with scatter plots that are roughly linear in nature.  That seems obvious.  However, there is nothing that prevents us from calculating the numbers for any data set we can input into our TI-83's.  We have to realize what our data looks like before we calculate the regression; therefore a scatter plot is essential.  In the presence of outliers and non-linear patterns, we should avoid drawing conclusions from the fitted regression line.

Reading: Sections 2.4 and 2.5.

Day 12

Activity: Correlation/Regression summary.  U. S. population example.  Alternate regression models.  Homework 3 due.

1)&n