Day By Day Notes for PBIS 189

Spring 2006

 

Day 1

Activity: Go over syllabus.  Take roll.  Overview examples: Gilbert trial, election polls, spam filters

Goals:     Review course objectives: collect data, summarize information, make inferences.

Reading: To The Student, pages xx-xxv

 

Day 2

 

Activity: Video 1 – Overview of Statistics, Discussion of variables and graphs.

Goals:     Get a feel for what questions we answer with statistics.  Begin graphical summaries (describing data with pictures).

Skills:

• Identify types of variables.  To choose the proper graphical displays, it is important to be able to differentiate between Categorical and Quantitative (or Numerical) variables.

• Be familiar with types of graphs.  To graph categorical variables, we use bar graphs or pie graphs.  To graph numerical variables, we use histograms, stemplots, or CUMPLOT (TI-83 program).  In practice, most of our variables will be numerical, but it is still important to choose the right display.

Reading: Chapter 1 (Skip Time Plots)

 

Day 3

Activity: Use the monarchs dataset to create the histograms, stemplots, and cumplots for the variable "years reigned" separately for the Saxon Rulers (829 to 1066), the rulers from William I to Henry VI (1066 to 1471), the rulers from Edward IV to Charles I (1461 to 1649), and the rulers from Charles II to present (1660 to 1998). Compare and interpret the graphs.  Identify shape, center, and spread.
Useful commands for the calculator:
    
STAT EDIT (use one of the lists to enter data, L1 for example; the other L's can be used too)
    
2nd STATPLOT 1 On (Use this screen to designate the plot settings.  You can have up to three plots on the screen at once.  For now we will only use one at a time.)
    
ZOOM 9 This command centers the window around your data.
    
CUMPLOT This program, which I wrote, plots the sorted data and "stacks" them up.

Goals:     Be able to use the calculator to make a histogram or a cumplot.  Be able to make a stemplot by hand.

Skills:

• Summarize data into a frequency table.  The easiest way to make a frequency table is to TRACE the boxes in a histogram and record the classes and counts.  You can control the size and number of the classes with Xscl and Xmin in the WINDOW menu.  The decision as to how many classes to create is arbitrary; there isn't a "right" answer.  One popular suggestion is to try the square root of the number of data values.  For example, if there are 25 data points, use 5 intervals.  If there are 50 data points, try 7 intervals.  This is a rough rule; you should experiment with it.  The TI-83 has a rule for doing this; I do not know what its rule is.  You should experiment by changing the interval width and seeing what happens to the diagram.

• Use the TI-83 to create an appropriate histogram or cumplot.  STAT PLOT is our main tool for viewing distributions of data.  Histograms are common displays, but have flaws; the choice of class width is troubling as it is not unique.  The cumplot is more reliable, but less common.  For interpretation purposes, remember that in a histogram tall boxes represent places with lots of data, while in a cumulative plot those same high-density data places are steep.

• Create a stemplot by hand.  The stemplot is a convenient manual display; it is most useful for small datasets, but not all datasets make good stemplots.  Choosing the "stem" and "leaves" to make reasonable displays will require some practice.  Some notes for proper choice of stems: if you have many empty rows, you have too many stems.  Move one column to the left and try again.  If you have too few rows (all the data is on just one or two stems) you have too few stems.  Move to the right one digit and try again.  Some datasets will not give good pictures for any choice of stem, and some benefit from splitting or rounding (see the example in the text).

• Describe shape, center, and spread.  From each of our graphs, you should be able to make general statements about the shape, center, and spread of the distribution of the variable being explored.
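For anyone who wants to check a frequency table away from the calculator, here is a minimal sketch in Python (Python is not part of this course, and the function name frequency_table is my own invention).  It assumes equal-width classes and uses the square-root suggestion when no number of classes is given; the data values are made up for illustration and are not the monarchs dataset.

```python
import math

def frequency_table(data, num_classes=None):
    """Count how many values fall in each equal-width class."""
    if num_classes is None:
        # Square-root rule of thumb: about sqrt(n) classes.
        num_classes = max(1, round(math.sqrt(len(data))))
    low, high = min(data), max(data)
    width = (high - low) / num_classes or 1  # avoid zero width if all values are equal
    counts = [0] * num_classes
    for x in data:
        # The maximum value goes in the last class rather than starting a new one.
        index = min(int((x - low) / width), num_classes - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(num_classes)]

# Made-up values for illustration only (not the monarchs dataset).
years = [2, 5, 9, 13, 13, 17, 20, 22, 24, 25, 33, 35, 35, 38, 50, 56]
for left, right, count in frequency_table(years):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Passing a different num_classes plays the same role as changing Xscl in the WINDOW menu: the counts, and so the shape of the histogram, change with the class width.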

Reading: Chapter 1 (Skip Time Plots)

 

Day 4

Activity: Video 2 – Lightning Research.  Dance Fever example.
To calculate our summary statistics, we will use 1-Var Stats (which uses List 1) or, for example, 1-Var Stats L2 for List 2.  There are two screens of output; we will be mostly concerned with the mean x-bar and the standard deviation Sx on screen one, and the five-number summary on screen two.

Goals:     Observe the creation and interpretation of graphical displays in practice.  Compare numerical measures of center.

Skills:

• Understand the effect of outliers on the mean.  The mean (or average) is unduly influenced by outlying (unusual) observations.  Therefore, knowing when your distribution is skewed is helpful.

• Understand the effect of outliers on the median.  The median is almost completely unaffected by outliers.  For technical reasons, though, the median is not as common in scientific applications as the mean.
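A quick way to see both effects is to compute the mean and the median with and without one unusual value.  The sketch below uses Python's standard statistics module; the numbers are made up purely for illustration.

```python
from statistics import mean, median

# Made-up values for illustration only.
values = [3, 4, 5, 6, 7]
with_outlier = values + [100]

print(mean(values), median(values))              # 5 and 5
print(mean(with_outlier), median(with_outlier))  # about 20.8 and 5.5
```

One outlying value moves the mean from 5 to about 20.8, while the median only moves from 5 to 5.5.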

Reading: Chapter 2

 

Day 5

Activity: Use the monarchs dataset to calculate the mean, the standard deviation, the 5-number summary, and the associated boxplot for the variable "years reigned" separately for the Saxon Rulers (829 to 1066), the rulers from William I to Henry VI (1066 to 1471), the rulers from Edward IV to Charles I (1461 to 1649), and the rulers from Charles II to present (1660 to 1998).
     Compare these measures with the corresponding histogram and cumulative plot.  Note the similarities (where the data values are dense, and where they are sparse) but especially note the differences.  The boxplots and numerical measures cannot describe shape.  The histograms are hard to use to compare two lists.  The stem and leaf is difficult to modify.
     Answer these questions:
1)  Has the variable "years reigned" changed over time?
2)  How does a single case affect the calculator's routines?
3)  What information does the boxplot disguise?

Goals:     Summarize data with numerical measures and boxplots.  Compare these new measures with the histograms, stemplots, and cumplots you made on Day 3.

Skills:

• Use the TI-83 to calculate summary statistics.  Calculating may be as simple as entering numbers into your calculator and pressing a button.  Or, if you are doing some things by hand, you may have to organize information the correct way, such as listing the numbers from low to high.  On the TI-83, the numerical measures are accessed through the 1-Var Stats function in the STAT CALC menu.  Please get used to using the statistical features of your calculator to produce the mean.  While I know you can calculate the mean by simply adding up all the numbers and dividing by the sample size, you will not be in the habit of using the full features of your machine, and later on you will be 'missing the boat'.

• Compare several lists of numbers using boxplots.  For two lists, the best simple approach is the back-to-back stemplot.  For more than two lists, I suggest trying boxplots, side-by-side, or stacked.  At a glance, then, you can assess which lists have typically larger values or more spread out values, etc.

• Understand boxplots.  You should know that the boxplots for some lists don't tell the interesting part of those lists.  For example, boxplots do not describe shape; you can only see where the quartiles are.  Alternatively, you should know that the boxplot can be a very good first quick look.
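The five numbers behind a boxplot can also be worked out by hand or in a few lines of code.  The Python sketch below is only an illustration: it assumes the quartiles are the medians of the lower and upper halves of the sorted list (leaving out the middle value when the list has an odd length), which is the usual textbook convention; other software may use slightly different quartile rules.  The function name and the data are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]        # values below the median position
    upper = values[n - half:]    # values above the median position
    return (values[0], median(lower), median(values), median(upper), values[-1])

# Made-up values for illustration only (needs at least two values).
reigns = [1, 3, 6, 9, 10, 13, 19, 22, 24, 35, 56]
print(five_number_summary(reigns))   # (1, 6, 13, 24, 56)
```

Notice that very differently shaped lists can share the same five numbers, which is why the boxplot cannot describe shape.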

Reading: Chapter 2

 

Day 6

Activity: Create the following lists:
1)  A list of 10 numbers that has only one number below the mean.
2)  A list of 10 numbers that has the standard deviation greater than the mean.
3)  A list of 10 numbers that has a standard deviation of zero.
For your fourth list, start with any 21 numbers.  Find a number N such that 14 of the numbers in your list are within N of the average.  For example, pick a number N (say 4), calculate the average plus 4, the average minus 4, and count how many numbers in your list are between those two values.  If the count is less than 14, try a larger number for N (bigger than 4).  If the count is more than 14, try a smaller number for N (smaller than 4).
Finally, compare the standard deviation to the Interquartile Range (IQR = Q3 - Q1).

Goals:     Interpret standard deviation as a measure of spread.

Skills:

• Understand standard deviation.  At first, standard deviation will seem foreign to you, but I believe that it will make more sense the more you become familiar with it.  In its simplest terms, the standard deviation is a non-negative number that measures how "wide" a dataset is.  One common interpretation is that the range of a dataset is about 4 standard deviations.  Another interpretation is that the standard deviation is roughly ¾ times IQR.  Eventually we will use the standard deviation in our calculations for statistical inference; until then, this measure is just another summary statistic, and getting used to this number is your goal.  The normal curve of Chapter 3 will further help us understand standard deviation.
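The trial-and-error search for N in the fourth list can be sketched in Python as follows.  This is only an illustration with a made-up list of 21 numbers; count_within is a hypothetical helper name, not a calculator command.

```python
from statistics import mean, stdev

def count_within(data, n):
    """Count the values that lie within n of the average."""
    center = mean(data)
    return sum(1 for x in data if abs(x - center) <= n)

# Made-up list of 21 numbers for illustration only.
numbers = [2, 3, 5, 8, 9, 11, 12, 14, 15, 17, 18,
           20, 21, 23, 26, 27, 30, 33, 38, 44, 50]

for n in (4, 8, 12.5, 16):
    print(n, count_within(numbers, n))   # too few, too few, 14, too many

print("standard deviation:", stdev(numbers))
```

For this particular list, the N that captures 14 of the 21 values lands in the neighborhood of the standard deviation, which is one way to get a feel for what the standard deviation measures.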

Reading: Chapter 3

 

Day 7

Activity: Review Homework 1.  Video 3 – Boston Beanstalks.  Introduce the TI-83's normal calculations.

Goals:     Introduce normal curve.  Use TI-83 in place of the standard normal table in the text.

Skills:

• Use the TI-83 to find areas under the normal curve.  When we have a distribution that can be approximated with the bell-shaped normal curve, we can make accurate statements about frequencies and percentages by knowing just the mean and the standard deviation of the data.  Our TI-83 has two functions, normalcdf( and invNorm(, which allow us to calculate these percentages more easily and more accurately than the table in the text.  We use normalcdf( when we want the percentage as an answer, and we use invNorm( when we already know the percentage but not the value that gives that percentage.
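For comparison, the same two kinds of calculation can be sketched in Python with the standard library's NormalDist class.  The wrapper names normalcdf and inv_norm below are my own, chosen to mirror the calculator's functions; they are not part of Python or of the TI-83.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mean=0.0, sd=1.0):
    """Value that has the given area (as a proportion) to its left."""
    return NormalDist(mean, sd).inv_cdf(area)

print(normalcdf(-1, 1))    # about 0.6827: we supply values, we get a percentage
print(inv_norm(0.975))     # about 1.96: we supply a percentage, we get a value
```

The first call answers "what percentage?" and the second answers "what value?", which is the same distinction as between normalcdf( and invNorm( on the calculator.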

Reading: Chapter 3

 

Day 8

Activity: Practice normal calculations.
1)  Suppose SAT scores are distributed normally with mean 800 and standard deviation (sd) 100.  Estimate the chance that a randomly chosen score will be above 720.  Estimate the chance that a randomly chosen score will be between 800 and 900.  The top 20% of scores are above what number?  (This is called the 80th percentile.)
2)  Find the Interquartile Range (IQR) for the standard normal (mean 0, sd 1).  Compare this to the standard deviation of 1.
3)  Women aged 20 to 29 have normally distributed heights with mean 64 and sd 2.7.  Men have mean 69.3 with sd 2.8.  What percent of women are taller than the average man, and what percent of men are taller than the average woman?
4)  Pretend we are manufacturing fruit snacks, and that the average weight in a package is 0.92 ounces with sd 0.05.  What should we label the net weight on the package so that only 5% of packages are "underweight"?
5)  Suppose that your average commute time to work is 20 minutes, with an sd of 2 minutes.  What time should you leave home to arrive at work on time at 8:00?  (You may have to decide a reasonable value for the chance of being late.)

Goals:     Master normal calculations.  Realize that summarizing using the normal curve is the ultimate reduction in complexity, but only applies to data whose distribution is actually bell-shaped.

Skills:

• Memorize the 68-95-99.7 rule.  While we do rely on our technology to calculate areas under normal curves, it is convenient to have some of the values committed to memory.  These values can be used as rough guidelines; if precision is required, you should use the TI-83 instead.  I will assume you know these numbers by heart when we encounter the normal numbers again in chapters 10 and 13 through 19.

• Understand that summarizing with just the mean and standard deviation is a special case.  We have progressed from pictures like histograms to summary statistics like medians, means, etc. to finally summarizing an entire list with just the mean and the standard deviation.  However, this last step in our summarization only applies to lists whose distribution resembles the bell-shaped normal curves.  If the data's distribution is skewed, or has any other shape, this level of summarization is incomplete.  Also, it is important to realize that these calculations are only approximations.
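The 68-95-99.7 numbers are rounded areas under the standard normal curve.  As a check, this short Python sketch (using the standard library's NormalDist; area_within is a hypothetical helper name) computes the area within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

def area_within(k):
    """Area under the standard normal curve within k sd of the mean."""
    standard = NormalDist(0, 1)
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {area_within(k):.4f}")   # 0.6827, 0.9545, 0.9973
```

The memorized values 68, 95, and 99.7 are these areas rounded to convenient percentages, and they apply only when the distribution is close to bell-shaped.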

Reading: Chapters 1 through 3

 

Day 9

Activity: Presentations.  Graphical (Chapter 1) and Numerical (Chapter 2) Summaries
Collect or find some data; the quality of the data is not important for this project.  Use 3 to 5 lists of data; make sure you have enough data so that your summaries are meaningful, say at least 20 cases.  Summarize your data using both graphical and numerical summaries.  Also, make sure you have at least one categorical variable and at least one numerical variable.

Reading: Chapters 1 through 3

 

Day 10

Activity: Exam 1.  This first exam will cover graphical summaries (pictures), numerical summaries (summary calculations) and normal curve calculations (areas under the bell curve).  Some of the questions will be multiple choice.  Others will require you to show your worked-out solution.  Chapter reviews are an excellent source for studying for the exams.  Don't forget to review your class notes and recall what we saw in the videos.

 

Day 11

Activity: 1) Using the monarchs data, plot "years reigned" versus "death age" on your calculator.  Then guess what the correlation coefficient might be, using the sample diagrams on page 92 to guide you.  Finally, using your calculator, calculate the actual value of the correlation coefficient and compare it to your guess.
2) Outlier effects.  With the dataset I give you in class, add an eighth point in three different places and observe how the correlation coefficient changes.

Goals:     Display two variables and measure (and interpret) linear association using the correlation coefficient.

Skills:

• Plot data with a scatterplot.  This will be as simple as entering two lists of numbers into your TI-83 and pressing a few buttons, just as for histograms or boxplots.  Or, if you are doing plots by hand, you will have to first choose an appropriate axis scale and then plot the points.  You should also be able to describe overall patterns in scatter diagrams and suggest tentative models that summarize the main features of the relationship, if any.

• Use the TI-83 to calculate the correlation coefficient.  We will have to use the regression function STAT CALC LinReg(ax+b) to calculate correlation, r.  First, you will have to have run the DiagnosticOn command.  Access this command through the CATALOG (2nd 0).  If you press ENTER after the LinReg(ax+b) command, the calculator assumes your lists are in L1 and L2; otherwise you will type where they are, for example LinReg(ax+b) L2, L3.

• Interpret the correlation coefficient.  You should know the range of the correlation coefficient (-1 to +1) and what a 'typical' diagram looks like for various values of the correlation coefficient.  Again, page 92 is your guide.  You should recognize some of the things the correlation coefficient does not measure, such as the strength of a non-linear pattern.  You should also recognize how outliers influence the magnitude of the correlation coefficient.  One simple way to observe the effects of outliers is to calculate the correlation coefficient with and without the outlier in the dataset and compare the two values.  If the values vary greatly (this is a judgment call) then you would say the outlier is "influential".
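The with-and-without comparison can be sketched in Python as below.  The function computes the usual Pearson correlation coefficient r from its definition; the seven points and the added eighth point are made up for illustration and are not the dataset handed out in class.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up points for illustration only (not the class dataset).
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2, 1, 4, 3, 6, 5, 8]

r_without = correlation(xs, ys)               # about 0.90
r_with = correlation(xs + [20], ys + [0])     # about -0.27
print(r_without, r_with)
```

Here a single added point changes r from strongly positive to mildly negative, so by the judgment call described above that point would be called influential.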

Reading: Chapter 4

 

Day 12

Activity: Video 4 – Manatees.  Correlation summary.
1)  The variables can be entered in any order; correlation is a fact about a pair of variables.  This will be different when we get to regression; there, the order in which the variables are presented matters.
2)  We must have numerical variables to calculate correlation.  For categorical variables, we will use contingency tables, in Chapter 6.
3)  High correlation does not necessarily mean a straight line scatterplot.  US population growth is an example.
4)  Correlation is not resistant; the dataset from Day 11 showed that the placement of a single point in the scatterplot can greatly influence the value of the correlation.

Goals:     See scatterplots and correlation in practice.  Understand correlation's limitations and features.

Skills:

• Recognize the proper use of correlation, and know how it is abused.  Correlation measures straight line relationships.  Any departures from that model make the correlation coefficient less reliable as a summary measure.  Just as for the standard deviation and the mean, the correlation coefficient is affected by outliers.  Therefore, it is extremely important to be aware of data that is unusual.  Some 2-dimensional outliers are hard to detect with summary statistics; scatterplots are a must then.

Reading: Chapter 5

 

Day 13

Activity: 1)  Using the Olympic data, fit a regression line to predict the 2004 and 2008 race results.
2)  Revisit outliers dataset, adding regression lines.

Goals:     Practice using regression with the TI-83.  We want the regression equation, the regression line superimposed on the plot, the correlation coefficient, and we want to be able to use the line to predict new values.

Skills:

• Fit a line to data.  This may be as simple as 'eyeballing' a straight line to a scatter plot.  However, to be more precise, we will use least squares, STAT CALC LinReg(ax+b) on the TI-83, to calculate the coefficients, and VARS Statistics EQ RegEQ to type the equation in the Y= menu.  You should also be able to sketch a line onto a scatter plot (by hand) by knowing the regression coefficients.

• Interpret regression coefficients.  Usually, we want to only interpret slope, and slope is best understood by examining the units involved, such as inches per year or miles per gallon, etc.  Because slope can be thought of as "rise" over "run", we are looking for the ratio of the units involved in our two variables.  More precisely, the slope tells us the change in the response variable for a unit change in the explanatory variable.  We don't typically bother interpreting the intercept, as zero is often outside of the range of experimentation.

• Estimate/predict new observations using the regression line.  Once we have calculated a regression equation, we can use it to predict new responses.  The easiest way to use the TI-83 for this is to TRACE on the regression line.  You may need to use up and down arrows to toggle back and forth from the plot to the line.  You may also just use the equation itself by multiplying the new x-value by the slope and adding the intercept.  (This is exactly what TRACE is doing.)

• Understand the limitations and strengths of linear regression.  Quite simply, linear regression should only be used with scatterplots that are roughly linear in nature.  That seems obvious.  However, there is nothing that prevents us from calculating the numbers for any data set we can input into our TI-83s.  We have to realize what our data looks like before we calculate the regression; therefore a scatter plot is essential.  In the presence of outliers and non-linear patterns, we should avoid drawing conclusions from the fitted regression line.
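To make the arithmetic behind fitting and predicting concrete, here is a minimal Python sketch of least squares for a line y = ax + b, using the standard formulas for the slope and intercept.  The helper names fit_line and predict are hypothetical, and the (year, time) pairs are made up for illustration; they are not the Olympic data.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict(a, b, new_x):
    """Multiply the new x-value by the slope and add the intercept."""
    return a * new_x + b

# Made-up (year, time in seconds) pairs for illustration only.
years = [1988, 1992, 1996, 2000]
times = [50.0, 49.5, 49.2, 48.7]

a, b = fit_line(years, times)
print(a)                      # slope: about -0.105 seconds per year
print(predict(a, b, 2004))    # predicted time for 2004: about 48.3 seconds
```

The slope carries the units "seconds per year", and the intercept (the fitted time in year 0) is far outside the data, which is why we usually do not interpret it.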

Reading: Chapter 5

 

Day 14

Activity: Try to summarize and predict the population growth in the US.  Using the census data, see if any of the other regression functions in the STAT CALC menu are good models. 

Goals:     Explore non-linear regressions on the TI-83.

Skills:

• Effectively model using non-linear regression functions.  When we have a relationship that is non-linear, we try other models.  Because straight lines are easy for us to understand (we are accustomed to them), the coefficients have meaning.  Some of the other functions available to you are also interpretable, with some familiarity (which I am not expecting from you), but others have coefficients that are uninterpretable.  Our main use of these alternate functions is to see the fitted model on the scatterplot.  (We add them to the scatterplot in the same way as for linear regression: VARS Statistics EQ RegEQ from the Y= menu.)

• Understand that a high value of r² is not necessarily a good fit.  We have seen that when r² = 1, we have a perfect fit.  So, you might assume that values very close to 1 are indicators of very good fits, but this is not necessarily the case.  The population data should show us some high values of r² that are poor predictive models.  Again, we need the scatterplot along with the equation to make proper conclusions.
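A small made-up example shows how a high r² can hide a poor model.  The Python sketch below fits a straight line to points that lie exactly on a curve (y = x squared, chosen only for illustration; it is not the census data) and then looks at the leftovers, that is, the observed values minus the fitted values.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def r_squared(xs, ys):
    """Square of the correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]          # curved on purpose

a, b = fit_line(xs, ys)
leftovers = [y - (a * x + b) for x, y in zip(xs, ys)]
print(r_squared(xs, ys))           # about 0.95
print(leftovers)                   # positive at both ends, negative in the middle
```

Even with r² near 0.95, the line sits below the data at both ends and above it in the middle, so predictions beyond the last point would fall increasingly short.  The scatterplot reveals this at once; the number alone does not.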

Reading: Chapter 6

 

Day 15

Activity: Video 5 – Smoking.  Introduce tables of categorical data.

Goals:     Introduce association for categorical variables.  Explore Simpson's paradox.

Skills:

• Understand that cause and effect is difficult to establish.  The slogan is "Association is not the same as Causation."  We will encounter this many times throughout the rest of the course.  In the next set of material (Chapters 7 and 8) we will discuss ways to produce data from which we can draw conclusions about causation.

• Recognize Simpson's paradox.  Sometimes when data is summarized over several sub-categories, an association can be reversed.  It seems contrary to good common sense, but it is actually the effect of a lurking variable, and the phenomenon is known as Simpson's paradox.  You should be able to recognize situations where this paradox might occur.  Not all tables of categorical variables will exhibit this paradox; the tables must be comparing rates over several groups.
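The reversal is easiest to believe after seeing it happen with numbers.  The Python sketch below uses counts that I invented solely to produce the paradox (two hypothetical hospitals, with patients split by how severe their condition is); they are not real data.

```python
# Invented counts for illustration only: (successes, patients) per group.
hospital_a = {"mild": (18, 20), "severe": (40, 100)}
hospital_b = {"mild": (80, 100), "severe": (6, 20)}

def rate(successes, total):
    """Success rate within one group."""
    return successes / total

def overall(groups):
    """Success rate after pooling all groups together."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return successes / total

for group in ("mild", "severe"):
    print(group, rate(*hospital_a[group]), rate(*hospital_b[group]))
print("overall", overall(hospital_a), overall(hospital_b))
```

Hospital A has the higher success rate in each group (0.90 versus 0.80, and 0.40 versus 0.30), yet Hospital B looks better overall (about 0.72 versus 0.48).  The lurking variable is severity: in this invented example Hospital A treats mostly severe cases.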

Reading: Chapter 6

 

Day 16

Activity: Expected Tables.

Goals:     Develop intuition for when the observed and expected tables are too different.

Skills:

• Create the table of expected counts.  The primary method of analyzing categorical tables is comparing the observed data to a table of expected counts.  (This material comes from Chapter 20, but I will not expect you to master Chapter 20.)

• Recognize when an association is present.  When two categorical variables are associated (much like when two numerical variables are correlated) we detect this with the χ² test.  I will show you a way to decide if the differences in the tables are too great (STAT TESTS χ²-Test.  You must have the observed table in a matrix.  The expected table will be stored in another matrix.  If p < .05, we conclude the two tables are quite different.)
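Here is a minimal Python sketch of how the expected table and the χ² statistic are usually computed: each expected count is (row total × column total) / grand total, and the statistic adds up (observed − expected)² / expected over all cells.  The function names are hypothetical, the observed table is made up, and the p-value shortcut shown is valid only for a 2-by-2 table (1 degree of freedom); larger tables need the calculator or a statistics package.

```python
import math

def expected_counts(observed):
    """Expected count = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

def p_value_2x2(statistic):
    """Upper-tail area for 1 degree of freedom; 2-by-2 tables only."""
    return math.erfc(math.sqrt(statistic / 2))

# Made-up observed table for illustration only.
observed = [[10, 20], [30, 40]]
print(expected_counts(observed))           # [[12.0, 18.0], [28.0, 42.0]]
print(chi_square(observed))                # about 0.79
print(p_value_2x2(chi_square(observed)))   # about 0.37, well above .05
```

In this made-up table the observed and expected counts are close, the statistic is small, and p is far above .05, so we would not conclude that the two variables are associated.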

Reading: Chapters 4 through 6

 

Day 17

Activity: Presentations.  Regression/Correlation (Chapters 4 and 5)
Pick one of the 50 states.  Predict the population in the year 2010 using a regression function (not necessarily linear though).  Describe how you decided upon your model, and explain how good you think your prediction is.

Reading: Chapters 4 through 6

 

Day 18

Activity: Exam 2.  This second exam covers scatterplots, correlation, regression, and associations in categorical data.  Some of the questions will be multiple choice.  Others will require you to show your work.  Chapter reviews are an excellent source for studying for the exams.  Don't forget to review your class notes and recall what we saw in the videos.

 

Day 19

Activity: Video 6 – Frito Lay.  History of polls.

Goals:     Introduce sampling.  Identify biases.  Explore why non-random samples are not trustworthy.

Skills:

• Understand the issues of bias.  We seek representative samples.  The "easy" ways of sampling, samples of convenience and voluntary response samples, may or may not produce good samples, and because we don't know the chances of subjects being in such samples, they are poor sampling methods.  Even when probability methods are used, biases can spoil the results.  Avoiding bias is our chief concern in designing surveys.

• Huge samples are not necessary.  One popular misconception about sampling is that if the population is large, then we need a proportionately large sample.  This is just not so.  My favorite counter-example is our method of tasting soup.  To find out if soup tastes acceptable, we mix it up, then sample from it with a spoon.  It doesn't matter to us whether it is a small bowl of soup, or a huge vat, we still use only a spoonful.  The situation is the same for statistical sampling; we use a small "spoon", or sample.  The fundamental requirement though is that the "soup" (our population) is "well mixed" (as in a simple random sample – see Day 20).
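The soup argument can be tried out with a small simulation.  The Python sketch below is only an illustration under made-up assumptions: 30% of each population would answer "yes", every sample is a simple random sample of 100, and we repeat the sampling 500 times to see how much the sample percentage bounces around.  The function name and all the settings are hypothetical choices of mine.

```python
import random
from statistics import stdev

def spread_of_sample_proportions(population_size, sample_size=100,
                                 share=0.3, repetitions=500, seed=1):
    """Standard deviation of sample proportions across repeated SRSs."""
    rng = random.Random(seed)
    cutoff = int(share * population_size)   # items numbered below cutoff "say yes"
    proportions = []
    for _ in range(repetitions):
        chosen = rng.sample(range(population_size), sample_size)
        proportions.append(sum(1 for i in chosen if i < cutoff) / sample_size)
    return stdev(proportions)

small_pot = spread_of_sample_proportions(10_000)
huge_vat = spread_of_sample_proportions(1_000_000)
print(small_pot, huge_vat)   # the two spreads come out nearly the same
```

The population is 100 times larger in the second call, but the same "spoon" of 100 gives sample percentages that vary by about the same amount in both cases.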

Reading: Chapter 7

 

Day 20

Activity: Creating random samples.  We will use three methods of sampling today: dice, Table B in our book, and our calculator.  To make the problem feasible, we will only use a population of size 6.  (I know this is unrealistic in practice, but the point today is to see how randomness works, and hopefully trust that the results extend to larger problems.)  Pretend that the items in our population (perhaps they are people) are labeled 1 through 6.  For each of our methods, you will have to decide in your group what to do with "ties".  Keep in mind the goal of simple random sampling: at each stage, each remaining item has an equal chance to be the next item selected.

Using dice, generate a sample of three people.  Repeat 20 times.

Using Table B, starting at any haphazard location, select three people.  Repeat 20 times.

Using your TI-83, select three people.  The command randInt(2,4,5) will produce 5 integers between 2 and 4, inclusive, for example.

Your group should have drawn 60 samples at the end.  Keep careful track of which samples you selected; record your results in order, as 125 or 256, for example.  (125 would mean persons 1, 2, and 5 were selected.)  We will pool the results of everyone's work together on the board.
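The same procedure can be imitated in Python as a rough sketch.  It assumes that a repeated label (a "tie") is simply ignored and another number is drawn, which is one reasonable choice for your group; draw_srs is a hypothetical helper name, and Python's randint plays the role of randInt( here.

```python
import random
from itertools import combinations

def draw_srs(population_size=6, sample_size=3, rng=random):
    """Draw labels one at a time, ignoring any label already chosen."""
    chosen = []
    while len(chosen) < sample_size:
        label = rng.randint(1, population_size)   # like randInt(1,6)
        if label not in chosen:                   # a repeat ("tie") is skipped
            chosen.append(label)
    # Record the sample in order, for example "125".
    return "".join(str(label) for label in sorted(chosen))

print(draw_srs())    # the result varies from run to run

# Every possible sample of 3 people from 6, recorded the same way.
all_samples = ["".join(map(str, combo)) for combo in combinations(range(1, 7), 3)]
print(len(all_samples))   # 20 possible samples
```

With 20 possible samples and many draws pooled on the board, each of the 20 should turn up about equally often if the method really produces simple random samples.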

Goals:     Gain practice taking random samples.  Understand what a simple random sample is.  Become familiar with randInt(.  Accept that the calculator is random.

Skills:

• Know the definition of a Simple Random Sample (SRS).  Simple Random Samples can be defined in two ways:
1)  An SRS is a sample where, at each stage, each remaining item has an equal chance to be the next item selected.
2)  A scheme where every possible sample has an equal chance to be the sample results in an SRS.

• Select an SRS from a list of items.  The TI-83 command randInt( will select numbered items from a list randomly.  If a number selected is already in the list, ignore that nu