Day By Day Notes for MATH 201

Spring 2007

Day 1

Activity:     Go over syllabus.  Take roll.  Overview example: Kristin Gilbert trial.

Goals:         Review course objectives: collect data, summarize information, and make inferences.

I have divided this course into three "units".  Unit 1 (Days 2 through 14) is about summarizing data.  Unit 2 (Days 15 through 29) is about sampling and probability.  Unit 3 (Days 30 through 42) is about statistical inference.  Throughout the course, we will also focus on the abuses people often make of statistics and statistical methods.  Sometimes these abuses are unintentional.  My hope is that at the end of the course, you will have an appreciation for the "tricks" that people use to lie to you using numbers.

I use the Gilbert trial example because it demonstrates all three of the course's main ideas in action.  The charts we see are examples of how to organize and summarize information that may be complicated by several variables or dimensions (Unit 1).  The argument about whether we would see results such as these if there really were no relationship between the variables is an example of the probability we will study in Unit 2.  And the trial strategy itself is what statistical inference is all about (Unit 3).  Many of you will encounter inference when you read professional journals in your field and the researchers use statistics to support their conclusions of improved treatments or to estimate a particular proportion or average.

I believe that to be successful in this course, you must READ the text carefully and work many practice problems.  The activities we will do in class will often be unrelated to the homework you practice and/or turn in for the homework portion of your grade; instead they will be for understanding of the underlying principles.  For example, on Day 17 we will simulate Simple Random Sampling by taking 60 samples from each group in the class.  This is something you would never do in practice, but which I think will demonstrate several lessons for us.  In these notes, I will try to point out to you when we're doing something to gain understanding, and when we're doing something to gain skills.

Each semester, I am disappointed with the small number of students who come to me for help outside of class.  I suspect some of you are embarrassed to seek help, or you may feel I will think less of you for not "getting it" on your own.  Personally, I think that if you are struggling and cannot make sense of what we are doing, and don't seek help, you are cheating yourself out of your own education.  I am here to help you learn statistics.  Please ask questions when you have them; there is no such thing as a stupid question.  Often other students have the same questions but are also too shy to ask them in class.  If you are still too shy to ask questions in class, come to my office hours or make an appointment.  Incidentally, when I first took statistics, I didn't understand it all on my own either, and I too didn't go to the instructor for help.  I also didn't get as high a grade as I could have!

I believe you get out of something what you put into it.  Very rarely will someone fail a class by attending every day, doing all the assignments, and working many practice problems; typically people fail by not applying themselves enough - either through missing classes, or by not allocating enough time for the material.  Obviously I cannot tell you how much time to spend each week on this class; you must all find the right balance for you and your life's priorities.  One last piece of advice: don't procrastinate.  I believe statistics is learned best by daily exposure.  Cramming for exams may get you a passing grade, but you are only cheating yourself out of understanding and learning.

Reading:    To The Student, pages xxxi-xxxiv.  Bring to class on Day 2 a list of numbers (the source is not relevant, but knowing where the numbers came from and what they represent will help when you try to explain them).

Day 2

Activity:     Discussion of variables and graphs.

Unit 1 is about summarization.  We will work on several types of summary:  graphical summaries of a list of numbers, graphical summaries of a list of characteristics, numerical summaries of a list of numbers, and the summaries for two variables.  Sometimes we will summarize situations with one or two numbers, called summary statistics.  Other times, we need more than that, such as a picture, or many summary statistics.  One of your tasks in this course is to gain enough experience to know what is enough summarization for a particular situation.

The first summaries we will work on are the "one-variable graphical techniques".  If you are working with a list of characteristics (attribute data or categorical data), you will need to use a bar graph, or a frequency table, or a pie graph.  If you are working with a list of numbers, then you have many more options: frequency tables, histograms, quantile plots, stem plots, box plots (later), to name a few.  Your goal in Unit 1 is to become adept at producing these displays, either by hand or with the calculator, and to interpret the main features from either your own creations or someone else's work.

In these notes, I will put the daily task in yellow background.

Your task today is, from your list of numbers, to communicate the important information to the person next to you.  (Work in pairs or groups.)  Specifically, make and interpret a frequency table and a histogram.  In your description to your neighbor, keep in mind these terms: symmetry, skew, center, spread, mode, outlier.  On your calculator, make sure that you try different window settings for your histogram.  Remember, you are trying to describe your data to your neighbor without having to tell them every single value; you are trying to point out the main features that you have identified.  (For example, deciding whether some list is "skewed" or "symmetric" will often be a matter of opinion; you need to look at many lists before you can make good judgment calls.)

For our next class day, Day 3, print out the Arizona Temps data set and bring your copy to class.

In these notes, I will put sections of computer commands in boxes, like this one.

Frequently we will use the TI-83 to make our work easier.  At first, you may think of the machines as burdens, confusing and intimidating.  I believe that if you stick with it and become experienced with the tools, you will discover that the calculators are indispensable for calculating statistics.  I will show you what I can about the TI-83's, but it is up to you to practice using them.  I cannot do that part for you, of course.

Generally, we have three chores to perform before our machine will show us the graphical display we want.  We must: 1) Enter the data into the calculator.  2) Choose the right options for the display we want.  And 3) Set up the proper window settings.  The commands to do these activities on the calculator are:

1) STAT EDIT  Use one of the lists to enter data, L1 for example; the other L's can be used too.  The L's are convenient work lists.  At times, you may find that you want more meaningful names.  One way to do this is to store the list in a new named list after entering numbers.  The syntax for this is L1 -> NEWL, assuming the data was entered in L1 and you want the new name to be NEWL.

2) 2nd STATPLOT 1 On  Use this screen to designate the plot settings.  You can have up to three plots on the screen at once.  For histograms, we will only use one at a time.  Later, when we see box plots, we will make multiple displays.

3) ZOOM 9  This command centers the window "around" your data.  It is always a good idea to check the WINDOW settings afterward.  If you change any of the WINDOW settings, press GRAPH to see the changes.

Goals:  (In these notes, I will summarize each day's activity with a statement of goals for the day.)

                       Begin graphical summaries (describing data with pictures).  Be able to use the calculator to make a histogram.

Skills:  (In these notes, each day I will identify skills I believe you should have after working the day's activity, reading the appropriate sections of the text, and practicing exercises in the text.)

*                    Identify types of variables.  To choose the proper graphical displays, it is important to be able to differentiate between Categorical (or Qualitative) and Quantitative (or Numerical) variables.  Of course, it is also necessary to know exactly what a variable is.  Quite simply, a variable is an attribute about a collection of subjects.  As an example, for the students in this class, we could measure your height (a quantitative/numerical variable) or your major (a qualitative/categorical variable).

*                    Be familiar with types of graphs.  To graph categorical variables we use bar graphs or pie graphs.  To graph numerical variables, we use histograms, stem plots, or QUANTILE plots (using a TI-83 program we will explore on Day 3).  In practice, most of our variables will be numerical but it is still important to know the categorical displays.

*                    Summarize data into a frequency table and produce a histogram.  A frequency table is a list of class intervals (sometimes called bins or classes) and the number of data values in the class intervals.  You can make a frequency table yourself by just counting how many data values are in each interval, but the easiest way is to first use the TI-83 to make a histogram and then to TRACE over the boxes and record the classes and counts.  In fact, the histogram is just a "picture" of the frequency table.

*                    Know how to modify the TI-83 default histogram.  You can control the size and number of the classes with Xscl and Xmin in the WINDOW menu.  The decision as to how many classes to create is arbitrary; there isn't a "right" answer, or rather all choices of Xscl and Xmin are "right" answers.  You should experiment every time you use the TI-83 to make a histogram by changing the interval width Xscl and starting point Xmin to see what happens to the display.

*                    Know how to create and interpret graphs for categorical variables.  The two main graphs for categorical variables are pie graphs and bar charts.  Pie graphs are difficult to make by hand, but are popular on computer programs like Excel.  Bar charts are also common on spreadsheets.  Data represented by pie graphs and bar charts usually are expressed as percents of the whole; thus they add to 100 %.  The ordering for categories is arbitrary; therefore concepts such as skew and center make no sense.
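For readers who also work on a computer, the roles of Xmin and Xscl can be sketched in Python.  This is only an illustration, not part of the TI-83 workflow: the function names and the temperature values are made up, and the sketch assumes each class includes its left endpoint but not its right one.

```python
import math

def frequency_table(data, xmin, xscl):
    """Return (left endpoint, count) pairs for classes [left, left + xscl)."""
    # Index of the class each value falls into, counting from xmin.
    indices = [math.floor((x - xmin) / xscl) for x in data]
    table = []
    # Walk every class from the lowest to the highest one used,
    # so that empty classes show up with a count of zero.
    for k in range(min(indices), max(indices) + 1):
        table.append((xmin + k * xscl, indices.count(k)))
    return table

def print_histogram(table, xscl):
    """Print one row of '#' marks per class: a sideways text histogram."""
    for left, count in table:
        print(f"{left:>5} to <{left + xscl:<5} {'#' * count}")

temps = [61, 64, 65, 67, 70, 71, 71, 73, 78, 84, 95]  # made-up values

print_histogram(frequency_table(temps, xmin=60, xscl=10), 10)
print()
print_histogram(frequency_table(temps, xmin=55, xscl=5), 5)
```

The two printouts describe the same list, yet they look different.  That is the point of experimenting with the interval width and starting point.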

Reading:    Section 1.1. (Skip Time Plots and Time Series.)

Day 3

Activity:     Creating and interpreting stem plots, histograms, and quantile plots.

Use the Arizona Temps data set to practice creating the histograms, stem plots, and quantile plots for several lists.  Compare and interpret the graphs.  Identify shape, center, and spread.  Today's new methods, the stem plot and the quantile plot, are alternate ways to summarize a list of numbers.

Answer these questions about the Arizona Temps:

1)   Do any of the lists have outliers?  (There is not a firm definition of exactly what an outlier is.  You will have to develop some personal judgment; the best way is to look at many lists and displays.)

2)      What information does the stem plot show that the histogram hides?

3)      What information does the quantile plot show that the stem plot and histogram hide?

4)      Which display is most "trustworthy"?  That is, which one has the smallest likelihood of misleading you?

The stem plot is a hand technique, most useful for small (under 40 values) data sets.  It is basically a quick way to make a frequency chart, but always using class intervals based on the base 10 system.  This means the intervals will always be ten "units" wide, such as 10 to 19, or 0 to 9, or .00 to .09.  The "unit" chosen is called the stem, and the next digit after the stem is called the leaf.
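Although the stem plot is meant to be done by hand, a short Python sketch can make the stem and leaf bookkeeping concrete.  The function name, the leaf_unit parameter, and the sample values are all hypothetical, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_plot(data, leaf_unit=1):
    """Return the rows of a stem plot as strings.

    leaf_unit is the place value of the leaf digit (1 = ones, 10 = tens).
    Digits to the right of the leaf are simply dropped.
    Assumes non-negative whole numbers.
    """
    rows = defaultdict(list)
    for x in sorted(data):
        scaled = int(x // leaf_unit)       # drop digits beyond the leaf
        stem, leaf = divmod(scaled, 10)    # last digit is the leaf
        rows[stem].append(leaf)
    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

temps = [61, 64, 65, 67, 70, 71, 71, 73, 78, 84, 95]  # made-up values
for line in stem_plot(temps):
    print(line)
```

Each row is one class interval ten units wide, which is why the result looks like a histogram turned on its side.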

QUANTILE is a program I wrote that plots the sorted data in a list and "stacks" the values up.  This is known as a quantile plot.  Basically we are graphing the data value versus the rank, or percentile, in the data set.  The syntax is PRGM EXEC QUANTILE ENTER.  The program will ask you for the list where you've stored the data.  A and B are temporary lists used by the program, so if you have data in these lists already, store them in another list before executing.
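The idea behind a quantile plot can also be sketched in Python.  This is not the QUANTILE program itself; the function name, the sample values, and the choice of (rank - 0.5)/n as the percentile are assumptions made for illustration.

```python
def quantile_points(data):
    """Pair each sorted value with its rank, expressed as a fraction of the list."""
    n = len(data)
    # Assumed convention: the i-th smallest value (i = 1..n) gets (i - 0.5) / n.
    return [(x, (i - 0.5) / n) for i, x in enumerate(sorted(data), start=1)]

temps = [61, 64, 65, 67, 70, 71, 71, 73, 78, 84, 95]  # made-up values

# Each printed row is one point of the plot: data value, then rank fraction.
for value, fraction in quantile_points(temps):
    print(f"{value:>5}  {fraction:.2f}")
```

Where many values crowd together, the rank fraction climbs quickly while the data value barely changes.  Where the data are sparse, the data value changes a lot between consecutive ranks.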

Goals:         Be able to make and interpret a quantile plot, using the TI-83 program QUANTILE.  Be able to make and interpret a stem plot by hand.

Skills:

*                    Use the TI-83 to create an appropriate histogram or quantile plot.  STAT PLOT and QUANTILE are our two main tools for viewing distributions of data on the TI-83.  Histograms are common displays, but have flaws; the choice of class intervals is troubling as it is not unique.  The quantile plot is more reliable, but less common.  For interpretation purposes, remember that in a histogram tall boxes represent places with lots of data, while in a quantile plot those same high-density data places are represented by steepness.

*                    Create a stem plot by hand.  The stem plot is a convenient manual display; it is most useful for small data sets, but not all data sets make good stem plots.  Choosing the "stem" and "leaf" to make reasonable displays will require some practice.  Some ideas for a proper choice of stems: if you have many empty rows, you have too many stems.  Move one column to the left and try again.  If you have too few rows (all the data is on just one or two stems) you have too few stems.  Move to the right one digit and try again.  Some data sets will not give good pictures for any choice of stem, while some benefit from splitting or rounding (see the example on page 13).

*                    Describe shape, center, and spread.  From each of your graphs, you should be able to make general statements about the shape, center, and spread of the distribution of the variable being explored.  One of the main conclusions we want to make about lists of data when we are doing inference (Chapters 6 to 8) is whether the data is close to symmetric; many times "close enough" is, well, close enough!  We will discuss this in more detail when we see the Central Limit Theorem in Chapter 5.

Reading:    Section 1.2.

Day 4

Activity:     Dance Fever example.  Using the TI-83 to calculate summary statistics and to make box plot displays.

In addition to graphical displays, we often want to summarize a list of numbers with numerical measures.  You are already familiar with the most famous of these, the average or mean.  Less familiar, but just as important statistically, is the standard deviation, which measures how much the data are spread out.

Use the Arizona Temps data set to calculate the mean, the standard deviation, and the 5-number summary, and to make the associated box plot, for any of the variables.

Compare the box plots and numerical measures with the corresponding histograms and quantile plots you made on Day 3.  Note the similarities (where the data values are dense, and where they are sparse) but especially note the differences.  The box plots and numerical measures cannot describe shape very well.  On the other hand, histograms are messy to use to compare two lists.  The stem plot is tedious to modify.

Answer these questions about the Arizona Temps:

1)   Are high and low temperatures distributed the same way, other than the obvious fact that highs are higher than lows?  (When we talk of a distribution we mean a description of how the data values are centered and how they are spread out.  Sometimes this will be a simple statement; other times we need to see a graph.)

2)   How does a single case influence the calculator's answers?  (What if there was an outlier in the list?  If you didn't have an outlier, change one of the values to a ridiculously large value and recalculate the graphs and measures.  Notice the effects of these extreme values.)

3)   What information does the box plot disguise?

To calculate our summary statistics, we will use 1-Var Stats (to use List 1) or 1-Var Stats L2 for List 2, for example.  There are two screens of output; we will be mostly concerned with the mean x̄ (pronounced "x bar"), the standard deviation Sx, and the five-number summary from screen two.

To make the box plot, we use STAT PLOT Type and pick the 4th or 5th icon.  The fifth icon is the regular box plot, but the fourth one (the modified box plot) has a routine built in to flag possible outliers.  I recommend using the modified box plot as it shows at least as much information as the regular box plot, but includes the potential outliers, as defined by that routine.
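The numbers behind a box plot can be sketched in Python.  Two assumptions are worth stating loudly: the quartiles are taken as the medians of the lower and upper halves of the sorted list (leaving out the middle value when the count is odd), and possible outliers are flagged with the common 1.5 x IQR rule.  The function names and the data are made up.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).  Assumes at least two values."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values before the middle position
    upper = xs[half + n % 2:]     # values after the middle position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def possible_outliers(data):
    """Flag values more than 1.5 * IQR beyond the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

temps = [61, 64, 65, 67, 70, 71, 71, 73, 78, 84, 120]  # made-up values
print(five_number_summary(temps))   # (61, 65, 71, 78, 120)
print(possible_outliers(temps))     # [120]
```

Other software may compute quartiles slightly differently, so small disagreements with the calculator's Q1 and Q3 are not necessarily mistakes.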

Goals:         Compare numerical measures of center.  Summarize data with numerical measures and box plots.  Compare these new measures with the histograms, stem plots, and quantile plots you made on Day 3.

Skills:

*                    Understand the effect of outliers on the mean and median.  The mean (or average) is unduly influenced by outlying (unusual) observations, whereas the median is not.  Therefore, knowing when your distribution is skewed is helpful.  The mean is attracted to the outliers, so for lists with a high outlier, the mean will be higher than the median, and higher than if the outlier is removed.  The degree of influence is a matter of judgment. The median is almost completely unaffected by outliers.  For technical reasons, though, the median is not as common in scientific applications as the mean.

*                    Use the TI-83 to calculate summary statistics.  Calculating summary statistics using the TI-83 is as simple as entering numbers into STAT EDIT and then choosing STAT CALC 1-Var Stats L#, where L# represents the list you are interested in.  If you are doing some work by hand, you may have to organize information in a correct way, such as listing the numbers from low to high to find the median.  I recommend that you get used to using the statistical features of your calculator to produce the mean.  While I realize many of you are in the habit of calculating the mean by adding up all the numbers and dividing by the sample size, you will not be using the full features of your machine.  Later on you will find yourself entering data twice: once to calculate the mean, then again when you discover you needed the standard deviation and the data should be in a list.

*                    Compare several lists of numbers using stem plots and box plots.  For two short lists, the best simple approach is the back-to-back stem plot.  For longer lists, or for more than two lists, use box plots, side-by-side, or stacked.  At a glance, you can then assess which lists have typically larger values or more spread out values, etc.  To graph up to three box plots on the TI-83, enter a different list in each of the 3 plots you can display using STAT PLOT.

*                    Understand box plots.  You should know that the box plots for some lists don't tell the interesting part of those lists.  For example, box plots do not describe the shape of a distribution very well; you can only see where the quartiles are.  On the other hand, you should know that the box plot gives a very good first impression.
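The effect of a single extreme value on the mean and the median can be checked with a few lines of Python.  The list is made up, and the "outlier" is created by replacing the largest value with a deliberately ridiculous one.

```python
from statistics import mean, median

values = [61, 64, 65, 67, 70, 71, 71, 73, 78, 84, 95]  # made-up values
with_outlier = values[:-1] + [950]   # replace the largest value with 950

# The mean is pulled toward the extreme value; the median stays where it was.
print("original:     mean", round(mean(values), 1), "median", median(values))
print("with outlier: mean", round(mean(with_outlier), 1), "median", median(with_outlier))
```

In this example the median is 71 for both lists, while the mean more than doubles.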

Reading:    Section 1.2.

Day 5

Activity:     Interpreting standard deviation.

Today we will practice the ideas of mean and standard deviation by letting you play with some lists of your own.  There are many possible answers to some of the following questions; you should try to see if you understand why each of them may have many answers.  Also, the fourth list is an attempt to get you thinking about how spread out a data set is.  This is a requirement for understanding the standard deviation.  As usual, ask questions as you have them; we will be mostly working in groups today and I will listen to you work.

Create the following lists:

1)   A list of 10 numbers that has only one number below the mean.  (This seems to violate our interpretation of the average as being "representative".  You should understand how the mean might mislead us!)

2)   A list of 10 numbers that has the standard deviation greater than the mean.  (Note that the standard deviation seems to be more affected by outlying values than the mean is.)

3)   A list of 10 numbers that has a standard deviation of zero.

For your fourth list, start with any 21 numbers.  Find a number N such that 14 of the numbers in your list are within N units of the average.  For example, pick an arbitrary number N (say 4), calculate the average plus 4, the average minus 4, and count how many numbers in your list are between those two values.  If the count is less than 14, try a larger number for N (bigger than 4).  If the count is more than 14, try a smaller number for N (smaller than 4).  Continue guessing and counting until you get it close to 14 out of 21.  This number you settle on should be fairly close to the standard deviation.
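The guess-and-count search for N can be sketched in Python.  This is only an illustration of the counting, with hypothetical function names and a made-up list of 21 numbers; the point of the exercise is still to do the guessing yourself.

```python
from statistics import mean, stdev

def count_within(data, n_units):
    """Count the values that lie within n_units of the mean."""
    center = mean(data)
    return sum(1 for x in data if abs(x - center) <= n_units)

def smallest_n_covering(data, target):
    """Smallest distance N such that at least `target` values are within N of the mean."""
    center = mean(data)
    distances = sorted(abs(x - center) for x in data)
    return distances[target - 1]

# 21 made-up numbers
data = [52, 55, 58, 60, 61, 63, 64, 65, 66, 67, 68,
        69, 70, 71, 72, 74, 75, 77, 80, 83, 88]

print(count_within(data, 4))          # one guess: how many are within 4 of the mean?
n = smallest_n_covering(data, 14)
print(n, count_within(data, n))       # the N that captures 14 of the 21 values
print(stdev(data))                    # sample standard deviation, like Sx, for comparison
```

How close N comes to the standard deviation depends on the shape of the list, so treat the comparison as a rough guide rather than an exact rule.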

Finally, compare the standard deviation to the Inter Quartile Range (IQR = Q3 - Q1).

(You may use any extra time today to discuss Presentation 1 in your groups.)

Goals:         Interpret standard deviation as a measure of spread.

Skills:

*                    Understand standard deviation.  At first, standard deviation will seem foreign to you, but I believe that it will make more sense the more you become familiar with it.  In its simplest terms, the standard deviation is a non-negative number that measures how "wide" a data set is.  One common approximation is that the range of a data set is around 4 or 5 standard deviations.  Another approximation is that the standard deviation is roughly 3/4 of the IQR.  Eventually we will use the standard deviation in our calculations for statistical inference; until then, this measure is just another summary statistic, and getting used to this number is your responsibility.  The normal curve of the next section will further help us understand standard deviation.
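The "3/4 of the IQR" approximation can be checked for one idealized case.  The Python sketch below assumes the data follow a normal curve exactly; real lists will only agree roughly.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal curve: mean 0, standard deviation 1

# Distance between the 25th and 75th percentiles, measured in standard deviations.
iqr_in_sds = z.inv_cdf(0.75) - z.inv_cdf(0.25)

print(round(iqr_in_sds, 3))       # about 1.349
print(round(1 / iqr_in_sds, 3))   # about 0.741, so SD is roughly 3/4 of the IQR
```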

Reading:    Section 1.3.

Day 6

Activity:     Introduce the TI-83's normal calculations.  Homework 1 due today.

So far, we have used graphs to summarize distributions and summary statistics to describe certain features such as center and spread.  Today we will look at a specific distribution that is so common it has a name, the normal distribution.  We will see that data following this curve are extremely easy to describe: we only need to know the mean and the standard deviation.  The difference between this new material and yesterday's work is that if the data really does follow the normal curve, we can draw the distribution's curve by knowing just those two measures.  In yesterday's work, although we knew the mean and the standard deviation, that still gave us no indication of the shape of the distribution.

Generally, finding areas under curves is a calculus problem; ideally you end up with a formula where you simply plug in the endpoints desired.  Unfortunately, the normal curve is an example of a function for which the calculus approach does not produce such a formula.  Therefore to find areas under the normal curve, we must use some sort of approximation.  The normal table in the book is one approximation.  I don't recommend using that table though, because our calculator is easier, and it happens to be more precise!  However, this means that as you read the problems in the textbook, you must skip over the descriptions given about using z-scores to find areas.  Also, remember this discrepancy when checking answers in the back of the book for the odd problems: often the book's answers are different from the calculator's answers because they used the (inaccurate) table instead of the (accurate) calculator.

My advice to you when calculating areas using the normal curve is to always draw a picture.  See your class notes for the actual diagram and the way to label it.  I will try to always put two scales on my normal curves: the z-scale and the scale of real units.  I do this so that I can make sense of answers I get.  For example, if the shaded area on my diagram is obviously less than half of the total area, and if my answer is 0.75, then that makes no sense, and I must have entered something into the calculator incorrectly (or drawn the picture wrong!).
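For checking calculator work on a computer, an area under a normal curve can be computed in Python.  This is a sketch, not the TI-83 procedure: the function name normal_area and the example numbers (mean 100, standard deviation 15) are hypothetical.

```python
from statistics import NormalDist

def normal_area(low, high, mean=0.0, sd=1.0):
    """Area under the normal curve with the given mean and sd, between low and high."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)

# Hypothetical example: mean 100, sd 15, area between 85 and 130
# (that is, between z = -1 and z = +2).
area = normal_area(85, 130, mean=100, sd=15)
print(round(area, 4))   # about 0.8186

# Sanity check in the spirit of drawing a picture: the interval covers the
# mean and extends a full standard deviation or more on each side,
# so the area must be more than half.
assert area > 0.5
```

The same function on the z-scale and on the scale of real units gives the same area, which mirrors the two scales drawn on the diagram.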

Find the following areas:

1)     The area between -1 and +1 on the standard normal curve.