Day By Day Notes for MATH 301

Fall 2007

Day 1

Activity:    Go over syllabus.  Take roll.  Overview examples: Randomness - coin example.  Gilbert trial.  Random Sampling.

I have divided this course into three units.  Unit 1 (Days 1 through 9) is about summarizing data and basic probability.  Unit 2 (Days 10 through 19) is about various common distributions and sampling distributions.  Unit 3 (Days 21 through 27) is about statistical inference.  One common theme throughout the course is mathematical modeling.  We are often trying to explain and predict what we see in the real world with equations and models.  Before we can say that a model is appropriate, we must understand the consequences of the equations we have chosen.  In Unit 2, we will explore a number of such models and their effects.  We will use techniques from Unit 1 to describe our models, and ultimately we will explore the ideas in Unit 3 to use our models in real life situations.  I will try to keep us focused on the "big picture" of statistics as a discipline as we proceed, but sometimes our focus will be on the details of algebra or calculus to get us over some hurdles.

I use the Gilbert trial example on the first day because it demonstrates all three of the course's main ideas in action.  The charts we see are examples of how to organize and summarize information that may be complicated by several variables or dimensions (Unit 1).  The argument about whether we would see results such as this one if there really were no relationship between the variables is an example of the probability we will study in Unit 2.  And the trial strategy itself is what statistical inference is all about (Unit 3).  Many of you will encounter inference when you read professional journals in your field and researchers use statistics to support their conclusions of improved treatments or to estimate a particular proportion or average.

I believe that to be successful in this course, you must actually read the text and these notes carefully and work many problems.  The most important thing is to engage yourself in the material.  However, our class activities will often be unrelated to the homework you practice and/or turn in for the homework portion of your grade; instead they are meant to build understanding of the underlying principles.  For example, later today we will simulate Simple Random Sampling, with each group in the class drawing 60 samples.  This is something you would never do in practice, but I think it will demonstrate several lessons for us.  In these notes, I will try to point out to you when we're doing something to gain understanding, and when we're doing something to gain skills.

Each semester, I am disappointed with the small number of students who come to me for help outside of class.  I suspect some of you are embarrassed to seek help, or you may feel I will think less of you for not "getting it" on your own.  Personally, I think that if you are struggling and cannot make sense of what we are doing, and don't seek help, you are cheating yourself out of your own education.  I am here to help you learn statistics.  Please ask questions when you have them; there is no such thing as a stupid question.  Often other students have the same questions but are also too shy to ask them in class.  If you are still reluctant to ask questions in class, come to my office hours or make an appointment.  Incidentally, when I first took statistics, I didn't understand it all on my own either, and I too didn't go to the instructor for help.  I also didn't get as high a grade as I could have!

I believe you get out of something what you put into it.  Very rarely will someone fail a class by attending every day, doing all the assignments, and working many practice problems; typically people fail by not applying themselves enough - either through missing classes, or by not allotting enough time for the material.  Obviously I cannot tell you how much time to spend each week on this class; you must all find the right balance for you and your life's priorities.  One last piece of advice: don't procrastinate.  I believe statistics is learned best by daily exposure.  Cramming for exams may get you a passing grade, but you are only cheating yourself out of understanding and learning.

Note: In these notes, I will put the daily task in gray background.

Creating random samples.  The text briefly discusses collecting random samples.  I want us to gain some practical experience collecting real simple random samples, so we will use three methods of sampling today: dice, a table of random digits, and our calculator.  To make the problem feasible, we will only use a population of size 6.  (I know this is unrealistic in practice, but the point today is to see how randomness works, and to trust that the results extend to larger problems.)  Pretend that the items in our population (perhaps they are people) are labeled 1 through 6.  For each of our methods, you will have to decide in your group what to do with "ties" (a number that comes up again after it has already been selected).  Keep in mind the goal of simple random sampling: at each stage, each remaining item has an equal chance to be the next item selected.

By rolling dice, generate a sample of three people.  (Let the number on the die correspond to one of the items.)  Repeat 20 times, giving 20 samples of size 3.

Using the table of random digits, starting at any haphazard location, select three people.  (Let the random digit correspond to one of the items.)  Repeat 20 times, giving 20 more samples of size 3.

Using your calculator, select three people.  For example, the TI-83 command MATH randInt( 2,  4,  5 ) will produce 5 integers between 2 and 4, inclusive.  (If you leave off the third number, only one value will be generated.)  If your calculator has only a rand function, you can achieve the same result as the TI-83 MATH randInt( 2, 4 ) with int( 3*rand ) + 2: rand is a decimal between 0 and 1, so int( 3*rand ) is 0, 1, or 2, and adding 2 gives 2, 3, or 4.  Repeat 20 times, giving 20 more samples of size 3.

Your group should have drawn 60 samples at the end.  Keep careful track of which samples you selected; record your results in order, as 125 or 256, for example.  (125 would mean items 1, 2, and 5 were selected.)  We will pool the results of everyone's work together on the board.
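For anyone who wants to repeat the activity on a computer, here is a minimal sketch in Python (Python is not part of the course, and the function name draw_srs and the seed are my own choices for illustration).  It handles repeats by ignoring them and drawing again, and records each sample in order, as described above.

```python
import random
from collections import Counter

def draw_srs(population, size, rng):
    """Draw a simple random sample one item at a time, ignoring repeats."""
    chosen = []
    while len(chosen) < size:
        item = rng.choice(population)   # like one die roll or one randInt( 1, 6 )
        if item not in chosen:          # a repeat: ignore it and draw again
            chosen.append(item)
    return tuple(sorted(chosen))        # record 5, 1, 2 as (1, 2, 5)

rng = random.Random(301)                # arbitrary fixed seed so the run can be repeated
population = [1, 2, 3, 4, 5, 6]
samples = [draw_srs(population, 3, rng) for _ in range(60)]
tally = Counter(samples)

# Print each distinct sample (written like 125) with the number of times it was drawn.
for sample, count in sorted(tally.items()):
    print("".join(str(i) for i in sample), count)
```

There are only 20 possible samples of size 3 from 6 items, so with 60 draws each one should appear about 3 times; the tally shows how uneven "about" can be.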

Goals:  (In these notes, I will summarize each day's activity with a statement of goals for the day.)

Review course objectives: collect data, summarize information, model with probability, make inferences.

Gain practice taking random samples.  Understand what a simple random sample is.  Become familiar with randInt(.  Accept that the calculator is random.

Skills:  (In these notes, each day I will identify skills I believe you should have after working the day's activity, reading the appropriate sections of the text, and practicing exercises in the text.)

•   Know the definition of a Simple Random Sample (SRS).  Simple Random Samples can be defined in two ways:  1)  An SRS is a sample where, at each stage, each remaining item has an equal chance to be the next item selected.  2)  An SRS is the result of any scheme in which every possible sample has an equal chance to be the sample selected.

•   Select an SRS from a list of items.  The TI-83 command randInt( will select numbered items from a list randomly.  If a number selected is already in your sample, ignore that number and get a new one.  Remember, as long as each remaining item is equally likely to be chosen as the next item, you have drawn an SRS.

•   Understand the real world uses of SRS.  In practice, simple random samples are not that common.  It is often impractical (or impossible) to have a list of the entire population available.  However, the idea of simple random sampling is essentially the foundation for all the other types of sampling.  In that sense, then, it is very common.

Reading:    (The reading mentioned in these notes refers to what reading you should do for the next day's material.)

Chapter 1.  Section 2.1.  Also bring a printed copy of the Arizona Temps dataset, either from the link or at the end of this document.

Day 2

Activity:    Numerical summaries of data.  Dance Fever example.  Arizona Temps.

As we begin Unit 1, let's look at the "big picture" first.  When we choose a model to describe a real world phenomenon, we have to be able to describe and summarize what we have.  Sometimes this will involve numerical calculations; other times we use pictures and graphs.  Still other times we will use advanced mathematics like calculus.  The first methods we will look at are the numerical measures.  I hope that you are already familiar with some of these measures, such as the mean and median.  Others, like the standard deviation, will perhaps be mysterious.  Of course our goal is to unmask the mystery!

We often want two measurements about numerical data: the center and the spread.  The reason for these two kinds of measurements will be clearer after we see the Central Limit Theorem in Unit 2.  Briefly, "center" is the typical data value, and "spread" refers to how variable the data values are.  We will first practice using technology to produce the calculations for us.  Along the way, I hope we gain some understanding too.

Use the Arizona Temps dataset to calculate means, standard deviations, and the 5-number summaries.  To calculate our summary statistics with the TI-83, we will use STAT CALC 1-Var Stats (to use List 1) or STAT CALC 1-Var Stats L2 for List 2, for example.  There are two screens of output; we will be mostly concerned with the mean x̄ and the standard deviation Sx on screen one, and the five-number summary on screen two.

As you explore using technology to summarize data, answer these questions:

1)     Are high and low temperatures distributed the same way, other than the obvious fact that highs are higher than lows?  (When we talk of a distribution we mean a description of how the data values are centered and how they are spread out.  Sometimes this will be a simple statement; other times we need to see a graph.)

2)    How does a single case affect the calculator's routines?  (What if there was an outlier in the list?  If you didn't have an outlier, change one of the values to a ridiculously large value and recalculate the graphs and measures.  Notice the effects of these extreme values.)

3)    What information does the 5-number summary disguise?

Now, create the following lists:

1)     A list of 10 numbers that has only one number below the mean.  (This seems to violate our interpretation of the average as being "representative".  You should understand how the mean might mislead us!)

2)    A list of 10 numbers that has the standard deviation greater than the mean.  (Note that the standard deviation seems to be more affected by outlying values than the mean is.)

3)    A list of 10 numbers that has a standard deviation of zero.

4)    For your fourth list, start with any 21 numbers.  Find a number N such that 14 of the numbers in your list are within N units of the average.  For example, pick an arbitrary number N (say 4), calculate the average plus 4, the average minus 4, and count how many numbers in your list are between those two values.  If the count is less than 14, try a larger number for N (bigger than 4).  If the count is more than 14, try a smaller number for N (smaller than 4).  Continue guessing and counting until you get it close to 14 out of 21.  This number you settle on should be fairly close to the standard deviation.

Finally, compare the standard deviation to the Interquartile Range (IQR = Q3 - Q1).
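If you want to check your lists away from the calculator, the following Python sketch computes the same kinds of summaries.  The three lists are made-up examples, not the only possible answers, and the quartile rule used here (the median of each half of the sorted data) is an assumption that may not match every calculator or text.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max; quartiles are the medians of the two halves.

    Needs at least two values.  With an odd count the middle value is left out
    of both halves (an assumed convention; other rules exist).
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[len(values) - half:]
    return (values[0], statistics.median(lower), statistics.median(values),
            statistics.median(upper), values[-1])

# Made-up lists of 10 numbers, one for each of the first three requests.
one_below = [0] + [10] * 9          # mean is 9, so only the 0 is below the mean
big_spread = [0] * 9 + [100]        # mean is 10, standard deviation is larger
no_spread = [7] * 10                # every value equals the mean

for name, values in [("one_below", one_below), ("big_spread", big_spread),
                     ("no_spread", no_spread)]:
    print(name, "mean:", statistics.mean(values),
          "Sx:", round(statistics.stdev(values), 2),
          "summary:", five_number_summary(values))
```

Here statistics.stdev divides by n - 1, which corresponds to Sx on the TI-83 output rather than the other standard deviation the calculator reports.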

Goals:     Compare numerical measures of center and spread.  Use technology to summarize data with numerical measures.  Interpret standard deviation as a measure of spread.

Skills:

•   Understand the effect of outliers on the mean.  The mean (or average) is unduly influenced by outlying (unusual) observations.  Therefore, knowing when your distribution is skewed is helpful.

•   Understand the effect of outliers on the median.  The median is almost completely unaffected by outliers.  For technical reasons, though, the median is not as common in scientific applications as the mean.

•   Use the TI-83 to calculate summary statistics.  Calculating may be as simple as entering numbers into your calculator and pressing some buttons: STAT CALC 1-Var Stats.  Or, if you are doing some things by hand, you may have to organize information the correct way, such as listing the numbers from low to high.  Please get used to using the statistical features of your calculator to produce the means, standard deviations, etc.  While I know you can calculate the mean by simply adding up all the numbers and dividing by the sample size, you will not be in the habit of using the full features of your machine, and later on you will be "missing the boat".

•   Understand standard deviation.  At first, standard deviation will seem foreign to you, but I believe that it will make more sense the more you become familiar with it.  In its simplest terms, the standard deviation is a non-negative number that measures how "wide" a dataset is.  One common interpretation is that the range of a dataset is about 4 or 5 standard deviations.  Another interpretation is that the standard deviation is roughly ¾ times IQR; that is, the standard deviation is a bit smaller than the IQR.  Eventually we will use the standard deviation in our calculations for statistical inference; until then, this measure is just another summary statistic, and getting used to this number is your goal.  The normal curve of Chapter 3 will further help us understand standard deviation.
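The two rules of thumb above (about two thirds of the data within one standard deviation of the average, and a standard deviation of roughly ¾ of the IQR) can be tried out with a short Python sketch.  The data here are simulated bell-shaped values, which is an assumption; skewed or lumpy data can behave quite differently.  The function names and the seed are mine.

```python
import random
import statistics

def fraction_within(data, n_units):
    """Fraction of the values lying within n_units of the average."""
    average = statistics.mean(data)
    return sum(1 for x in data if abs(x - average) <= n_units) / len(data)

def distance_covering_two_thirds(data):
    """Smallest distance N from the average that captures two thirds of the values."""
    average = statistics.mean(data)
    distances = sorted(abs(x - average) for x in data)
    count = round(2 * len(data) / 3)        # 14 when there are 21 values
    return distances[count - 1]

rng = random.Random(2007)                   # arbitrary fixed seed
data = [rng.gauss(70, 10) for _ in range(2000)]   # simulated, roughly bell-shaped

sd = statistics.stdev(data)
n_units = distance_covering_two_thirds(data)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (Python's default rule)
iqr = q3 - q1

print("standard deviation:", round(sd, 2))
print("N capturing two thirds:", round(n_units, 2))
print("fraction within one SD:", round(fraction_within(data, sd), 3))
print("SD divided by IQR:", round(sd / iqr, 3))
```

This is the same guess-and-count search as the fourth list on Day 2, except that sorting the distances from the average finds N in one step.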

Reading:    Sections 2.2 to 2.4.

Day 3

Activity:    Graphical summaries of data.

Numerical summaries may oversummarize.  That is, important information may be lost.  An alternative to reporting just a few summary statistics is a graph of the data, highlighting the key features.  We will look at four techniques today, each with its own strengths and weaknesses.

The stem plot is a hand technique, most useful for small (under 40 values) data sets.  It is basically a quick way to make a frequency chart, but always with class intervals using the base 10 system.  This means the intervals will always be ten "units" wide, such as 10 to 19, or 0 to 9, or .00 to .09.  The "unit" chosen is called the stem, and the next digit after the stem is called the leaf.

The histogram is a picture of the location of data on a number line.  It is composed of rectangles whose areas reflect the relative frequency of data.  Area is the key idea, not height, although if all rectangles have the same width then area and height are proportional.  The vertical scale is density, or area per unit width.  One drawback of histograms is that knowing the relative frequency in an interval does not indicate where in the interval the data may be concentrated.  Thus, clustering cannot be determined completely with a histogram.
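To see why area, not height, is the key idea, here is a small Python sketch that computes the density (relative frequency per unit width) for class intervals of unequal width.  The data values and the interval edges are made up for illustration.

```python
def histogram_densities(data, edges):
    """Return (left, right, relative_frequency, density) for each class interval."""
    rows = []
    n = len(data)
    for left, right in zip(edges[:-1], edges[1:]):
        # Count values in [left, right); the last interval also includes its right edge.
        last = right == edges[-1]
        count = sum(1 for x in data if left <= x < right or (last and x == right))
        rel_freq = count / n
        rows.append((left, right, rel_freq, rel_freq / (right - left)))
    return rows

temps = [52, 55, 58, 61, 63, 64, 66, 67, 68, 70, 71, 73, 75, 79, 84, 91]  # made-up values
edges = [50, 60, 65, 70, 80, 100]                                          # unequal widths on purpose
rows = histogram_densities(temps, edges)

for left, right, rel_freq, density in rows:
    print(f"{left} to {right}: relative frequency {rel_freq:.4f}, height {density:.4f}")

# The areas (height times width) add up to 1, whatever widths are chosen.
total_area = sum(density * (right - left) for left, right, _, density in rows)
print("total area:", round(total_area, 6))
```

The intervals 50 to 60 and 60 to 65 hold the same share of the data, but the narrower one is drawn twice as tall, because the same area is squeezed into half the width.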

The box plot is a graph of the five-number summary.  The box portion is the middle half of the data, from the first quartile to the third quartile.  The whiskers are the lines drawn outward from the box, representing the upper and lower quarters of the data.

QUANTILE is a program I wrote that plots the sorted data in a list and "stacks" the values up.  This is known as a quantile plot.  Basically we are graphing the individual data value versus the rank, or percentile, in the data set.  Quantile plots always go up from left to right.  The syntax is PRGM EXEC QUANTILE ENTER.  The program will ask you for the list where you've stored the data.  A and B are temporary lists used by the program, so if you have data in these lists already, store them in another list before executing.
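The QUANTILE program itself is not listed in these notes, so the following Python sketch is only a guess at the idea: sort the data and pair each value with its rank, expressed as a fraction of the sample size.  The choice of rank divided by n is an assumption; other conventions exist.

```python
def quantile_points(data):
    """Pair each sorted value with the fraction of the data at or below its rank."""
    values = sorted(data)
    n = len(values)
    return [(value, rank / n) for rank, value in enumerate(values, start=1)]

highs = [84, 91, 79, 88, 95, 102, 79, 86]   # made-up temperatures
points = quantile_points(highs)
for value, fraction in points:
    print(value, round(fraction, 3))
```

Plotting the value on the horizontal axis and the fraction on the vertical axis gives a graph that never goes down from left to right, and that climbs quickly wherever many data values are bunched together.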

Generally, to view box plots, histograms, scatter plots, and (later) normal probability plots, we have three chores to perform before our machine will show us the graphical display we want.  We must: 1) Enter the data into the calculator, 2) Choose the right options for the display we want, and 3) Set up the proper window settings.  The commands to do these activities on the calculator are:

1)     STAT EDIT Use one of the lists to enter data, L1 for example; the other L's can be used too.  The L's are convenient work lists.  At times, you may find that you want more meaningful names.  One way to do this is to store the list in a new named list after entering numbers.  The syntax for this is L1 -> NEWL, assuming the data was entered in L1 and you want the new name to be NEWL.

2)    2nd STATPLOT 1 On Use this screen to designate the plot settings.  You can have up to three plots on the screen at once.  For histograms, we will only use one at a time.  For box plots, we often use multiple displays.

3)    ZOOM 9 This command centers the window "around" your data.  It is always a good idea to see what the WINDOW settings are.  If you then change any of the WINDOW settings, you will then press GRAPH to see the changes.  (If you use ZOOM 9 again, the changes you just made don't get used!)

To make the box plot, we use STAT PLOT Type and pick the 4th or 5th icons.  The fifth icon is the true box plot, but the fourth one (the modified box plot) has a routine built in to flag possible outliers.  I recommend using the modified box plot as it shows at least as much information as the regular box plot, but includes the potential outliers, as defined by this procedure.
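The notes do not spell out the outlier procedure.  A common rule, and the one I understand the TI-83 to use, flags any value more than 1.5 times the IQR beyond a quartile; treat both the multiplier and the quartile rule in this Python sketch as assumptions.

```python
import statistics

def flag_outliers(data, multiplier=1.5):
    """Return the values lying more than multiplier * IQR beyond the quartiles."""
    values = sorted(data)
    half = len(values) // 2
    q1 = statistics.median(values[:half])                 # median of the lower half
    q3 = statistics.median(values[len(values) - half:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - multiplier * iqr
    high_fence = q3 + multiplier * iqr
    return [x for x in values if x < low_fence or x > high_fence]

temps = [61, 63, 64, 66, 67, 68, 70, 71, 73, 120]   # made-up list with one wild value
print(flag_outliers(temps))
```

For this list the quartiles are 64 and 71, so the fences sit at 53.5 and 81.5 and only the 120 is flagged.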

Use the Arizona Temps dataset to practice creating the histograms, stem plots, box plots, and quantile plots for several lists.  Compare and interpret the graphs.  Identify shape, center, and spread.

Compare these measures with the corresponding numerical measures you calculated on Day 2.  Notice that the box plots and numerical measures cannot describe shape very well.  The histograms are hard to use to compare two lists.  The stem and leaf is difficult to modify.

Using the plots now instead of the summary statistics, answer these questions again:

1)     Are high and low temperatures distributed the same way, other than the obvious fact that highs are higher than lows?

2)    How does a single case affect the calculator's routines?  (What if we had had an outlier?)

3)    What information does the 5-number summary disguise?

Goals:     Be able to use the calculator to make a histogram, box plot, or a quantile plot.  Be able to make a stem plot by hand.

Skills:

•   Summarize data into a frequency table.  The easiest way to make a frequency table is to TRACE the boxes in a histogram and record the classes and counts.  You can control the size and number of the classes with Xscl and Xmin in the WINDOW menu.  The decision as to how many classes to create is arbitrary; there isn't a "right" answer.  One popular suggestion is to try the square root of the number of data values.  For example, if there are 25 data points, use 5 intervals.  If there are 50 data points, try 7 intervals.  This is a rough rule; you should experiment with it.  The TI-83 has a rule for doing this; I do not know what the rule is.  You should experiment by changing the interval width and see what happens to the diagram.

•   Use the TI-83 to create an appropriate histogram, box plot, or quantile plot.  STATPLOT is our main tool for viewing distributions of data.  Histograms are common displays, but have flaws; the choice of class width is troubling as it is not unique.  The quantile plot is more reliable, but less common.  For interpretation purposes, remember that in a histogram tall boxes represent places with lots of data, while in a quantile plot those same high-density data places are steep.

•   Create a stem plot by hand.  The stem plot is a convenient manual display; it is most useful for small datasets, but not all datasets make good stem plots.  Choosing the "stem" and "leaves" to make reasonable displays will require some practice.  Some notes for proper choice of stems: if you have many empty rows, you have too many stems.  Move one column to the left and try again.  If you have too few rows (all the data is on just one or two stems) you have too few stems.  Move to the right one digit and try again.  Some datasets will not give good pictures for any choice of stem, and some benefit from splitting or rounding (see the examples on page 28).

•   Describe shape, center, and spread.  From each of our graphs, you should be able to make general statements about the shape, center, and spread of the distribution of the variable being explored.

•   Compare several lists of numbers using box plots.  For two lists, the best simple approach is the back-to-back stem plot.  For more than two lists, I suggest trying box plots, side-by-side, or stacked.  At a glance, then, you can assess which lists have typically larger values or more spread out values, etc.  To graph up to three box plots on the TI-83, enter a different list in each of the 3 plots you can display using STATPLOT.

•   Understand box plots.  You should know that the box plots for some lists don't tell the interesting part of those lists.  For example, box plots do not describe shape very well (apart from rough symmetry); you can only see where the quartiles are.  On the other hand, you should know that the box plot gives a very good first impression.

Reading:    Sections 3.1 to 3.3

Day 4

Activity:    Sample Spaces.  Venn Diagrams.  Coins, Dice.  Pascal's Triangle.  Homework 1 due today.

Our text skims over basic probability, so I want to spend a few sessions on some additional groundwork.  This will include the sample space method of doing probability as well as some introductory combinatorics.  There is also some benefit to simulating experiments to estimate a probability empirically; unfortunately, unless the simulation is performed many thousands of times, such estimates are usually not reliable.

In our sample space method, we will list the equally likely ways the experiment could have come out, and then simply count which ones are favorable to our particular question.  Unfortunately, oftentimes the sample space is too large to list.  In such cases, it may still be beneficial to at least begin the list or to contemplate the process without actually completely listing the sample space.  We will practice these ideas with dice today.

Using either complete sampling spaces (theory) or simulation, find (or estimate) these chances:

1)     Roll two dice, one colored, one white.  Find the chance of the colored die being less than the white die.  Do this part both with a theoretical sampling space diagram and with a simulation of 100 dice rolls.  If you would rather use randInt( 1, 6 ) on your TI-83 instead of actually rolling dice, that would be fine.  The expression ( randInt( 1, 6 ) < randInt( 1, 6 ) ) will yield either a 0 (false) or a 1 (true).  Note that by continuing to hit ENTER the command will be repeated.  This way you do not have to retype for every trial.  If you'd like to do it all at once, you can use the sequence command seq( ( randInt( 1, 6 ) < randInt( 1, 6 ) ), X, 1, 100 ) -> L1.  I like to store the result in L1 and then use 1-Var Stats or a histogram to see what it looks like.  The 100 in the command is the number of times the operation will be repeated.  100 is probably not enough trials, but the TI-83 isn't really a computer, so putting in something like 1000 will probably crash it!

2)    Roll three dice and find the chance that the largest of the three dice is a 6.  (If two or more dice are tied for the largest, use that value; for example, the largest value when 6, 6, 4 is rolled is 6.)  If you would like to simulate this one, you can keep track of whether you have a six or not on three rolls using max( randInt( 1, 6, 3 ) ), and then putting it in the above seq command.  However, see the skill note below for simulations; you may need to do this too many times to make it worthwhile.

3)    Roll three dice and find the chance of getting a sum of less than 8.  I suggest drawing sample spaces to do this one.  You do not need to draw the entire sample space for three dice, but maybe by drawing a portion of it you will be able to see what you need to pay attention to.  I will give you hints in class when you get to this one if you ask me.  If you would like to simulate this one, you can keep track of the sum on three rolls using  sum( randInt( 1, 6, 3 ) ), and then putting it in the above seq command.  However, see the skill note below for simulations; you may need to do this too many times to make it worthwhile.
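After you have tried these by hand, you can check your work with a Python sketch like the one below.  It lists the complete sample spaces with itertools.product and counts the favorable outcomes, then runs one small simulation for comparison.  The helper name simulate and the seed are my own choices.

```python
import random
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Problem 1: all 36 equally likely (colored, white) outcomes.
two_dice = list(product(faces, repeat=2))
p_colored_less = Fraction(sum(1 for c, w in two_dice if c < w), len(two_dice))

# Problems 2 and 3: all 216 equally likely outcomes for three dice.
three_dice = list(product(faces, repeat=3))
p_max_is_six = Fraction(sum(1 for roll in three_dice if max(roll) == 6), len(three_dice))
p_sum_under_8 = Fraction(sum(1 for roll in three_dice if sum(roll) < 8), len(three_dice))

print("colored < white:", p_colored_less)      # 5/12, that is 15 out of 36
print("largest is a 6:", p_max_is_six)         # 91/216
print("sum less than 8:", p_sum_under_8)       # 35/216

def simulate(event, dice, trials, rng):
    """Estimate a probability as the observed fraction of trials where event is true."""
    hits = 0
    for _ in range(trials):
        roll = [rng.randint(1, 6) for _ in range(dice)]
        if event(roll):
            hits += 1
    return hits / trials

rng = random.Random(4)                          # arbitrary fixed seed
estimate = simulate(lambda roll: roll[0] < roll[1], 2, 100, rng)
print("simulated colored < white, 100 trials:", estimate)
```

Changing the seed and rerunning shows how much a 100-trial estimate can wander around the exact answer.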

The Venn diagram is a way of partitioning a sample space into events with (usually) circles.  We can use shading to indicate intersections and unions, and we can even prove some results using such diagrams.  On Day 6 we will prove De Morgan's rules using Venn diagrams.

Goals:     Create sample spaces.  Use Venn diagrams to organize sample spaces.  Use simulation to estimate probabilities.

Skills:

•   Know the definitions of Sample Space, Event, Outcome, etc.  The basic language of probability will be used throughout the course, so it is important for you to be conversant in it.  Basically, the sample space is a collection of outcomes, our basic smallest sets.  Events are collections of outcomes, and are usually subsets of the sample space, although the sample space itself is also an event.

•   Be able to use a Venn diagram.  The Venn diagram is a way of partitioning the sample space into mutually exclusive regions.  It can be useful for simply organizing sets, or sometimes is quite useful in understanding proofs (as we will see with the inclusion/exclusion formula on Day 6).

•   List simple sample spaces.  Flipping coins and rolling dice are common events to us, and listing the possible outcomes lets us explore probability distributions.  We will not delve too deeply into probability rules; rather, we are more interested in the ideas of probability and I think the best way to accomplish this is by example.

•   Simulation can be used to estimate probabilities.  If the number of repetitions of an experiment is large, then the resulting observed frequency of success can be used as an estimate of the true unknown probability of success.  However, a "large" enough number of repetitions may be more than we can reasonably perform.  For example, for problem 1 today, a sample of 100 will give results between 32/100 and 51/100 (.32 to .51) 95% of the time.  That may not be good enough for our purposes.  Even with 500, the range is 187/500 to 230/500 (.374 to .460).  Eventually the answers will converge to a useful percentage; the question is how soon that will occur.  We will have answers to that question after Chapter 3.
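As a preview, ranges like the ones quoted for problem 1 can be approximated with a few lines of Python.  This sketch assumes the true chance is 15/36 (from the two-dice sample space) and uses the normal-curve approximation with the multiplier 1.96, neither of which has been developed in these notes yet; the function name range_95 is mine.

```python
import math

def range_95(p, n):
    """Approximate range that holds the observed proportion about 95% of the time."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - 1.96 * standard_error, p + 1.96 * standard_error

p = 15 / 36   # chance the colored die is less than the white die
for n in (100, 500, 10000):
    low, high = range_95(p, n)
    print(n, "trials:", round(low, 3), "to", round(high, 3))
```

The width of the range shrinks like one over the square root of the number of trials, so cutting the width in half takes four times as many trials.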

•   Recognize the usefulness and properties of Pascal's Triangle.  Pascal's Triangle is old (known to the Indians, the Persians, and the Chinese as early as the 10th century) yet is still quite useful.  There are just two rules to construct Pascal's Triangle:  each row begins and ends with a 1, and each entry is the sum of the two entries above it to the left and the right.  From such a simple construction, though, we encounter many relationships: the combination formula, the triangular numbers, the Fibonacci numbers, powers of 2, among others.  Our chief interest is in the combination formula and its relationship to the binomial distribution.
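The two construction rules translate directly into code.  Here is a minimal Python sketch (the function name pascal_rows is mine) that builds the first few rows and checks three of the relationships mentioned above: the combination formula, the powers of 2, and the triangular numbers.

```python
import math

def pascal_rows(count):
    """Build the first count rows (count >= 1) using only the two construction rules."""
    rows = [[1]]
    for _ in range(count - 1):
        previous = rows[-1]
        # Each inner entry is the sum of the two entries above it.
        inner = [previous[i] + previous[i + 1] for i in range(len(previous) - 1)]
        rows.append([1] + inner + [1])          # each row begins and ends with a 1
    return rows

rows = pascal_rows(7)
for row in rows:
    print(row)

# Entry r of row n (both counted from 0) is the combination count nCr.
print(all(rows[n][r] == math.comb(n, r) for n in range(7) for r in range(n + 1)))
# Each row adds up to a power of 2.
print([sum(row) for row in rows])
# The third entry of each row gives the triangular numbers.
print([row[2] for row in rows[2:]])
```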

Reading:    Page 95.

Day 5

Activity:    Presentation 1.

Summaries (Chapter 2)

Gather 3 to 5 variables on at least 20 subjects; the source is irrelevant, but knowing the data will help you explain its meaning to us.  Be sure to have at least one numerical and at least one categorical variable.  Demonstrate that you can summarize data graphically and numerically.

Combinations and Permutations: As we try to count outcomes in an event, to determine the probability of the event, we discover many useful patterns.  Pascal's triangle contains what are known as the binomial coefficients, or the combination formula.  The two main relationships we need from this material are nCr and nPr.  Basically, nCr counts subsets and nPr counts orderings of those subsets.

Goals:     Continue exploring Pascal's Triangle and how it relates to counting (permutations and combinations).

Skills:

•   Know the Permutation and Combination formulas.  When counting the number of ways of choosing items or ordering items, our formulas are nCr and nPr, respectively.  You will need to work enough problems so that you know when to use each of them.  One way to keep them straight is to think of a Combination as a Committee of people, and a Permutation as a Photograph of that committee.  (There are more permutations than combinations for a particular choice of n and r.)  Also don't forget our trick of listing the complete sample space, but only for small problems!
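The committee and photograph picture can be checked by listing everything for a small case.  In this Python sketch the five names are hypothetical; math.comb and math.perm play the roles of nCr and nPr.

```python
import math
from itertools import combinations, permutations

people = ["Ann", "Bob", "Cal", "Dee", "Eve"]     # hypothetical names

committees = list(combinations(people, 3))       # order does not matter
photographs = list(permutations(people, 3))      # order (left to right) matters

print(len(committees), math.comb(5, 3))          # 10 10
print(len(photographs), math.perm(5, 3))         # 60 60

# Each committee of 3 can line up for its photograph in 3! = 6 ways.
print(math.perm(5, 3) == math.comb(5, 3) * math.factorial(3))
```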

Reading:    Class Notes.

Day 6

Activity:    Finish Combinations and Permutations.

Arrange the letters in FREDA.  Arrange the letters in FREED.  Arrange the letters in ERRORS.  Arrange the letters in SETTER.
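Once you have worked these by hand, a Python sketch like the following can confirm the counts two ways: by the formula (the factorial of the word length, divided by the factorial of each letter's repeat count) and by brute-force listing.  The function names are mine.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def count_by_formula(word):
    """n! divided by the factorial of each letter's repeat count."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total

def count_by_listing(word):
    """List every ordering and keep only the distinct ones (small words only)."""
    return len(set(permutations(word)))

for word in ["FREDA", "FREED", "ERRORS", "SETTER"]:
    print(word, count_by_formula(word), count_by_listing(word))
```

The listing approach is the "complete sample space" trick again; it is fine for six letters but becomes hopeless quickly as words get longer.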

Basic probability rules:

1)     Probability is a number between 0 and 1, inclusive.

2)    Mutually Exclusive events add when finding the union.

3)    Mutually Exclusive and exhaustive events add to one.

Use Venn diagrams to "prove":

A = (AB) ∪ (AB')

(A ∪ B)' = A'B' and (AB)' = A' ∪ B'.

Demonstrate the Inclusion/Exclusion formula with a 3-set Venn diagram.
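A Venn diagram argument covers all cases at once; a computer can only confirm the identities for particular events.  Still, it is reassuring to see them hold.  This Python sketch uses the two-dice sample space with three events of my own choosing (first die is a 6, second die is a 6, the sum is 7) and Python's set operations & (intersection), | (union), and - (difference).

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))      # 36 equally likely outcomes

A = {roll for roll in space if roll[0] == 6}     # first die is a 6
B = {roll for roll in space if roll[1] == 6}     # second die is a 6
C = {roll for roll in space if sum(roll) == 7}   # the sum is 7

def prob(event):
    return Fraction(len(event), len(space))

def complement(event):
    return space - event

# A = (AB) ∪ (AB')
assert A == (A & B) | (A & complement(B))
# De Morgan's rules
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)
# Inclusion/exclusion for two events: "at least one six"
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Inclusion/exclusion for three events
three_way = (prob(A) + prob(B) + prob(C)
             - prob(A & B) - prob(A & C) - prob(B & C)
             + prob(A & B & C))
assert prob(A | B | C) == three_way

print("P(at least one six) =", prob(A | B))       # 11/36
print("P(A) + P(B) alone would give", prob(A) + prob(B))
```

Adding P(A) and P(B) without subtracting the intersection counts the outcome (6, 6) twice, which is why that sum is too large.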

Goals:     Know the rules of probability, including addition, complement, and inclusion/exclusion.  The multiplication rule will be covered on Day 7.

Skills:

•   Understand the probability rules.  Being adept at probability begins with knowing definitions and knowing basic formulas.  For example, you can't prove things about mutually exclusive sets if you can't recite the definition of mutually exclusive.  Memorize at first; later it becomes "learned", not "memorized".

•   Relate the rules to sample spaces.  Remember that the rules we're discussing are all based on counting elements in sample spaces.  Sometimes it is helpful to have a few "standard" examples in mind so conjectures or steps in reasoning can be verified.  For example, the inclusion exclusion principle is shown well with the two-dice problem "What is the chance of at least one six?"  Ignoring the intersection makes the probability too large.

•   Realize how the Venn diagram can help verify results.  The inclusion/exclusion formula is a good example where a Venn diagram can help with the proof or development.  Other examples are De Morgan's Laws.  For Bayes' formula, on Day 7, the Venn diagram will also be useful.

Reading:    Class Notes.

Day 7

Activity:    Constructing probability trees.  Demonstrating Bayes' formula with the rare disease problem.  Homework 2 due today or Friday.

Consider a card trick where two cards are drawn sequentially off the top of a shuffled deck.  (There are 52 cards in a deck, 4 suits of 13 ranks.)  We want to calculate the chance of getting hearts on the first draw, on the second draw, and on both draws.  We will organize our thoughts into a tree diagram, much like water flowing in a pipe.  On each branch, the label will be the probability of taking that branch; thus at each node, the exiting probabilities (conditional probabilities) add to one.

On the far right of the tree, we will have the intersection events.  Their probability is found by multiplying.  At each branch, we have a conditional probability, which is a probability calculation based on what has already occurred to get to where we are in the tree.  The notation we will use is P(A|B) which reads "the probability of event A given that event B has occurred".  The multiplication formula we use in trees is P(AB) = P(A) P(B|A).  Note that this can also be written with the roles of A and B reversed: P(AB) = P(B) P(A|B).  Setting these two expressions for P(AB) equal and dividing by P(A) gives us the basis of Bayes' formula:  P(B|A) = P(B) P(A|B) / P(A).  I will demonstrate how to use these conditional formulas using the rare disease problem.

Another notion that is extremely important to the models we will create next chapter is independence.  There are two definitions you will encounter.  They are equivalent, and each follows from the other.  They are P(AB) = P(A) P(B) and P(A) = P(A|B) (the second requires P(B) > 0 so that the conditional probability is defined).  The first definition is the one we will use most often, as it says that we can multiply the two probabilities of independent events together to get the joint probability.  We saw this already when we looked at the sample space for rolling two dice.

Calculate the chances of:

1)     Drawing a heart on the first card.

2)    Drawing a heart on the second card.

3)    Drawing at least one heart.

4)    Drawing two hearts.

5)    Drawing a heart on the second draw given that a heart was drawn first.

6)    Drawing a heart on the first draw given that a heart was drawn second.
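
The two-card tree can be written out directly.  This is a minimal sketch in Python (an assumption on my part, not a course tool): branches are labeled H for a heart and N for a non-heart, the leaves are found by multiplying along each path, and the last line reverses the tree to get the chance of a heart first given a heart second.  All names are illustrative.

```python
from fractions import Fraction

# First-draw branches: heart (H) or not a heart (N).
p_first = {"H": Fraction(13, 52), "N": Fraction(39, 52)}

# Second-draw branches, conditional on the first draw (51 cards remain).
p_second_given_first = {
    "H": {"H": Fraction(12, 51), "N": Fraction(39, 51)},
    "N": {"H": Fraction(13, 51), "N": Fraction(38, 51)},
}

# Leaves of the tree: multiply along each path, P(AB) = P(A) P(B|A).
joint = {
    (first, second): p_first[first] * p_second_given_first[first][second]
    for first in ("H", "N")
    for second in ("H", "N")
}

# Branch probabilities at each node add to one, and so do the leaves.
assert sum(p_first.values()) == 1
assert all(sum(b.values()) == 1 for b in p_second_given_first.values())
assert sum(joint.values()) == 1

p_heart_first = p_first["H"]
p_heart_second = joint[("H", "H")] + joint[("N", "H")]
p_at_least_one = 1 - joint[("N", "N")]
p_two_hearts = joint[("H", "H")]
p_second_given_first_heart = p_second_given_first["H"]["H"]
# Reversing the tree: P(heart first | heart second) = P(both) / P(heart second).
p_first_given_second_heart = p_two_hearts / p_heart_second

print(p_heart_first)                # 1/4
print(p_heart_second)               # 1/4
print(p_at_least_one)               # 15/34
print(p_two_hearts)                 # 1/17
print(p_second_given_first_heart)   # 4/17
print(p_first_given_second_heart)   # 4/17
```

Note that the chance of a heart on the second card is 1/4, the same as on the first, even though the individual branches use 51 in the denominator.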

Now we will do this work for the rare disease problem (Problem 2.128).
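
To see the shape of the calculation before class, here is a minimal sketch in Python.  The three input numbers are hypothetical and are NOT the values from Problem 2.128: I assume a 1% prevalence, a test that is positive for 95% of people with the disease, and a test that is falsely positive for 5% of people without it.

```python
from fractions import Fraction

# Hypothetical inputs, NOT the numbers from Problem 2.128.
p_disease = Fraction(1, 100)              # P(D): prevalence
p_pos_given_disease = Fraction(95, 100)   # P(+ | D)
p_pos_given_healthy = Fraction(5, 100)    # P(+ | not D)

# Leaves of the tree that end in a positive test: multiply along each path.
p_disease_and_pos = p_disease * p_pos_given_disease
p_healthy_and_pos = (1 - p_disease) * p_pos_given_healthy

# Total probability of a positive test: add the two positive leaves.
p_pos = p_disease_and_pos + p_healthy_and_pos

# Reverse the tree: P(D | +) = P(D and +) / P(+).
p_disease_given_pos = p_disease_and_pos / p_pos

print(p_pos)                                # 59/1000
print(p_disease_given_pos)                  # 19/118
print(f"{float(p_disease_given_pos):.3f}")  # 0.161
```

With these made-up inputs, only about 16% of positive tests belong to people who have the disease, because the healthy group is so much larger than the diseased group.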

Goals:     Be able to express probability calculations as tree diagrams.  Be able to reverse the events in a probability tree, which is what Bayes' formula is about.

Skills:

•               Know how to use the multiplication rule in a probability tree.  Each branch of a probability tree is labeled with the conditional probability for that branch.  To calculate the joint probability of a series of branches, we multiply the conditional probabilities together.  Note that at each branching in a tree, the (conditional) probabilities add to one, and that overall, the joint probabilities add to one.

•               Recognize conditional probability in English statements.  Sometimes the key word is "given".  Other times the conditional phrase has "if".  But sometimes the fact that a statement is conditional is disguised.  For example:  "Assuming John buys the insurance, what is the chance he will come out ahead?" is equivalent to "If John buys insurance, what is the chance he will come out ahead?"

•               Be able to use the conditional probability formula to reverse the events in a probability tree.  The key here is the symmetry of the events in the conditional probability formula.  We exchange the roles of A and B, and tie them together with our formula for P(AB).

•               Know the definition of independence.  Independence is a fact about probability, not about sets.  Contrast this to "disjoint" which is a property of sets.  In particular, two events that each have positive probability cannot be both independent and disjoint: disjoint events have P(AB) = 0, while independence would require P(AB) = P(A) P(B) > 0.  Independence is important later as an assumption as it allows us to multiply individual probabilities together without having to worry about conditional probability.
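
Both definitions of independence, and the contrast with disjoint events, can be checked on the two-dice sample space.  This is a minimal sketch in Python; the helper names prob, cond_prob and independent are my own and are not from the text.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) by counting outcomes."""
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(a | b) = P(ab) / P(b); requires P(b) > 0."""
    return prob(a & b) / prob(b)

def independent(a, b):
    """First definition: P(AB) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

first_is_six = {o for o in sample_space if o[0] == 6}
second_is_six = {o for o in sample_space if o[1] == 6}
sum_is_seven = {o for o in sample_space if sum(o) == 7}
sum_is_eight = {o for o in sample_space if sum(o) == 8}

# Independent events: both definitions agree.
print(independent(first_is_six, second_is_six))                       # True
print(cond_prob(first_is_six, second_is_six) == prob(first_is_six))   # True

# Disjoint events with positive probability are not independent.
print(sum_is_seven & sum_is_eight == set())                           # True
print(independent(sum_is_seven, sum_is_eight))                        # False
```

The last two lines show the point of the skill above: a sum of seven and a sum of eight cannot happen together, so knowing one occurred changes the chance of the other to zero.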

Reading:    Section 3.7.

Day 8

Activity:    Continue coins and dice.  Introduce Random Variables.

I like to introduce discrete probability distributions by continuing the probability discussions we've had already with dice and coins.  Essentially we can further partition the sample space by assigning a numerical value to each outcome, and then counting how many outcomes are in the events with equal assigned numbers.  A collection of these assigned numbers and their associated probabilities is what we call a probability distribution, or a probability mass function (pmf).  If we accumulate these probabilities, we get the cumulative distribution function or cdf.

Answer the following questions, using complete distributions:

1)     What is the chance of getting a sum of 8 on two dice?

2)    What is the chance of getting a sum of 10 on two dice?

3)    What is the chance of getting a sum of x on two dice, where x is between 1 and 13?

4)    What is the chance of getting 10 heads on 20 flips of a fair coin?

5)    How can you get the TI-83 to graph a probability histogram?
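
Question 4 comes down to counting: of the 2^20 equally likely head/tail sequences, how many have exactly 10 heads?  The following is a minimal sketch in Python, assuming version 3.8 or later for math.comb; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

flips = 20
heads = 10

favorable = comb(flips, heads)   # sequences with exactly 10 heads
total = 2 ** flips               # all equally likely sequences

p_ten_heads = Fraction(favorable, total)

print(favorable)                     # 184756
print(total)                         # 1048576
print(f"{float(p_ten_heads):.4f}")   # 0.1762
```

Even though 10 heads is the single most likely count, it happens less than 18% of the time.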

Derive a pmf and its cdf.  Use the sum on two dice as an example.  Know how to work back and forth from one to the other.
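
As a check on the hand derivation, here is a minimal sketch in Python that builds the pmf of the sum on two dice by counting, accumulates it into the cdf, and then recovers the pmf from the cdf by differencing.  The names counts, pmf, cdf and recovered are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Count how many of the 36 outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
values = sorted(counts)   # 2, 3, ..., 12

# pmf: P(X = x) for each possible sum x.
pmf = {x: Fraction(counts[x], 36) for x in values}

# cdf: P(X <= x), the running total of the pmf.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

# Going back: each pmf value is the jump in the cdf at that point.
recovered = {}
previous = Fraction(0)
for x in values:
    recovered[x] = cdf[x] - previous
    previous = cdf[x]
assert recovered == pmf

print(pmf[8])                     # 5/36
print(pmf[10])                    # 1/12
print(pmf.get(13, Fraction(0)))   # 0, since a sum of 13 is impossible
print(cdf[12])                    # 1
```

Sums of 1 and 13 have probability zero, which is why the question about x between 1 and 13 needs the complete distribution and not just the obvious values.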

We can use the Frequency option of the TI-83 to find μ and σ for discrete distributions.  This option allows different weights for the data entered.  When we used STAT CALC 1-Var Stats L1 before, each value in L1 represented one data value.  In probability distributions, we don't have actual data; instead we have the possible values of the Random Variable and their associated probabilities.  To find the mean and standard deviation of the distribution, then, we let the probabilities be the weights.  We will practice with the dice rolling results we created today.
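
The weighted calculation itself is short.  This is a minimal sketch in Python of the same idea, using the sum on two dice: one list plays the role of the values and a second list plays the role of the weights.  It illustrates the arithmetic only and is not a description of the calculator's internals; the list names are my own.

```python
from math import sqrt

# Possible values of the random variable (the role of the first list).
values = list(range(2, 13))

# Number of ways to roll each sum, converted to probabilities
# (the role of the second, weighting list).
ways = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
probs = [w / 36 for w in ways]

# Mean: each value weighted by its probability.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: squared distance from the mean, weighted the same way.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
sd = sqrt(variance)

print(round(mean, 4), round(sd, 4))   # 7.0 2.4152
```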

To summarize pmf's, we will typically calculate two values, the Expected Value and the Variance.  You should be familiar with Expected Value already; your GPA is an example.  I will go over a few EV calculations for the dice problems, then move on to other functions, like x^2.  Finally, we will see that the Variance is a form of Expected Value.
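
To make "the Variance is a form of Expected Value" concrete, here is a minimal sketch in Python using the sum on two dice.  The helper expected_value is a hypothetical name of my own; it computes the sum of g(x) P(X = x) for any function g.  The last line uses the standard shortcut Var(X) = E[X^2] - (E[X])^2, which is a known identity and not something stated in these notes.

```python
from fractions import Fraction
from itertools import product

# pmf of the sum on two dice, built by counting the 36 outcomes.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)

def expected_value(pmf, g=lambda x: x):
    """E[g(X)]: sum of g(x) * P(X = x) over all possible values x."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expected_value(pmf)                                 # E[X]
ex2 = expected_value(pmf, lambda x: x ** 2)              # E[X^2]
var_def = expected_value(pmf, lambda x: (x - mu) ** 2)   # E[(X - mu)^2]
var_shortcut = ex2 - mu ** 2                             # E[X^2] - mu^2

print(mu)             # 7
print(ex2)            # 329/6
print(var_def)        # 35/6
print(var_shortcut)   # 35/6
```

The variance is just the expected value of the function (x - μ)^2, so the same weighted sum handles both summaries.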