Lecture 12 - 188 200 discrete mathematics and linear algebra
Bayesian spam filtering for multiple keywords
Definition: If A and B are events in a sample space S, then A and B are independent iff P(A ∩ B) = P(A) · P(B).
What we are saying is that the outcome of A doesn’t depend in any way on the outcome of B, and conversely.
Example 1: Suppose a coin is tossed twice.
The first toss could turn up H or T, and would not depend on the outcome of the second toss.
The second toss could also turn up H or T, and would not depend on the outcome of the first toss.
Does knowing that the coin comes up tails on the first toss help you predict the second toss?
S = {HH, HT, TH, TT}
Let A = “Coin was tail on the first toss” = {TH, TT}
Let B = “Coin was tail on the second toss” = {HT, TT}
P(B) = 1/2, and P(B | A) = P(A ∩ B) / P(A) = (1/4) / (1/2) = 1/2 = P(B).
Conclusion: Knowing the outcome of the first toss does not help you guess the outcome of the second toss.
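Example 1 can be checked by listing the sample space. The following is a minimal Python sketch, assuming all four outcomes are equally likely; the helper name prob is illustrative and not part of the lecture.

```python
from fractions import Fraction

# Sample space for two tosses of a fair coin; all outcomes equally likely.
S = {"HH", "HT", "TH", "TT"}
A = {s for s in S if s[0] == "T"}  # tail on the first toss
B = {s for s in S if s[1] == "T"}  # tail on the second toss

def prob(event):
    """Probability of an event under the uniform distribution on S."""
    return Fraction(len(event), len(S))

# Conditional probability P(B | A) = P(A ∩ B) / P(A).
p_b_given_a = prob(A & B) / prob(A)

print(prob(B), p_b_given_a)              # prints: 1/2 1/2
print(prob(A & B) == prob(A) * prob(B))  # prints: True
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.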
It would be natural to think that mutually disjoint events would be independent. In fact, almost the opposite is true: disjoint events with non-zero probabilities are dependent.
Example 2: Let A and B be events on S, and suppose A ∩ B = ∅, P(A) ≠ 0 and P(B) ≠ 0. Show that P(A ∩ B) ≠ P(A) · P(B).
Because A ∩ B = ∅, P(A ∩ B) = 0.
But P(A) · P(B) ≠ 0 because P(A) ≠ 0 and P(B) ≠ 0.
Example 3: Suppose that A is the event that a randomly generated bit string of length four begins with a 1, and B is the event that this bit string contains an even number of 1s. Are A and B independent if all 4-bit strings are equally likely to occur?
A ={1111, 1110, 1101, 1011, 1100, 1010, 1001, 1000}
B ={0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111}
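P(A) = 8/16 = 1/2 and P(B) = 8/16 = 1/2.
A ∩ B = {1111, 1100, 1010, 1001}, so P(A ∩ B) = 4/16 = 1/4.
Since P(A ∩ B) = 1/4 = (1/2) · (1/2) = P(A) · P(B), A and B are independent events.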
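The sets in Example 3 can also be generated rather than written out by hand. This is a small Python sketch under the example's assumption that all sixteen 4-bit strings are equally likely; the names strings and prob are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 bit strings of length four, each equally likely.
strings = ["".join(bits) for bits in product("01", repeat=4)]

A = {s for s in strings if s[0] == "1"}            # begins with a 1
B = {s for s in strings if s.count("1") % 2 == 0}  # even number of 1s

def prob(event):
    """Probability of an event under the uniform distribution."""
    return Fraction(len(event), len(strings))

print(sorted(A & B))                     # the four strings in both events
print(prob(A), prob(B), prob(A & B))     # prints: 1/2 1/2 1/4
print(prob(A & B) == prob(A) * prob(B))  # prints: True
```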
Example 4: Assume that each of the four ways that a family can have two children is equally likely. Are the events E, that a family with two children has two boys, and F, that a family with two children has at least one boy, independent?
E = {BB}, so P(E) = 1/4. F = {BB, BG, GB}, so P(F) = 3/4.
E ∩ F = {BB}, so P(E ∩ F) = 1/4, while P(E) · P(F) = (1/4) · (3/4) = 3/16.
Since 1/4 ≠ 3/16, E and F are not independent.
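A short Python check of Example 4, assuming the four birth orders BB, BG, GB, GG are equally likely as stated in the example. The variable names are illustrative only.

```python
from fractions import Fraction

# The four equally likely two-child families (B = boy, G = girl).
families = ["BB", "BG", "GB", "GG"]

E = {f for f in families if f == "BB"}  # two boys
F = {f for f in families if "B" in f}   # at least one boy

def prob(event):
    """Probability of an event under the uniform distribution."""
    return Fraction(len(event), len(families))

print(prob(E & F))                       # prints: 1/4
print(prob(E) * prob(F))                 # prints: 3/16
print(prob(E & F) == prob(E) * prob(F))  # prints: False
```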
Product Rule to Determine Probability of Combinations of Events.
If events are independent, we can use the product rule to determine the probabilities of combinations of those events.
Example 5: What is the probability of flipping heads 4 times in a row using a fair coin?
P(HHHH) = P(H) · P(H) · P(H) · P(H) = (1/2)^4 = 1/16, because the four flips are independent.
Example 6: What is the probability of rolling the same number 3 times in a row using an unbiased 6-sided die?
First roll agrees with itself with probability 1.
Second roll agrees with first with probability 1/6.
Third roll agrees with first two with probability 1/6.
So the probability of rolling the same number 3 times is 1 · (1/6) · (1/6) = 1/36.
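The product-rule answers in Examples 5 and 6 can be cross-checked against a brute-force count of outcomes. This is a minimal Python sketch, assuming a fair coin and an unbiased die; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Example 5: four independent fair flips, product rule.
p_four_heads = Fraction(1, 2) ** 4
print(p_four_heads)  # prints: 1/16

# Example 6: count the outcomes of three rolls in which all rolls agree.
rolls = list(product(range(1, 7), repeat=3))  # 6^3 = 216 outcomes
same = [r for r in rolls if r[0] == r[1] == r[2]]
p_same_counted = Fraction(len(same), len(rolls))

# Product rule: the first roll is free, the next two must each match it.
p_same_product = 1 * Fraction(1, 6) * Fraction(1, 6)

print(p_same_counted, p_same_product)  # prints: 1/36 1/36
```

Note that (1/6)^3 = 1/216 is the probability of one specific number, such as 4, appearing three times; there are six such numbers, which gives 6/216 = 1/36.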
Problem: We want to create a spam filter using keywords.
Specifically, we want to develop a Bayesian filter for two keywords that tells us P[A | (B1 ∩ B2)]
A = “an email is spam”
B1 = “an email contains the first questionable keyword”
B2 = “an email contains the second questionable keyword”
Assumptions:
1. Events B1 and B2 are independent.
2. The events B1|A and B2|A are independent.
3. P(A) = P(Ā) = 0.5
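By Bayes’ theorem,
P[A | (B1 ∩ B2)] = P[(B1 ∩ B2) | A] P(A) / ( P[(B1 ∩ B2) | A] P(A) + P[(B1 ∩ B2) | Ā] P(Ā) )
= P[(B1 ∩ B2) | A] / ( P[(B1 ∩ B2) | A] + P[(B1 ∩ B2) | Ā] ), because P(A) = P(Ā) = 0.5
= P(B1|A) P(B2|A) / ( P(B1|A) P(B2|A) + P(B1|Ā) P(B2|Ā) ), because B1 and B2, and B1|A and B2|A are independent.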
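The two-keyword formula can be sketched in Python as follows. The function names, parameter names, and the numeric inputs are hypothetical and chosen only for illustration. The sketch assumes that the keywords are conditionally independent given spam and given not-spam, and it compares the general Bayes form (with an explicit prior) to the simplified form that results when P(A) = P(Ā) = 0.5.

```python
from fractions import Fraction

def posterior_with_prior(p_b1_a, p_b2_a, p_b1_not_a, p_b2_not_a, prior):
    """P[A | (B1 ∩ B2)] by Bayes' theorem with an explicit prior P(A).

    Assumes P[(B1 ∩ B2) | A] = P(B1|A) P(B2|A), and likewise given not-A.
    Raises ZeroDivisionError if the denominator is zero.
    """
    numerator = p_b1_a * p_b2_a * prior
    return numerator / (numerator + p_b1_not_a * p_b2_not_a * (1 - prior))

def posterior_equal_priors(p_b1_a, p_b2_a, p_b1_not_a, p_b2_not_a):
    """Simplified form: the priors cancel when P(A) = P(not A) = 1/2."""
    numerator = p_b1_a * p_b2_a
    return numerator / (numerator + p_b1_not_a * p_b2_not_a)

# Hypothetical conditional probabilities, for illustration only.
args = (Fraction(3, 10), Fraction(1, 5), Fraction(1, 20), Fraction(1, 10))

full = posterior_with_prior(*args, prior=Fraction(1, 2))
simple = posterior_equal_priors(*args)
print(full, simple)  # prints: 12/13 12/13
```

With any prior other than 1/2 the two functions generally disagree, which is why assumption 3 matters for the simplified formula.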
Example 7: Suppose that we train a Bayesian spam filter on a set of 2000 spam emails and 1000 emails that are not spam. The word “viagra” appears in 400 spam emails and 60 good emails, and the word “discount” appears in 200 spam emails and 25 good emails. Estimate the probability that a message containing the words “viagra” and “discount” is spam. Will we reject this message if our spam threshold is set at 0.9?
B1 = email contains the word “viagra”
B2 = email contains the word “discount”
P(B1|A) = 400/2000 = 0.2
P(B1|Ā) = 60/1000 = 0.06
P(B2|A) = 200/2000 = 0.1
P(B2|Ā) = 25/1000 = 0.025
P[A | (B1 ∩ B2)] = 0.2(0.1) / ( 0.2(0.1) + 0.06(0.025) ) = 0.02 / 0.0215 ≈ 0.9302
Conclusion: Since the probability that our email is spam given that it contains the words “viagra” and “discount” is approximately 0.9302 > 0.9, we will flag this email as spam.
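The arithmetic in Example 7 can be reproduced from the raw training counts. This is a minimal Python sketch, assuming equal priors P(A) = P(Ā) = 0.5 and conditional independence of the two keywords, as in the lecture's assumptions; the variable names are illustrative.

```python
from fractions import Fraction

# Training counts from Example 7.
spam_total, good_total = 2000, 1000
viagra_spam, viagra_good = 400, 60
discount_spam, discount_good = 200, 25

# Estimated conditional probabilities of each keyword.
p_b1_spam = Fraction(viagra_spam, spam_total)      # 0.2
p_b1_good = Fraction(viagra_good, good_total)      # 0.06
p_b2_spam = Fraction(discount_spam, spam_total)    # 0.1
p_b2_good = Fraction(discount_good, good_total)    # 0.025

# Two-keyword posterior with equal priors.
numerator = p_b1_spam * p_b2_spam
posterior = numerator / (numerator + p_b1_good * p_b2_good)

threshold = Fraction(9, 10)
print(posterior, round(float(posterior), 4))  # prints: 40/43 0.9302
print(posterior > threshold)                  # prints: True
```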
What about a formula for more than two keywords?
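One common generalization, assuming equal priors and that all n keywords are conditionally independent given spam and given not-spam, is
P[A | (B1 ∩ ... ∩ Bn)] = P(B1|A) ··· P(Bn|A) / ( P(B1|A) ··· P(Bn|A) + P(B1|Ā) ··· P(Bn|Ā) ).
The Python sketch below implements this under those assumptions. The function name spam_probability is hypothetical, and the third keyword's probabilities are made up for illustration.

```python
from fractions import Fraction
from math import prod

def spam_probability(p_given_spam, p_given_good):
    """Posterior spam probability for n keywords with equal priors.

    p_given_spam[i] is P(Bi | A) and p_given_good[i] is P(Bi | not A).
    Assumes the keywords are conditionally independent in both classes.
    Raises ZeroDivisionError if both products are zero.
    """
    if len(p_given_spam) != len(p_given_good):
        raise ValueError("need one pair of probabilities per keyword")
    numerator = prod(p_given_spam)
    return numerator / (numerator + prod(p_given_good))

# The two keywords from Example 7 reproduce the two-keyword result.
two = spam_probability(
    [Fraction(1, 5), Fraction(1, 10)],
    [Fraction(3, 50), Fraction(1, 40)],
)
print(two)  # prints: 40/43

# Adding a hypothetical third keyword (values are made up).
three = spam_probability(
    [Fraction(1, 5), Fraction(1, 10), Fraction(1, 2)],
    [Fraction(3, 50), Fraction(1, 40), Fraction(1, 10)],
)
print(three)  # prints: 200/203
```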
Advantage:
• It can be trained on a per-user basis. A scientist who is researching Viagra won’t have emails containing the word “Viagra” flagged as spam, because “Viagra” will show up often in the scientist’s good emails.
Disadvantages:
• It assumes that keywords are independent.
• It can’t filter images.