- Familiarize yourself with the codebook for the Email dataset below.
- Import/load the Email dataset.
- In this assignment we will focus on predicting whether an email is spam or not. Determine the number of emails that make up this data set and the proportion of the emails that were spam. Answer Question 1 below.
- We will start with a simple spam filter that uses a single predictor, attachment (the attach variable in the codebook), to classify a message as spam or not. Determine what proportion of emails with an attachment are spam and what proportion of emails without an attachment are spam. Answer Question 2 below.
- Construct a chi-square test to determine whether there is an association between spam email and whether an email contained an attachment. Answer Question 3 below.
- Now, fit an appropriate regression model between spam and attachment. Answer Question 4 below.
- Find the odds ratios of model coefficients. Answer Question 5 below.
- Construct an appropriate model to determine whether the association between an email being spam and having an attachment remains significant after controlling for the number of characters. Answer Question 6 below.
- Find the odds ratios of model coefficients. Answer Question 7 below.
- Construct an appropriate model with our response variable (spam) and explanatory variables (attachment, number of characters, and whether there is an exclamation point in the subject). Answer Question 8 below.
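The tabulation, chi-square, and odds-ratio steps above can be sketched using only the Python standard library. The counts below are hypothetical placeholders, not values from the Email dataset, and the function names are illustrative. The test shown is the Pearson chi-square without continuity correction; some statistical software applies Yates' correction to 2x2 tables by default, so its output may differ slightly.

```python
import math

def spam_rate(spam_count, not_spam_count):
    """Proportion of emails in a group that are spam."""
    return spam_count / (spam_count + not_spam_count)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

                     spam   not spam
        attachment     a       b
        no attachment  c       d

    Returns (statistic, p_value) on 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def odds_ratio(a, b, c, d):
    """Odds of spam with an attachment divided by odds of spam without one."""
    return (a / b) / (c / d)

# Hypothetical counts for illustration only (NOT from the Email dataset).
a, b = 30, 70    # with attachment: spam, not spam
c, d = 90, 810   # without attachment: spam, not spam

rate_with = spam_rate(a, b)
rate_without = spam_rate(c, d)
stat, p_value = chi_square_2x2(a, b, c, d)
or_attach = odds_ratio(a, b, c, d)

print(f"spam rate with attachment:    {rate_with:.1%}")
print(f"spam rate without attachment: {rate_without:.1%}")
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3g}")
print(f"odds ratio (attachment vs. none) = {or_attach:.2f}")
```

For a single binary predictor, the odds ratio computed from the 2x2 table matches the exponentiated slope of a logistic regression of spam on attachment, which ties Question 3 to Questions 4 and 5.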
Question 1: How many emails make up this data set? What percent of the emails were spam?
Question 2: ____% of emails without an attachment were classified as spam, whereas _____ % of emails with an attachment were classified as spam.
Question 3: State the corresponding test statistic and p-value to test this association. What is your conclusion?
Question 4: What type of regression model is appropriate? Why?
Question 5: Interpret the odds ratio to compare emails with attachments to emails without attachments.
Question 6: Is there a significant association between whether an email is spam and whether the email has an attachment after controlling for the number of characters in the email message? What is the associated p-value and conclusion?
Question 7: What is the correct interpretation of the odds ratio corresponding to characters?
Question 8: Controlling for all other predictor variables, is whether a message is spam independently associated with whether there is an exclamation point in the subject? Why?
Submit your answers here.
CODEBOOK: E-mail Data
Today we will be working with a corpus of emails received by a single Gmail account over the first three months of 2012. Just like any other email address, this account sent and received regular emails and also received a large amount of spam (unsolicited bulk email). We will use what we have learned about logistic regression models to see whether we can build a model that predicts whether or not a message is spam based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same: binary classification based on a set of predictors.
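As a reminder of how a logistic regression turns predictors into a spam probability, here is a minimal sketch. The intercept and coefficients are made up for illustration and are not estimates from the Email data; the function names are likewise hypothetical.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Inverse-logit of the linear predictor: P(spam = 1 | x)."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1 / (1 + math.exp(-log_odds))

def odds_ratios(coefs):
    """Exponentiate each coefficient to get its odds ratio."""
    return [math.exp(b) for b in coefs]

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Made-up coefficients for two predictors: attach (0/1) and
# num_char (thousands of characters). NOT fitted to the real data.
intercept = -2.0
coefs = [0.5, -0.05]

# Same message length (10 thousand characters), with and without attachment.
p_no_attach = predicted_probability(intercept, coefs, [0, 10])
p_attach = predicted_probability(intercept, coefs, [1, 10])

or_attach, or_num_char = odds_ratios(coefs)
print(f"P(spam | no attachment) = {p_no_attach:.3f}")
print(f"P(spam | attachment)    = {p_attach:.3f}")
print(f"odds ratio for attach   = {or_attach:.3f}")
print(f"odds ratio for num_char = {or_num_char:.3f}")
```

Each odds ratio is the multiplicative change in the odds of spam for a one-unit increase in that predictor, holding the others fixed. Because num_char is recorded in thousands, its odds ratio refers to an additional 1,000 characters.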
The description of the data is as follows:
- spam Indicator for whether the email was spam.
- tomultiple Indicator for whether the email was addressed to more than one recipient.
- from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc Indicator for whether anyone was CCed.
- sent_email Indicator for whether the sender had been sent an email in the last 30 days.
- image Indicates whether any images were attached.
- attach Indicates whether any files were attached.
- dollar Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
- winner Indicates whether “winner” appeared in the email.
- inherent Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
- password Indicates whether “password” appeared in the email.
- num_char The number of characters in the email, in thousands.
- line_breaks The number of line breaks in the email (does not count text wrapping).
- format Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
- re_subj Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
- exclaim_subj Indicates whether there was an exclamation point in the subject.
- urgent_subj Indicates whether the word “urgent” was in the email subject.
- exclaim_mess The number of exclamation points in the email message.
- number Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
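One way to load the data and answer the basic counting questions is sketched below with the standard library. It assumes the file is a CSV whose column names match the codebook and whose indicators are coded 0/1; both are assumptions to check against the actual file. The four rows are made up so the example runs on its own, and `email.csv` is a hypothetical filename.

```python
import csv
import io

# Made-up rows standing in for the real file. With the real data, replace
# `sample` with: open("email.csv", newline="")  (hypothetical filename).
sample = io.StringIO(
    "spam,attach,num_char,exclaim_subj\n"
    "0,0,11.37,0\n"
    "1,1,0.85,1\n"
    "0,1,7.42,0\n"
    "1,0,1.23,0\n"
)

def load_emails(handle):
    """Read the columns used in this assignment and convert them to numbers."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "spam": int(rec["spam"]),                  # assumed 0/1 coding
            "attach": int(rec["attach"]),              # assumed 0/1 coding
            "num_char": float(rec["num_char"]),        # thousands of characters
            "exclaim_subj": int(rec["exclaim_subj"]),  # assumed 0/1 coding
        })
    return rows

emails = load_emails(sample)
n_emails = len(emails)
pct_spam = 100 * sum(e["spam"] for e in emails) / n_emails

print(f"{n_emails} emails, {pct_spam:.1f}% spam")
```

If the real file codes indicators as text (for example "yes"/"no") rather than 0/1, the `int(...)` conversions would need to be replaced with an explicit mapping.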