Project Component A


The first step of your project is to find a data set that interests you. Some students may gravitate towards political sentiment, mental health, or science. It is up to you to decide which data set (among the ones we have provided) will spark the most interesting questions for you. What makes a data set interesting to you may be the general topics it covers or specific questions within it.

Next, try to narrow down the particular parts of the data set you find interesting. Your goal is to brainstorm some possible research questions. One of the simplest research questions that can be asked is whether two constructs are associated.

For example, in the GSS, participants were asked about their happiness in marriage (rated Very Happy, Somewhat Happy, or Not Too Happy) and also about their self-assessment of their own health (Excellent, Good, Fair, or Poor). One question might be: Is there a relationship between health and relationship happiness?

As another example, in the NESARC, participants were asked whether their blood/natural father was depressed (No, Yes, or Unknown), whether their blood/natural mother was depressed (No, Yes, or Unknown), and also a multitude of questions about their own mental health (a list of questions that reveal some form of depression). One question could be: Does parental depression predict an individual’s level of depression?

Remember that you can tweak your question as we move forward, but it will benefit you greatly to spend time now deciding a direction for your project.

The requirement of this assignment is to select a data set (and explain why it interests you), discuss potential research questions, and copy and paste the relevant components of the codebook into a document. (This will help you stay organized in the coming weeks; you will likely need to update it as your project and research question evolve.) If applicable, let us know whether you are having trouble picking a topic or have any other concerns about how to move forward.

Sample Submission:

After looking through the codebook for the NESARC study, I have decided that I am particularly interested in studying family background and depression. Examining mental health and trying to understand contributing factors to depression is something that I explored during my summer internship. While the internship focused on activity level and depression, I was always interested in how a parental figure’s own depression (either biologically or through interactions) may contribute to their child’s depression in adulthood.

My personal codebook includes all questions in the NESARC study that give me information about mother and father depression and also includes some signs of an individual’s own depression:

4126 1. Yes
32192 2. No
6775 9. Unknown


7134 1. Yes
31448 2. No
4511 9. Unknown

2399 1. Yes
39510 2. No
1244 9. Unknown

1292 1. Yes
1013 2. No
34 9. Unknown
40754 BL – never had 2+ years of low mood

805 1. Yes
1505 2. No
29 9. Unknown
40754 BL – never had 2+ years of low mood

1613 1. Yes
701 2. No
25 9. Unknown
40754 BL – never had 2+ years of low mood

1177 1. Yes
1134 2. No
28 9. Unknown
40754 BL – never had 2+ years of low mood

Please note that your personal codebook can contain as few as two codebook items or span multiple pages.

Project Component B
Directions: It is important that you examine the existing literature on your topic or association of interest. This will allow you to understand what researchers have already studied on your topic. Your ultimate objective is to go beyond what is already known through your project in this course. To achieve this, you must familiarize yourself with what researchers have studied.

The requirement of this assignment is to: Describe the association or topic that you have decided to examine and the key words you found helpful in your search. List at least 5 of the most appropriate references that you have found and read. Describe findings and interesting themes that you have uncovered and list a tentative research question or two that you hope to pursue. Be brief and use bullets. For this assignment you must use EndNote or an alternative such as Zotero or EasyBib; Word also has a built-in citation/bibliography formatter. Use APA citations and formatting for your submission.

Sample Submission:

Given the association that I have decided to examine, I use keywords such as nicotine dependence, tobacco dependence, and smoking. After reading through several titles and abstracts, I notice that there has been relatively little attention in the research literature to the association between smoking exposure and nicotine dependence. I expand my search a bit to include other substance use that provides relevant background as well.


Caraballo, R. S., Novak, S. P., & Asman, K. (2009). Linking quantity and frequency profiles of cigarette smoking to the presence of nicotine dependence symptoms among adolescent smokers: Findings from the 2004 National Youth Tobacco Survey. Nicotine & Tobacco Research, 11(1), 49-57.

Chen, K., & Kandel, D. (2002). Relationship between extent of cocaine use and dependence among adolescents and adults in the United States. Drug and Alcohol Dependence, 68, 65-85.

Chen, K., Kandel, D. B., & Davies, M. (1997). Relationships between frequency and quantity of marijuana use and last year proxy dependence among adolescents and adults in the United States. Drug and Alcohol Dependence, 46, 53-67.

Dierker, L., He, J. P., Kalaydjian, A., Swendsen, J., Degenhardt, L., Glantz, M., & Merikangas, K. (2008). The importance of timing of transitions for risk of regular smoking and nicotine dependence. Annals of Behavioral Medicine, 36(1), 87-92.

Dierker, L. C., Donny, E., Tiffany, S., Colby, S. M., Perrine, N., Clayton, R. R., & Tobacco Etiology Research Network. (2007). The association between cigarette smoking and DSM-IV nicotine dependence among first year college students. Drug and Alcohol Dependence, 86(2-3), 106-114.

Lessov-Schlaggar, C. N., Hops, H., Brigham, J., Hudmon, K. S., Andrews, J. A., Tildesley, E., . . . Swan, G. E. (2008). Adolescent smoking trajectories and nicotine dependence. Nicotine & Tobacco Research, 10(2), 341-351.

Riggs, N. R., Chou, C. P., Li, C. Y., & Pentz, M. A. (2007). Adolescent to emerging adulthood smoking trajectories: When do smoking trajectories diverge, and do they predict early adulthood nicotine dependence? Nicotine & Tobacco Research, 9(11), 1147-1154.

Van De Ven, M. O. M., Greenwood, P. A., Engels, R., Olsson, C. A., & Patton, G. C. (2010). Patterns of adolescent smoking and later nicotine dependence in young adults: A 10-year prospective study. Public Health, 124(2), 65-70.

Based on my reading of the above articles as well as others, I have noted a few common and interesting themes:

  1. While it is true that smoking exposure is a necessary requirement for nicotine dependence, frequency and quantity of smoking are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (this is true for other drugs as well)
  2. The association may differ based on ethnicity, age, and gender (although there is little work on this)
  3. One of the most potent risk factors consistently implicated in the etiology of smoking behavior and nicotine dependence is depression.

I have decided to further focus my question by examining whether the association between nicotine dependence and smoking differs based on whether a person is experiencing depression. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, etc.) and any other variables I may wish to consider.

Project Component C
Directions: You will continue to frame your topic and research question and outline your research intentions. In preparation for writing your research plan paper, you will work on organizing your research by outlining your ideas.

For this assignment you are expected to think about a title and the 3 sections that will become part of your paper: the Introduction, Methods, and Predicted Results/Implications. The details below explain what is expected in each of those sections. For your submission, you are expected only to make an outline of each of the sections below:

  • Title: Your title should summarize the main idea of your research question and should include the variables under investigation. The title should be fully explanatory when standing alone.
  • Introduction (Literature Review): Your introduction should describe your topic and rationale for your research question. Your objective is to convince the reader why they should care about the topic and frame how you are going to contribute to the literature. Your introduction should:
    • Include an opening statement about your main topic.
    • Describe what is known in the literature about your topic or association. (You should have at least 3 main points to discuss here)
    • Justify your research
    • Use specific examples and describe major findings.
    • Describe what is not known about your topic.
    • Summarize any gaps found in the literature and describe how your analyses contribute to filling this gap
    • Your research question
  • Methods: 
    • Name data set and at least 3 key features of the sample or way data were collected.
    • Describe your measures
      • What type of variables are you using? Explain if you are combining several measures.
  • Predicted Results/Implications
    • What do you expect that your research will reveal?
    • Why would these findings be important? Could anything actionable happen as a result of your findings?

Sample Submission:

  • Title: The Association Between Nicotine Dependence and Major Depression
  • Introduction
    • Major depression is a major risk factor for the development of nicotine dependence.
    • Depression has been shown to increase risk of later smoking. This temporal ordering suggests the possibility of a causal relationship.
    • Research shows major depression increases the probability and amount of smoking.
    • A substantial number of individuals reporting daily and/or heavy smoking do not meet criteria for nicotine dependence (Kandel & Chen, 2000).
    • It is unclear whether those with major depression experience nicotine dependence beyond what would be expected by smoking exposure alone.
    • Is there a relationship between major depression and nicotine dependence? Does the relationship between nicotine dependence and major depression exist above and beyond smoking exposure?
  • Methods
    • NESARC
      • The sample from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) represents the civilian, non-institutionalized adult population of the United States
      • The NESARC included oversampling of Blacks, Hispanics, and young adults aged 18 to 24 years.
      • Face-to-face computer assisted interviews were conducted in respondents’ homes following informed consent procedures.
      • The sample included 43,093 participants.
    • Measures
      • Nicotine Dependence: Using the tobacco module, the criteria for nicotine dependence were assessed.
      • Nicotine Use: Smoking frequency (“About how often did you usually smoke in the past year?”), coded dichotomously in terms of the presence or absence of daily smoking, and quantity (“On the days that you smoked in the last year, about how many cigarettes did you usually smoke?”). These questions were combined to estimate approximately how many cigarettes were smoked per month.
      • Major Depression: Lifetime major depression (i.e. episodes experienced in the past 12 months and prior to the past 12 months) was assessed using the NIAAA Alcohol Use Disorder and Associated Disabilities Interview Schedule – DSM-IV.
  • Predicted Results/Implications
    • It is understood that nicotine use predicts nicotine dependence.
    • It is not yet clear whether major depression will predict nicotine dependence after controlling for nicotine use.
    • If individuals with major depression are more sensitive to the development of nicotine dependence, they would represent an important population subgroup for targeted smoking intervention programs.

Please also review the Model Research Plan that you should be working on extensively outside of class.

Project Component D
Directions: The requirement of this assignment is to call in the appropriate data set, select the columns (i.e. variables) and possibly rows (i.e. observations) of interest, and run frequency distributions for your chosen variables. You should include:

  1. Your program.
  2. The output that displays three of your variables in frequency tables.
  3. A few sentences describing the results of your frequency tables.
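
As a rough illustration only: your program should be written in the software you are using for the course, but the idea of a frequency table can be sketched in a few lines of plain Python. The variable values, the coding (1 = Yes, 2 = No, 9 = Unknown), and the function name `frequency_table` are all invented for this sketch and do not come from any of the course data sets.

```python
from collections import Counter

# Hypothetical coded responses for one categorical variable
# (1 = Yes, 2 = No, 9 = Unknown). These are invented values, not survey data.
responses = [1, 2, 2, 9, 2, 1, 2, 2, 9, 2]


def frequency_table(values):
    """Return (value, count, percent) rows, sorted by the coded value."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, 100 * count / total)
            for value, count in sorted(counts.items())]


table = frequency_table(responses)
print("value count percent")
for value, count, percent in table:
    print(f"{value:>5} {count:>5} {percent:6.1f}%")
```

A table like this is what you would describe in your few sentences: which response is most common, which is least common, and how many observations fall into an "Unknown" or missing category.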

Project Component E
Data manage at least 3 of your variables and create new frequency tables of your data managed variables.
At a minimum, your data management should include managing missing data, but you should also consider including other data management decisions (such as collapsing responses, creating a quantitative score, etc). You may also want to consider renaming your variables to something more intuitive.
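
The sketch below shows, in plain Python and with invented values, what two common data management steps might look like: setting an "Unknown" code to missing and collapsing two items into one new variable. The variable names, the coding (1 = Yes, 2 = No, 9 = Unknown), and the rule used in `either_parent` are assumptions made for illustration; your own decisions may reasonably differ and should be made in your course software.

```python
from collections import Counter

# Hypothetical raw codes: 1 = Yes, 2 = No, 9 = Unknown (invented values).
father_depressed = [1, 2, 9, 2, 1]
mother_depressed = [2, 2, 1, 9, 1]


def set_missing(values, missing_codes=(9,)):
    """Replace codes that mean 'unknown' with None so they are treated as missing."""
    return [None if v in missing_codes else v for v in values]


def either_parent(father, mother):
    """Collapse two items: 1 if either parent is Yes, 0 if both are No, else missing."""
    if father == 1 or mother == 1:
        return 1
    if father == 2 and mother == 2:
        return 0
    return None  # not enough information to classify


father = set_missing(father_depressed)
mother = set_missing(mother_depressed)
parent_depression = [either_parent(f, m) for f, m in zip(father, mother)]

# New frequency table for the data managed variable.
print(Counter(parent_depression))
```

Note the design choice in `either_parent`: a "Yes" for one parent counts even when the other parent is unknown, while a "No" paired with an unknown stays missing. Decisions like this belong in your write-up.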

Project Component F
Directions: There are a variety of conventional ways to visualize data – tables, histograms, bar graphs, etc. Now that your data have been managed, it is time to graph your variables one at a time and examine both center and spread. Include your univariate graphs of your two main constructs (i.e. data managed variables). Write a few sentences describing what your graphs reveal in terms of shape, spread, and center (if variable is quantitative) and most/least likely categories if variable is categorical.
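
To make "center" and "spread" concrete, here is a minimal Python sketch using only the standard library. The variable `cigarettes_per_day` and its values are invented, and the text histogram is only a stand-in for the graph you would produce in your course software.

```python
import statistics

# Hypothetical quantitative variable (cigarettes per day); invented values.
cigarettes_per_day = [5, 10, 10, 15, 20, 20, 20, 40]

summary = {
    "mean": statistics.mean(cigarettes_per_day),      # center
    "median": statistics.median(cigarettes_per_day),  # center, robust to outliers
    "stdev": statistics.stdev(cigarettes_per_day),    # spread (sample standard deviation)
    "min": min(cigarettes_per_day),
    "max": max(cigarettes_per_day),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# A crude text histogram: one '#' per observation at each distinct value.
for value in sorted(set(cigarettes_per_day)):
    print(f"{value:>3} | {'#' * cigarettes_per_day.count(value)}")
```

In this invented example the single large value (40) stretches the right tail, which is the kind of feature of shape you would mention in your description.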

Project Component G

Directions: (1) Construct a graph that shows the association between your explanatory and response variables (bivariate graph). Write a few sentences describing what your graphs reveal in terms of the relationships among the variables. How does this correspond with your predictions? Does the graph reveal anything unexpected or interesting about your relationship of interest? (2) OPTIONAL: Construct a 2nd graph that shows the association between another explanatory variable and your response variable. Again, write a few sentences describing what your graph reveals in terms of the relationships among the variables.
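
When the response variable is categorical with two levels, a bivariate bar graph typically plots the proportion of "yes" responses within each category of the explanatory variable. The Python sketch below computes those proportions for invented data; the category labels, the records, and the function name `response_rate_by_group` are hypothetical and only illustrate what the graph summarizes.

```python
from collections import defaultdict

# Hypothetical (explanatory, response) pairs; invented for illustration.
# explanatory: cigarettes-per-day category; response: 1 = dependent, 0 = not.
records = [
    ("1-5", 0), ("1-5", 0), ("1-5", 1), ("1-5", 0),
    ("6-10", 1), ("6-10", 0), ("6-10", 1), ("6-10", 0),
    ("11+", 1), ("11+", 1), ("11+", 1), ("11+", 0),
]


def response_rate_by_group(pairs):
    """Return {group: proportion of records in that group with response == 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, response in pairs:
        totals[group] += 1
        positives[group] += response
    return {group: positives[group] / totals[group] for group in totals}


rates = response_rate_by_group(records)
for group, rate in rates.items():
    # Each '#' stands for 5 percentage points, as a text stand-in for a bar.
    print(f"{group:>5} | {'#' * round(rate * 20)} {rate:.0%}")
```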

Project Component H

Directions: Determine what the appropriate statistical test is for your main two variables of interest. Your options are:

  • Analysis of variance (ANOVA) assesses whether the means of two or more groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means (quantitative variables) of groups (categorical variables). The null hypothesis is that there is no difference in the mean of the quantitative variable across groups (categorical variable), while the alternative is that there is a difference.
  • A Chi-Square Test of Independence compares frequencies of one categorical variable for different values of a second categorical variable. The null hypothesis is that the relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. The alternate hypothesis is that the relative proportions of one variable are associated with the second variable. Note: although it is possible to run large Chi-Square tables (e.g. 5 x 5, 4 x 6, etc.), the test is really only interpretable when your response variable has 2 levels (see Graphing decisions flow chart in bivariate graphing chapter).
  • Correlation coefficient assesses the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. A correlation of -1 means there is a perfect, negative linear relationship between the two variables. In both cases, knowing the value of one variable, you can perfectly predict the value of the second. Note: Two 3+ level categorical variables can be used to generate a correlation coefficient if the categories are ordered and the average (i.e. mean) can be interpreted. The scatterplot, on the other hand, will not be useful. In general the scatterplot is not useful for discrete variables (i.e. those that take on a limited number of values). When we square r, it tells us the proportion of the variability in one variable that is described by variation in the second variable (aka RSquare or Coefficient of Determination).
  • Please note: If you have a quantitative explanatory variable and a categorical response, you will eventually be using logistic regression. For now, categorize your explanatory variable and use a chi-square test as explained above.
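
To show what the first two tests actually compute, here is a minimal Python sketch of the one-way ANOVA F statistic and the Pearson chi-square statistic, written from their textbook definitions. The data are invented and the function names `anova_f` and `chi_square` are hypothetical. The sketch does not compute p-values; your statistical software reports those along with the statistics.

```python
def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def chi_square(table):
    """Pearson chi-square: return (statistic, df) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Invented data: a quantitative response in three groups, and a 2 x 2 table.
f_stat, df1, df2 = anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
chi2, df = chi_square([[10, 20], [20, 10]])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
print(f"X2 = {chi2:.2f}, {df} df")
```

A large F means the group means differ by more than the within-group variation would suggest; a large chi-square means the observed counts are far from what independence would predict.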

The requirement of this assignment is to: Run the appropriate test, post the syntax used, and interpret your findings. In addition, use post-hoc tests if appropriate. Please see the samples below for guidance in writing statistical findings.

Sample Submission: 

  • Example of how to write results for ANOVA:
    • When examining the association between current number of cigarettes smoked (quantitative response) and past year nicotine dependence (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among daily, young adult smokers (my sample), those with nicotine dependence reported smoking significantly more cigarettes per day (Mean=14.6, s.d. ±9.15) compared to those without nicotine dependence (Mean=11.4, s.d. ±7.43), F(1, 1313)=44.68, p=.0001.
    • Post hoc ANOVA results: ANOVA revealed that among daily, young adult smokers (my sample), number of cigarettes smoked per day (collapsed into 5 ordered categories, which is the categorical explanatory variable) and number of nicotine dependence symptoms (quantitative response variable) were significantly associated, F (4, 1308)=11.79, p=.0001. Post hoc comparisons of mean number of nicotine dependence symptoms by pairs of cigarettes per day categories revealed that those individuals smoking more than 10 cigarettes per day (i.e. 11 to 15, 16 to 20 and >20) reported significantly more nicotine dependence symptoms compared to those smoking 10 or fewer cigarettes per day (i.e. 1 to 5 and 6 to 10). All other comparisons were statistically similar.
  • Chi-Square Test of Independence
    • When examining the association between lifetime major depression (categorical response) and past year nicotine dependence (categorical explanatory), a chi-square test of independence revealed that among daily, young adult smokers (my sample), those with past year nicotine dependence were more likely to have experienced major depression in their lifetime (36.2%) compared to those without past year nicotine dependence (12.7%), X2 =88.60, 1 df, p=.0001.
    • Post hoc Chi-Square results: A Chi-Square test of independence revealed that among daily, young adult smokers (my sample), number of cigarettes smoked per day (collapsed into 5 ordered categories) and past year nicotine dependence (binary categorical variable) were significantly associated, X2 =45.16, 4 df, p=.0001. Post hoc comparisons of rates of nicotine dependence by pairs of cigarettes per day categories revealed that higher rates of nicotine dependence were seen among those smoking more cigarettes, up to 11 to 15 cigarettes per day. In comparison, prevalence of nicotine dependence was statistically similar among those groups smoking 11 to 15, 16 to 20, and > 20 cigarettes per day.
    • For an example when response variable is categorical with more than 2 levels, please refer to document (Additional Sample Result Write – Chi Square).
  • Correlation
    • Among daily, young adult smokers (my sample), the correlation between number of cigarettes smoked per day (quantitative) and number of nicotine dependence symptoms experienced in the past year (quantitative) was 0.17 (p=.0001), suggesting that only 3% (i.e. 0.17 squared) of the variance in number of current nicotine dependence symptoms can be explained by number of cigarettes smoked per day.
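
The arithmetic behind the correlation write-up above can be checked with a short Python sketch. The value r = 0.17 is the one reported in the sample; the `pearson_r` function and the small `x` and `y` lists are hypothetical and only show how a correlation coefficient is computed from paired values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Squaring the reported correlation gives the proportion of variance explained.
r_reported = 0.17
r_squared = r_reported ** 2
print(f"r = {r_reported}, r squared = {r_squared:.4f} (about {r_squared:.0%})")

# Invented paired values, only to show the function in use.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(f"r for the invented data = {r:.2f}")
```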

Project Component I

Directions: Revisit the graphing material and construct a graph that utilizes your main explanatory and response variable and an additional 3rd variable. Describe your findings.

Project Component J

Directions: Run the appropriate regression (linear regression or logistic regression) using only your main explanatory and response variable. You should compare the results of this regression to your findings in Project Component H.

Sample Submission:

  • Example of how to write results for Simple Regression: Major depression (Beta=1.34, CI 1.29-1.39, p=.0001) was significantly and positively associated with number of nicotine dependence symptoms. On average, someone with major depression is expected to have 1.34 more symptoms than someone without major depression.
  • Example of how to write results for Logistic Regression: Major depression (O.R. 4.0, CI 2.94-5.37) was significantly and positively associated with the likelihood of meeting criteria for nicotine dependence.
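
Two facts make these write-ups easier to interpret when the explanatory variable is binary: the slope from a simple linear regression equals the difference between the two group means, and the odds ratio from a logistic regression with that single predictor equals the odds ratio of the 2 x 2 table. The Python sketch below illustrates both with invented data; the function names and values are hypothetical, and it computes no confidence intervals or p-values.

```python
def simple_slope(x, y):
    """Least-squares (slope, intercept) for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table of counts.

    a, b: exposed group with / without the outcome
    c, d: unexposed group with / without the outcome
    """
    return (a / b) / (c / d)


# Invented data: 0 = no major depression, 1 = major depression.
depressed = [0, 0, 0, 0, 1, 1, 1, 1]
symptoms = [1, 2, 2, 3, 3, 3, 4, 4]

slope, intercept = simple_slope(depressed, symptoms)
print(f"intercept = {intercept:.2f} (mean symptoms without depression)")
print(f"slope = {slope:.2f} (additional symptoms with depression)")

# Invented counts: 20 of 30 depressed and 10 of 30 non-depressed are dependent.
print(f"odds ratio = {odds_ratio(20, 10, 10, 20):.1f}")
```

In this invented example the group means are 2.0 and 3.5, so the slope is 1.5, which is read the same way as the Beta in the sample write-up.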

Project Component K

Directions: Clean up your code! For this project component you will submit your project code up until this point to your blog. All of your code should be compiled into a single file.

  • R students should submit a “.R” file
  • Stata students should submit a “.do” file
  • SAS students should submit a “.sas” file

Please see Sample Submissions (in Moodle) for assistance on how your code should be organized into a single file.



Project Component L

Directions: Run the appropriate regression (linear regression or logistic regression) using your main explanatory and response variable and additional covariates or potential confounding variables of interest. Describe your findings.
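
To see why adding a covariate can change your main estimate, the Python sketch below fits a linear regression with two predictors using the closed-form least-squares solution and compares the adjusted slope with the unadjusted one. Everything here is invented for illustration: the data were constructed so that the response equals 1 + depression + smoking exactly, and the function names are hypothetical. In your project you would run the regression in your course software, which also reports standard errors and p-values.

```python
def crude_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes (b1, b2) for y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero only if x1 and x2 are perfectly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


# Invented data: depression (0/1), a smoking level, and a symptom count
# constructed as 1 + depression + smoking.
depression = [0, 0, 0, 0, 1, 1, 1, 1]
smoking = [1, 2, 2, 3, 2, 3, 3, 4]
symptoms = [2, 3, 3, 4, 4, 5, 5, 6]

unadjusted = crude_slope(depression, symptoms)
adjusted, smoking_slope = two_predictor_slopes(depression, smoking, symptoms)
print(f"unadjusted depression slope: {unadjusted:.2f}")
print(f"adjusted depression slope:   {adjusted:.2f}")
print(f"smoking slope:               {smoking_slope:.2f}")
```

Because the depressed group in this invented example also smokes more, the unadjusted slope (2.0) mixes the two effects; once smoking is in the model, the depression slope drops to 1.0. This is the pattern to look for when you describe the role of potential confounders.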

Project Component M

Directions: For our final poster session you will be creating your own WordPress page. For this project component you will:

1) Visit the site and log in using your Wesleyan username (does not include …). Click on “QAC 201 Applied Data Analysis” in the upper left corner. Then, find the link “Instructions for Authors/Participants” and click on “How to Create a WordPress page”.

2) Follow the instructions to create your own WordPress page and include the following information:

    • Name
    • Biography
    • Picture
    • Project Title
    • Zoom link (create a Zoom link under your Wesleyan account that is accessible to anyone with the link; do not include a waiting room or any specific date/time). You will be using this link for your final presentation.
    • (For now, you will not be including a poster – you will add this in later).