Mini-Assignment 10 – Applied Data Analysis

Directions

Familiarize yourself with the codebook for the movies dataset below and then import/load the dataset.

Question 1: Suppose you want to determine whether movie budget is significantly associated with the movie rating. Construct the appropriate regression. Is budget significantly associated with movie rating? Interpret the term in the model that describes the relationship.
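
As a rough illustration of the kind of model Question 1 calls for, the sketch below fits a simple linear regression of rating on budget_millions by least squares and reports the slope with its t statistic. The six data points are invented for illustration and are not from the movies dataset; the helper name simple_ols is hypothetical, and in practice a statistics package would report the same quantities along with a p-value.

```python
import math

# Hypothetical toy data (NOT from the movies dataset):
# budget in millions of dollars and average IMDB user rating.
budget_millions = [1.0, 5.0, 10.0, 20.0, 40.0, 80.0]
rating = [6.8, 6.1, 6.4, 5.9, 6.0, 5.5]


def simple_ols(x, y):
    """Return intercept, slope, slope standard error, and t statistic."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return intercept, slope, se_slope, slope / se_slope


intercept, slope, se_slope, t_stat = simple_ols(budget_millions, rating)
print(f"rating = {intercept:.3f} + {slope:.4f} * budget_millions")
print(f"t = {t_stat:.2f} on {len(rating) - 2} degrees of freedom")
```

The slope is the term to interpret: it estimates the change in average rating associated with a one-unit increase in budget_millions.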

Question 2: Suppose you want to determine whether movie budget is significantly associated with movie rating after controlling for MPAA designation. Construct the appropriate regression. Is budget significantly associated with movie rating when controlling for MPAA designation?

Question 3: Using your model from the previous question, are NC-17 movies rated significantly differently than R-rated movies when controlling for budget?
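
Whether the regression output answers Question 3 directly depends on which MPAA level serves as the reference category. The sketch below shows indicator (dummy) coding under the assumption that R is chosen as the reference; the mpaa values listed and the helper name dummy_code are hypothetical. With this coding, the coefficient on the NC-17 indicator estimates the difference in mean rating between NC-17 and R movies at the same budget.

```python
# Hypothetical values of the mpaa column (NOT read from the movies dataset).
mpaa = ["R", "PG-13", "NC-17", "PG", "R", "NC-17"]


def dummy_code(values, reference, prefix="mpaa"):
    """Expand a categorical column into 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# Assumption: R is the reference level, so R rows are all zeros.
rows = dummy_code(mpaa, reference="R")
for label, row in zip(mpaa, rows):
    print(label, row)
```

If the software picks a different reference level by default, the NC-17 versus R comparison is not one of the reported coefficients until the reference is changed.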

Question 4: There is reason to believe that the relationship between budget and viewer rating may differ depending on whether the movie is a Comedy. Construct an appropriate graph that allows you to assess this theory. Does it visually appear that Comedy-status moderates the relationship between budget and viewer rating?

Question 5: Create a subset that includes only Comedies. With this subset, construct a model that determines whether there is a relationship between budget and viewer rating. Among comedies, is there a relationship between budget and viewer rating?

Question 6: Create a subset that includes only non-Comedies. With this subset, construct a model that determines whether there is a relationship between budget and viewer rating. Among non-comedies, is there a relationship between budget and viewer rating?

Question 7: Does the relationship between budget and viewer rating vary based on whether a movie is a comedy?

Question 8: Construct a model using the whole data set that assesses whether the relationship between budget and viewer rating varies significantly based on whether a movie is a Comedy. According to this single regression equation, does the relationship between budget and viewer rating vary significantly based on whether a movie is a comedy?
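
The link between the two subset models (Questions 5 and 6) and the single model in Question 8 can be sketched as follows. With a binary moderator, the least-squares estimates of a model containing budget, Comedy, and their product reproduce the two subset lines exactly, so the interaction coefficient equals the difference between the two subset slopes. The eight rows and the helper name slope_intercept are invented for illustration; the sketch shows point estimates only, and the significance test on the interaction term still has to come from the fitted model.

```python
def slope_intercept(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical toy rows (NOT from the movies dataset):
# (budget_millions, rating, Comedy)
movies = [
    (2.0, 6.5, 1), (10.0, 6.0, 1), (30.0, 5.6, 1), (60.0, 5.1, 1),
    (2.0, 6.2, 0), (10.0, 6.3, 0), (30.0, 6.4, 0), (60.0, 6.8, 0),
]

comedy = [(b, r) for b, r, c in movies if c == 1]
other = [(b, r) for b, r, c in movies if c == 0]

slope_c, int_c = slope_intercept(*zip(*comedy))
slope_o, int_o = slope_intercept(*zip(*other))

# rating = b0 + b1*budget + b2*Comedy + b3*(budget*Comedy)
b0, b1 = int_o, slope_o          # line for non-comedies (Comedy = 0)
b2 = int_c - int_o               # shift in intercept for comedies
b3 = slope_c - slope_o           # interaction: difference in slopes

print(f"non-comedy slope = {b1:.4f}, comedy slope = {b1 + b3:.4f}")
print(f"interaction coefficient b3 = {b3:.4f}")
```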

Familiarize yourself with the codebook for the nhanes dataset below and then import/load the dataset.

Question 9: While flawed and not necessarily an accurate description of health for many body types, the classifications below are still often used by the medical community to categorize BMIs. Using the classifications below, construct a new variable “BMI category” based on the quantitative BMI variable in the data set. The categories of your new variable should be:

“(2) Below Target”: for BMIs under 18.5
“(1) Target”: for BMIs 18.5-24.9
“(3) Above Target”: for BMIs 25-29.9
“(4) Obese”: for BMIs 30-39.9
“(5) Morbidly obese”: for BMIs of 40 or greater
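
One way to build this variable is sketched below. The function name bmi_category is hypothetical, and the sketch makes two assumptions about the boundaries: values that fall between the listed ranges (for example 24.95) go to the lower category, and a BMI of exactly 40 goes to the top category. Missing BMI values are passed through as None.

```python
def bmi_category(bmi):
    """Map a numeric BMI to one of the five labelled categories."""
    if bmi is None:
        return None  # keep missing values missing
    if bmi < 18.5:
        return "(2) Below Target"
    if bmi < 25:
        return "(1) Target"
    if bmi < 30:
        return "(3) Above Target"
    if bmi < 40:
        return "(4) Obese"
    return "(5) Morbidly obese"


# Hypothetical BMI values, including one missing entry.
sample = [17.2, 22.0, 27.5, 33.1, 41.8, None]
labels = [bmi_category(b) for b in sample]
print(labels)

# Sorting the labels puts "(1) Target" first.
print(sorted(label for label in labels if label is not None)[0])
```

The numeric prefixes presumably exist so that software ordering the levels alphabetically treats “(1) Target” as the reference category.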

Suppose you want to determine whether BMI category (the explanatory variable) is associated with the likelihood of having diabetes (the response variable). Construct the appropriate visualization to help you assess this relationship. Describe the visual relationship between BMI categorization and diabetes.

Question 10: Suppose you want to determine whether BMI category is significantly associated with diabetes. Construct the appropriate regression. The model estimates that the odds of having diabetes are ______ times higher for those who are morbidly obese compared to those in the “target” zone.
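
The blank in Question 10 is an odds ratio, which is obtained by exponentiating a logistic regression coefficient. The sketch below uses invented counts (not taken from the nhanes data) to show the arithmetic: when BMI category is the only predictor, the coefficient on a category's indicator equals the log of the odds ratio between that category and the reference category.

```python
import math

# Hypothetical counts (NOT from the nhanes dataset).
counts = {
    "(1) Target": {"diabetes": 20, "no_diabetes": 480},
    "(5) Morbidly obese": {"diabetes": 30, "no_diabetes": 120},
}


def odds(group):
    """Odds of diabetes = number with diabetes / number without."""
    return group["diabetes"] / group["no_diabetes"]


odds_ratio = odds(counts["(5) Morbidly obese"]) / odds(counts["(1) Target"])

# The logistic regression coefficient for the indicator is the log odds ratio,
# so exponentiating the coefficient recovers the odds ratio.
coefficient = math.log(odds_ratio)
print(f"coefficient = {coefficient:.3f}, exp(coefficient) = {math.exp(coefficient):.2f}")
```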

Question 11: Now suppose you want to determine whether the quantitative version of BMI is significantly and positively associated with the likelihood of diabetes. Construct the appropriate regression. Describe the relationship between BMI and diabetes.

Question 12: There is reason to believe that the relationship between BMI and diabetes varies based on gender. Construct the appropriate regression that allows you to address this question. Does the relationship between BMI and diabetes vary significantly based on gender?

CODEBOOK: Movies Data

The Internet Movie Database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. More information about imdb.com can be found online, http://imdb.com/help/show_leaf?about, including information about the data collection process, http://imdb.com/help/show_leaf?infosource.

The description of the data is as follows:

title. Title of the movie.
year. Year of release.
budget_millions. Total budget (if known) in millions of US dollars.
length. Length in minutes.
rating. Average IMDB user rating.
votes. Number of IMDB users who rated this movie.
r1-10. Multiplying by ten gives the percentage (to the nearest 10%) of users who gave this movie each rating, from r1 (rated it a 1) through r10 (rated it a 10).
mpaa. MPAA designation.
Action, Animation, Comedy, Drama, Documentary, Romance, Short. Binary variables representing if movie was classified as belonging to that genre.

CODEBOOK: NHANES

This is survey data collected by the US National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys over the years.

The variables in this data set include:

Variable Name	Description
ID	Unique identifier
Gender	male/female
Age	Age of participant (in years)
Race	Black, Hispanic, Mexican, White, Other
Education	Highest level of education
MaritalStatus	Divorced, Never Married, Married, Separated, Widowed
Testosterone	Testosterone total (ng/dL).
GeneralHealth	Self-reported overall health (Poor, Fair, Good, Vgood, Excellent)
BMI	Body mass index
PhysActiveDays	Number of days in a typical week that participant does vigorous-intensity activity.
Diabetes	Indicates whether or not someone has diabetes (1)=diabetes, (0)=no diabetes
BP_Sys_Reading1	Systolic blood pressure (mm Hg) at beginning of appointment
BP_Sys_Reading2	Systolic blood pressure (mm Hg) at end of appointment
Work	Indicates whether or not someone is working (1)=working, (0)=not working
TotChol	Total HDL cholesterol (mmol/L)