Data Management

Pre-Class Readings and Videos

Examining frequency distributions for each of your variables is the key to further guiding the decision making involved in quantitative research.

EXAMPLE:

A random sample of 1,200 U.S. college students was asked the following questions as part of a larger survey: “What is your perception of your own body? Do you feel that you are overweight, underweight, or about right?” The following table shows part of the data (5 of the 1,200 observations):

STUDENT       BODY IMAGE
Student 25    Overweight
Student 26    About Right
Student 27    Underweight
Student 28    About Right
Student 29    About Right

Here is some information that would be interesting to get from these data:

  • What percentage of the sampled students fall into each category?
  • How are students divided across the three body image categories? Are they equally divided? If not, do the percentages follow some other kind of pattern?

We cannot answer these questions by looking at the raw data, which are a long list of 1,200 responses and thus not very useful in that form. However, both of these questions are easily answered once we summarize and look at the frequency distribution of the variable BodyImage (i.e., once we summarize how often each of the categories occurs).

In order to summarize the distribution of a categorical variable, we ask our statistical software program to create a table of the different values (categories) the variable takes, how many times each value occurs (count), and, more importantly, how often each value occurs (percentages). Here is the table (i.e. frequency distribution) for our example:

Table 7.1: Body Image Distribution
CATEGORY       COUNT    PERCENTAGE
About Right      855    71.3%
Overweight       235    19.6%
Underweight      110    9.2%
Total          1,200    100%
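
The course videos use a statistical package for this step. As a language-neutral illustration, here is a minimal Python sketch of how such a frequency table could be computed. The list `responses` is a hypothetical stand-in for the raw data, rebuilt from the counts in Table 7.1, and the function name `frequency_table` is an assumption made for this example.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP


def frequency_table(values):
    """Return (category, count, percentage) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    for category, count in counts.most_common():
        # Round half up to one decimal place, so 71.25 is reported as 71.3.
        pct = (Decimal(100 * count) / Decimal(total)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        rows.append((category, count, pct))
    return rows


# Hypothetical raw data: one response per student, rebuilt from Table 7.1.
responses = ["About Right"] * 855 + ["Overweight"] * 235 + ["Underweight"] * 110

table = frequency_table(responses)
for category, count, pct in table:
    print(f"{category:<12}{count:>6}{str(pct):>7}%")
```

Because each percentage is rounded separately, the rounded values need not sum to exactly 100.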

Please watch the video below.

Find Stata video here.

Data Management 

During the class session, we will begin to work through how to make decisions about data management and how to put those decisions into action.

An understanding of basic operations to be used with your statistical software is a good place to start.

Examples of data management decisions:

1. Need to identify missing data

Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value, you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.
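
To see why this matters, consider a minimal Python sketch (your statistical software will have its own syntax for this). The data values are hypothetical, the code 9 is taken from the example above, and `None` plays the role of the missing data character.

```python
MISSING_CODE = 9  # code used for "missing" in the example above


def recode_missing(values, missing_code=MISSING_CODE):
    """Replace the missing-data code with None so it is not analyzed as a real value."""
    return [None if v == missing_code else v for v in values]


raw = [1, 3, 9, 2, 9, 4]  # hypothetical responses; 9 means "missing"
clean = recode_missing(raw)

# Mean with the 9s wrongly treated as real responses.
naive_mean = sum(raw) / len(raw)

# Mean over the observed (non-missing) responses only.
observed = [v for v in clean if v is not None]
correct_mean = sum(observed) / len(observed)

print(clean)
print(naive_mean, correct_mean)
```

The two means differ because the naive version counts every 9 as a genuine response.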

2. Need to recode responses to “no” based on skip patterns

Some data sets contain a number of skip patterns. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g., quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g., have you ever smoked marijuana daily for a month or more?), those individuals who never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer is “no”: they have never smoked marijuana daily. This would need to be explicitly recoded. Note that we commonly code a no as 0 and a yes as 1.
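
The recode described above might look like the following Python sketch. The field names `ever_used` and `daily_use` are hypothetical, both are assumed to be coded 0 = no and 1 = yes, and `None` stands for a missing value.

```python
def recode_skip(ever_used, daily_use):
    """Fill in 0 ("no") for a follow-up question skipped by a never-user."""
    if ever_used == 0 and daily_use is None:
        return 0
    # Otherwise keep the recorded answer; a user with no answer stays missing.
    return daily_use


# Hypothetical respondents.
respondents = [
    {"ever_used": 1, "daily_use": 1},
    {"ever_used": 0, "daily_use": None},  # skipped because they never used
    {"ever_used": 1, "daily_use": None},  # truly missing
    {"ever_used": 1, "daily_use": 0},
]

recoded = [recode_skip(r["ever_used"], r["daily_use"]) for r in respondents]
print(recoded)
```

Only the missing value that results from the skip pattern is recoded. A respondent who has used marijuana but did not answer the follow-up question remains missing.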

3. Need to collapse response categories

If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. Consider the variable S1Q6A from the data frame NESARC, which has 14 levels that record the highest level of education of the participant. To collapse the categories into a dichotomous variable that indicates the presence of a high school degree, use the ifelse function. The levels 1, 2, 3, 4, 5, 6, and 7 of the variable S1Q6A correspond to education levels less than completing high school.
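
The same two-way split can be sketched in Python with a conditional expression, which plays a role similar to ifelse. The values below are hypothetical, the function name `has_hs_degree` is an assumption, and levels 8 through 14 are taken to mean high school completed or beyond, since levels 1 through 7 are described as less than completing high school.

```python
def has_hs_degree(s1q6a):
    """Collapse the 14 education levels into a Yes/No high school indicator."""
    if s1q6a is None:
        return None  # keep missing values missing
    # Levels 1-7: less than high school; levels 8-14: high school or more.
    return "No" if s1q6a <= 7 else "Yes"


s1q6a_values = [3, 8, 14, 7, None, 10]  # hypothetical S1Q6A responses
hs_degree = [has_hs_degree(v) for v in s1q6a_values]
print(hs_degree)
```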

4. Need to aggregate variables

In many cases, you will want to combine multiple variables into one. Consider creating a new variable DepressLife, which is Yes if the variable MAJORLIFE or DYSLIFE is a 1 (data frame NESARC).
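
A minimal Python sketch of this rule follows. The rows are hypothetical, and both MAJORLIFE and DYSLIFE are assumed to be coded 0 or 1 with no missing values.

```python
def depress_life(majorlife, dyslife):
    """Return "Yes" if either indicator equals 1, otherwise "No"."""
    return "Yes" if majorlife == 1 or dyslife == 1 else "No"


# Hypothetical rows with the two source variables.
rows = [
    {"MAJORLIFE": 1, "DYSLIFE": 0},
    {"MAJORLIFE": 0, "DYSLIFE": 0},
    {"MAJORLIFE": 0, "DYSLIFE": 1},
    {"MAJORLIFE": 1, "DYSLIFE": 1},
]

# Add the combined variable to each row.
for row in rows:
    row["DepressLife"] = depress_life(row["MAJORLIFE"], row["DYSLIFE"])

print([row["DepressLife"] for row in rows])
```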

5. Need to create continuous variables

If you are working with a number of items that represent a single construct, it may be useful to create a composite variable/score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent one composite score.
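
As an illustration, the recode-then-sum steps could be sketched in Python as follows. The item names and the raw coding (1 = yes, 2 = no, `None` = missing) are assumptions made for this example, as is the choice to leave the score missing when any item is missing.

```python
SYMPTOMS = ["tolerance", "withdrawal", "craving"]  # illustrative item names


def to_indicator(raw):
    """Recode a raw item (assumed 1 = yes, 2 = no) so that yes = 1 and no = 0."""
    if raw is None:
        return None
    return 1 if raw == 1 else 0


def symptom_count(person):
    """Sum the recoded items into one composite score."""
    indicators = [to_indicator(person[item]) for item in SYMPTOMS]
    if any(value is None for value in indicators):
        return None  # one possible choice: no score if any item is missing
    return sum(indicators)


# Hypothetical respondents.
people = [
    {"tolerance": 1, "withdrawal": 1, "craving": 2},
    {"tolerance": 2, "withdrawal": 2, "craving": 2},
    {"tolerance": 1, "withdrawal": None, "craving": 1},
]

scores = [symptom_count(p) for p in people]
print(scores)
```

How to handle respondents with some missing items is itself a data management decision. Other choices, such as summing only the observed items, are possible and should be stated explicitly.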

6. Labeling variable responses/values

Given that nominal and ordinal variables have, or are given, numeric response values (i.e., dummy codes), it can be useful to label those values so that the labels are displayed in your output.
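
One way to picture value labels is as a lookup table from codes to text, as in this Python sketch. The numeric codes for the body image categories are hypothetical and were chosen only for illustration.

```python
from collections import Counter

# Hypothetical dummy codes for the body image variable.
BODY_IMAGE_LABELS = {1: "About Right", 2: "Overweight", 3: "Underweight"}

codes = [2, 1, 3, 1, 1]  # Students 25 to 29 under the assumed coding

# Replace each code with its label; unknown codes are flagged, not dropped.
labeled = [BODY_IMAGE_LABELS.get(code, "Unlabeled") for code in codes]

print(labeled)
print(Counter(labeled))
```

The output now shows category names rather than bare numbers, which makes tables easier to read.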

7. Need to further subset the sample

When using large data sets, it is often necessary to subset the data so that you are including only those observations that can assist in answering your particular research question. In these cases, you may want to select your own sample from within the survey’s sampling frame. For example, if you are interested in identifying demographic predictors of depression among Type II diabetes patients, you would plan to subset the data to subjects endorsing Type II diabetes.

NOTE: Often, you will need to create groups or sub-samples from the data set for the purpose of making comparisons. It is important to be certain that the groups that you would like to compare are of adequate size and number. For example, if you were interested in comparing complications of depression in parents who had lost a child through miscarriage vs. parents who had lost a child in the first year of life, it would be important to have large enough groups of each. It would not be appropriate to attempt to compare 5000 observations in the miscarriage group to only 9 observations in the first year group.
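
Subsetting and checking group sizes can be sketched in Python as follows. The records, the field names `type2_diabetes` and `sex`, and the threshold `MIN_GROUP_SIZE` are all hypothetical. The threshold is an arbitrary illustration, not a rule from this material.

```python
from collections import Counter

MIN_GROUP_SIZE = 30  # arbitrary illustrative threshold, not a fixed rule

# Hypothetical survey records.
records = [
    {"id": 1, "type2_diabetes": 1, "sex": "F"},
    {"id": 2, "type2_diabetes": 0, "sex": "M"},
    {"id": 3, "type2_diabetes": 1, "sex": "M"},
    {"id": 4, "type2_diabetes": 1, "sex": "F"},
    {"id": 5, "type2_diabetes": 0, "sex": "F"},
]

# Keep only the observations relevant to the research question.
subset = [r for r in records if r["type2_diabetes"] == 1]

# Count the observations in each comparison group within the subset.
group_sizes = Counter(r["sex"] for r in subset)

# Flag groups that look too small to compare.
too_small = sorted(g for g, n in group_sizes.items() if n < MIN_GROUP_SIZE)

print(len(subset), dict(group_sizes), too_small)
```

Running a check like this before any comparison makes very uneven or very small groups visible early.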

Pre-Class Quiz

After reviewing the material above, complete Quiz 4 (click here). Please note that you have 2 attempts for this quiz and the higher grade prevails.

During Class Tasks

Mini-Assignment 3
Project Component E