Translation Syntax

Loading the Dataset

Python

import pandas as pd
import numpy as np
myData = pd.read_csv('/path/to/file.csv')
SPSS

GET FILE="/path/to/file.sav".
STATA

use "/path/to/file.dta"
SAS
/* To load a file with extension .sas7bdat */
LIBNAME myFolder "P:\QAC\QAC201\Studies and Codebooks\StudyName\Data";
data new; set myFolder.filename;

/*To import a file with extension .csv*/

proc import datafile="P:\QAC\qac201\PracticeData\data_name.csv"  out=data_name
 dbms=csv
 replace;
 getnames=yes;
run; 
R
# If extension is .Rdata
load("/path/to/file.Rdata")

# If extension is .csv
myData <- read.csv("/path/to/file.csv")

Selecting Variables

Python

myData=myData_orig[['VAR1','VAR2','VAR3','VAR4','VAR5','VAR6','VAR7','VAR8']]
SPSS

* Put this as a subcommand of the SAVE OUTFILE command; the outfile will contain only the specified variables.
/KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8.
STATA

use VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 /// 
   using "P:\QAC\qac201\Studies and Codebooks\StudyName\Data\filename", clear
SAS

* put this code inside a data step;
KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8; 
R

var.keep <- c("VAR1", "VAR2", "VAR3", "VAR4", "VAR5", "VAR6", "VAR7", "VAR8")
myData <- myData_orig[ ,var.keep]

Saving the Dataset

Python
# myData is a pandas DataFrame (pandas is imported as pd)
myData.to_csv('filename.csv')
SPSS

SAVE OUTFILE= "/path/to/outfile.sav".
STATA

save "/path/to/outfile.dta"
SAS

LIBNAME myFolder "/path/to/outFolder";
data myFolder.outfile; set tempfile; by unique_id; 
R

# write as csv
write.table(myData, file = "/path/to/outfile.csv", sep = ",", row.names = FALSE)

# write as .RData
save(file = "/path/to/outfile.RData", myData)

Sorting the Data

Python

myData=myData.sort_values(by='unique_id')
SPSS

SORT CASES BY var.
STATA

sort var
SAS

proc sort; by var; 
R

myData <- myData[order(myData$var, decreasing = FALSE), ]

Data Management

Logical Operators

Python == >= <= > < !=
SPSS EQ or = >= or GE <= or LE > or GT < or LT NE
STATA == >= <= > < !=
SAS EQ or = >= or GE <= or LE > or GT < or LT != or NE
R == >= <= > < !=

Selecting Observations
When using large data sets, it is often necessary to subset the data so that you are including only those observations that can assist in answering your particular research question. In these cases, you may want to select your own sample from within the survey’s sampling frame. For example, if you are interested in identifying demographic predictors of depression among Type II diabetes patients, you would plan to subset the data to subjects endorsing Type II Diabetes.

Python
title_of_subsetted_data=myData[myData.diabetes2==1]

#Or you can use the ['variable'] to access variables

title_of_subsetted_data = myData[myData['diabetes2']==1]
SPSS

*must be added as a command option.
/SELECT=diabetes2 EQ 1 
STATA

// create a subset from the data
keep if (diabetes2==1)

// if running a procedure on a subset of the data (format: procedure [arguments] if [condition]). for example, if you want to run a frequency table on bio_sex for participants with type II diabetes

tab bio_sex if diabetes2==1
SAS

* inside the data step;
if diabetes2=1; 
R

# create a subset of the data 
myDataSubset <- myData[myData$diabetes2 == 1, ]
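
As a quick check on the Python subsetting shown in this section, the sketch below builds a small made-up data frame and confirms that the subset keeps only rows where diabetes2 equals 1. The values and the bio_sex coding are invented for illustration and do not come from any survey.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
myData = pd.DataFrame({
    "diabetes2": [1, 0, 1, 0, 1],
    "bio_sex":   [1, 2, 2, 1, 1],
})

# Keep only respondents endorsing Type II diabetes; .copy() gives an
# independent data frame that can be modified without touching myData
myDataSubset = myData[myData["diabetes2"] == 1].copy()

print(len(myData), "rows before,", len(myDataSubset), "rows after")
print(myDataSubset["bio_sex"].value_counts())
```

Comparing the row counts before and after subsetting is a cheap way to catch a condition that accidentally drops everyone or no one.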

Missing Data
Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value, you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.

Python
myData['VAR1'] = myData['VAR1'].replace(9, np.nan)  # np is numpy, imported as np when loading the data
SPSS

RECODE VAR1 (9=SYSMIS). 
STATA

replace VAR1=. if VAR1==9 
SAS

* inside the data step;
if VAR1=9 then VAR1=.;
R

myData$VAR1[myData$VAR1 == 9 ] <- NA
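
To see why the recoding matters, the following sketch compares the mean of a variable before and after the code 9 is set to missing. The data are made up for illustration; only the use of 9 as the missing-value code comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in which 9 is the code for a missing response
myData = pd.DataFrame({"VAR1": [1, 2, 9, 3, 9]})

# The 9s are treated as real values and inflate the mean
mean_with_9 = myData["VAR1"].mean()

# Recode 9 to missing; pandas skips NaN when computing the mean
myData["VAR1"] = myData["VAR1"].replace(9, np.nan)
mean_without_9 = myData["VAR1"].mean()

print("mean with 9 left in:", mean_with_9)
print("mean after recoding:", mean_without_9)
```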

Converting String to a Dummy Coded Variable
When preparing to run statistical analyses in most software packages, it is important that all variables have numeric response categories rather than “string” or “character” ones (i.e. response categories that are actual strings of characters and/or symbols). All variables with string responses must therefore be recoded into numeric values. These numeric values are known as dummy codes in that they carry no direct numeric meaning.

Python
#method 1. define a function

def TREE_N(row):
    if row['TREE'] == 'Maple':
        return 1
    if row['TREE'] == 'Oak':
        return 2

myData['TREE_N'] = myData.apply(TREE_N, axis=1)  # axis=1 means apply to each row



#method 2. Alternatively, you can use the loc function

myData.loc[myData['TREE'] == 'Maple', 'NewVariable'] = 1
myData.loc[myData['TREE'] == 'Oak', 'NewVariable'] = 0


#method 3. use apply

myData['NewVariable'] = myData['TREE'].apply(lambda x: 1 if x == 'Maple' else (0 if x == 'Oak' else None))
SPSS

RECODE TREE ('Maple'=1) ('Oak'=2) INTO TREE_N.
STATA

generate TREE_N=.
replace TREE_N=1 if TREE=="Maple"
replace TREE_N=2 if TREE=="Oak"

// OR
encode TREE, gen(TREE_N) 
SAS

* inside the data step;
if TREE='Maple' then TREE_N=1;
else if TREE= 'Oak' then TREE_N=2;
R
myData$NewVariable[myData$TREE=="Maple"]<-1
myData$NewVariable[myData$TREE=="Oak"]<-0
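
Another way to write the Python recode is with a dictionary and `Series.map`, which keeps the string-to-number mapping in one place. This is a sketch on made-up data; the "Birch" value is an assumption added to show that any category missing from the dictionary becomes NaN rather than silently receiving a code.

```python
import pandas as pd

# Hypothetical toy data; "Birch" is deliberately left out of the mapping
myData = pd.DataFrame({"TREE": ["Maple", "Oak", "Oak", "Birch", "Maple"]})

# Dummy codes for each string category
tree_codes = {"Maple": 1, "Oak": 2}

# Categories not listed in the dictionary are mapped to NaN
myData["TREE_N"] = myData["TREE"].map(tree_codes)

print(myData)
```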

Collapsing Responses within a Categorical Variable
If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. For example, if you have the following categories for geographic region, you may want to collapse some of these categories:

Region: New England=1, Middle Atlantic=2, East North Central=3, West North Central=4, South Atlantic=5, East South Central=6, West South Central=7, Mountain=8, Pacific=9.

New_Region: East=1, West=2.

Python
# change levels

def new_region(row):
    if (row['region'] == 1 or row['region'] == 2 or row['region'] == 3 or
            row['region'] == 5 or row['region'] == 6):
        return 1
    elif (row['region'] == 4 or row['region'] == 7 or row['region'] == 8 or
            row['region'] == 9):
        return 2

# pass the function itself (not a call to it) to apply
myData['new_region'] = myData.apply(new_region, axis=1)


#Or you can write it more concisely

def new_region(row):
    if row['region'] in (1, 2, 3, 5, 6):
        return 1
    elif row['region'] in (4, 7, 8, 9):
        return 2

myData['new_region'] = myData.apply(new_region, axis=1)
SPSS

COMPUTE new_region=2. 
IF (region=1|region=2|region=3|region=5|region=6) new_region=1.
STATA

generate new_region =2
replace new_region=1 if region==1|region==2|region==3|region==5|region==6 

// OR
recode region (1/3 5 6=1) (4 7/9=2), gen(new_region)
SAS

* inside the data step;
if region=1 or region=2 or region=3 or region=5 or region=6 then new_region=1;
else if region=4 or region=7 or region=8 or region=9 then new_region=2;
R
# The below says, make a new variable called "new_region"
# and set it equal to "East" if the original variable region is
# either 1 or 2 or 3 or 5 or 6.
myData$new_region[myData$region == 1|myData$region == 2|myData$region == 3|myData$region == 5|myData$region == 6] <- "East"

# Set "new_region" equal to "West" if the original variable region is
# either 4 or 7 or 8 or 9.
myData$new_region[myData$region == 4|myData$region == 7|myData$region == 8|myData$region == 9] <- "West"
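
The East/West collapse described in this section can also be sketched in Python with a lookup dictionary, followed by a crosstab that verifies every original region landed in the intended new category. The toy data are invented; the region codes and the East=1, West=2 coding follow the lists above.

```python
import pandas as pd

# Hypothetical toy data using the region codes listed above
myData = pd.DataFrame({"region": [1, 4, 5, 9, 6, 7, 2]})

east = [1, 2, 3, 5, 6]   # regions collapsed into East (coded 1)
west = [4, 7, 8, 9]      # regions collapsed into West (coded 2)

# Build one dictionary that maps each old code to its new code
region_map = {code: 1 for code in east}
region_map.update({code: 2 for code in west})

myData["new_region"] = myData["region"].map(region_map)

# Check the recode: each region should appear under exactly one new code
print(pd.crosstab(myData["region"], myData["new_region"]))
```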

Collapsing Responses within a Quantitative Variable
Suppose we know the year participants were born and we would like to make a new variable that defines the generation they belong to.

Year: Year participant was born ranging from 1965 to 2023 (this is the existing variable)

Generation: 1=Generation Alpha (someone born in 2013 or later), 2=Generation Z (someone born between 1997 and 2012), 3=Millennial (someone born between 1981 and 1996), 4=Generation X (someone born between 1965 and 1980).

Python

# import pandas as pd 
myData.loc[myData['year'] >= 2013, 'generation'] = 1
myData.loc[(myData['year'] >= 1997) & (myData['year'] <= 2012), 'generation'] = 2
myData.loc[(myData['year'] >= 1981) & (myData['year'] <= 1996), 'generation'] = 3
myData.loc[(myData['year'] >= 1965) & (myData['year'] <= 1980), 'generation'] = 4

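SPSS

RECODE year (2013 THRU HIGHEST=1) (1997 THRU 2012=2) (1981 THRU 1996=3) (1965 THRU 1980=4) INTO generation.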

STATA
generate generation =. 
replace generation=1 if year >= 2013
replace generation=2 if year >= 1997 & year <=2012
replace generation=3 if year >=1981 & year <= 1996
replace generation=4 if year >= 1965 & year <= 1980
SAS

* inside the data step;
if year GE 2013 then generation=1;
if year GE 1997 and year LE 2012 then generation=2;
if year GE 1981 and year LE 1996 then generation=3;
if year GE 1965 and year LE 1980 then generation=4;
R
myData$generation[myData$year>=2013]<-1
myData$generation[myData$year>=1997 & myData$year <=2012]<-2
myData$generation[myData$year>=1981 & myData$year <= 1996]<-3
myData$generation[myData$year>=1965 & myData$year <= 1980]<-4
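
For Python users, `pd.cut` offers a compact alternative to writing one line per category. The sketch below assumes birth years fall between 1965 and 2023, as stated above; the bin edges are derived from the generation definitions and the toy data are invented.

```python
import pandas as pd

# Hypothetical toy data covering the boundary years of each generation
myData = pd.DataFrame({"year": [1965, 1980, 1981, 1996, 1997, 2012, 2013, 2023]})

# Bins are right-inclusive by default:
# (1964, 1980] -> 4, (1980, 1996] -> 3, (1996, 2012] -> 2, (2012, 2023] -> 1
myData["generation"] = pd.cut(
    myData["year"],
    bins=[1964, 1980, 1996, 2012, 2023],
    labels=[4, 3, 2, 1],
)

print(myData)
```

The result is a categorical column; years outside the bin range would come back as missing, which makes out-of-range values easy to spot.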

Collapsing Responses Across Variables
In many cases, you will want to combine multiple variables into one. For example, while NESARC assesses several individual anxiety disorders, I may be interested in anxiety more generally. In this case I would create a general anxiety variable in which those individuals who received a diagnosis of social phobia, generalized anxiety disorder, specific phobia, panic disorder, agoraphobia, or obsessive compulsive disorder would be coded “yes” and those who were free from all of these diagnoses would be coded “no”.

Python
def anxiety(row):
    if (row['socphob'] == 1 or row['gad'] == 1 or row['specphob'] == 1 or
            row['panic'] == 1 or row['agora'] == 1 or row['ocd'] == 1):
        return 1
    else:
        return 0

myData['anxiety'] = myData.apply(anxiety, axis=1)
SPSS

IF (socphob=1|gad=1|specphob=1|panic=1|agora=1|ocd=1) anxiety=1.
RECODE anxiety (SYSMIS=0). 
STATA

gen anxiety=1 if socphob==1|gad==1|specphob==1|panic==1|agora==1|ocd==1
replace anxiety=0 if anxiety==.
SAS

* inside the data step;
if socphob=1 or gad=1 or specphob=1 or panic=1 or agora=1 or ocd=1 then anxiety=1; 
else anxiety=0;
R

# Make a new variable called "anxiety" and set it equal to 0 if
# the person has none of the anxiety diagnoses (that is, if all
# variables (socphob, gad, specphob, panic, agora, ocd) are 0)
myData$anxiety[myData$socphob == 0 & myData$gad == 0 & myData$specphob == 0 & myData$panic == 0 & myData$agora == 0 & myData$ocd == 0] <- 0


# Set this new variable equal to 1 if the person has a "1" for any of the
# anxiety variables.
myData$anxiety[myData$socphob == 1 | myData$gad == 1 | myData$specphob == 1 | myData$panic == 1 | myData$agora == 1 | myData$ocd == 1] <- 1
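
In Python, the same "any diagnosis" logic can be sketched without a row-by-row function by testing all diagnosis columns at once. The data below are invented, and the variable names follow the SPSS, STATA, and SAS examples in this section. Note one assumption: a missing value is treated as "not 1", so a person with only missing values would be coded 0 here.

```python
import pandas as pd

# Hypothetical toy data: 1 = diagnosis present, 0 = absent
myData = pd.DataFrame({
    "socphob":  [0, 1, 0, 0],
    "gad":      [0, 0, 0, 0],
    "specphob": [0, 0, 1, 0],
    "panic":    [0, 0, 0, 0],
    "agora":    [0, 1, 0, 0],
    "ocd":      [0, 0, 0, 0],
})

diagnoses = ["socphob", "gad", "specphob", "panic", "agora", "ocd"]

# True where a column equals 1; any(axis=1) checks across columns within a row
myData["anxiety"] = myData[diagnoses].eq(1).any(axis=1).astype(int)

print(myData["anxiety"].value_counts())
```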

Creating Index or Score
If you are working with a number of items that represent a single construct, it may be useful to create a composite variable/score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent one composite score.

Python
myData['nd_sum'] = (myData['nd_symptom1'] + myData['nd_symptom2'] +
                    myData['nd_symptom3'] + myData['nd_symptom4'])
SPSS

COMPUTE nd_sum=sum(nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4).
STATA

egen nd_sum=rsum(nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4) 
SAS

* inside the data step;
nd_sum=sum(of nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4);
R

myData$nd_sum <- myData$nd_symptom1+myData$nd_symptom2+myData$nd_symptom3+myData$nd_symptom4
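
The two steps described in this section (recode each item to yes=1/no=0, then sum) can be sketched in Python as follows. The original coding of 1=yes and 2=no is an assumption made for the example, and the data are invented.

```python
import pandas as pd

# Hypothetical toy data; assumed original coding: 1 = yes, 2 = no
myData = pd.DataFrame({
    "nd_symptom1": [1, 2, 1],
    "nd_symptom2": [1, 2, 2],
    "nd_symptom3": [2, 2, 1],
    "nd_symptom4": [1, 2, 1],
})

symptoms = ["nd_symptom1", "nd_symptom2", "nd_symptom3", "nd_symptom4"]

# Step 1: recode no (2) to 0 so that yes=1 and no=0
recoded = myData[symptoms].replace({2: 0})

# Step 2: sum across the symptom columns within each row
myData["nd_sum"] = recoded.sum(axis=1)

print(myData)
```

One design point to be aware of: `sum(axis=1)` skips missing values by default, whereas adding columns with `+` yields a missing score whenever any item is missing. Which behavior is appropriate depends on how you want to treat incomplete responses.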

Labeling Variables
Given the often cryptic names that variables are given, it can sometimes be useful to label them.

Python

N/A
SPSS

VARIABLE LABELS VAR1 'label'.
STATA

label variable VAR1 "label"
SAS

* inside the data step;
LABEL VAR1='label'; 
R

# no built-in label tags for variables

Renaming Variables
Given the often cryptic names that variables are given, it can sometimes be useful to give a variable a new name (something that is easier for you to remember or recognize).

Python
myData=myData.rename({'oldvar':'newvar'}, axis='columns')
SPSS

* RENAME VARIABLES renames the variable in place; COMPUTE newvarname=VAR1 would instead create a copy under the new name.
RENAME VARIABLES (VAR1=newvarname).
STATA

rename VAR1 newvarname
SAS

* inside the data step;
RENAME VAR1=newvarname;
R

names(myData)[names(myData)== "VAR1"] <- "newvarname"

Labeling Variable Responses/Values
Given that nominal and ordinal variables have, or are given, numeric response values (i.e. dummy codes), it can be useful to label those values so that the labels are displayed in your output.

Python

#Because the function doesn't name the existing levels, make sure you list the labels in the same order as the existing categories.

myData['VAR1'] = myData['VAR1'].astype('category')
myData['VAR1'] = myData['VAR1'].cat.rename_categories(["value0label", "value1label",
                                                       "value2label", "value3label"])
SPSS

VALUE LABELS VAR1 0 'value0label' 1 'value1label' 2 'value2label' 3 'value3label'.
STATA

label define labelName 0 "value0label" 1 "value1label" 2 "value2label" 3 "value3label"
label values VAR1 labelName
SAS

* Set up format before the data step;
proc format; VALUE FORMATNAME 0="value0label" 1="value1label" 2="value2label" 3="value3label";

data myData; set myData;
   * other data management procedures;
   format VAR1 FORMATNAME.;
run;
R

# get order of the values
levels(myData$VAR1) 

# input the labels in the same order as how the values were printed above
levels(myData$VAR1) <- c("value0label", "value1label", "value2label", "value3label")
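
If relying on category order feels error-prone, one Python alternative is to spell out each value-label pair in a dictionary, much like the SPSS and STATA syntax does. The sketch below uses invented data and stores the labels in a new column (VAR1_label is a hypothetical name) so that the numeric codes stay available.

```python
import pandas as pd

# Hypothetical toy data with numeric response codes 0-3
myData = pd.DataFrame({"VAR1": [0, 2, 1, 3, 0]})

# Explicit value-to-label pairs, so order does not matter
value_labels = {
    0: "value0label",
    1: "value1label",
    2: "value2label",
    3: "value3label",
}

# Keep the numeric codes and add a labeled categorical copy
myData["VAR1_label"] = myData["VAR1"].map(value_labels).astype("category")

print(myData["VAR1_label"].value_counts(sort=False))
```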

Univariate Analysis

Categorical Variables (frequency)

Python

myData['VAR1'].value_counts(normalize = True, sort=False, dropna=False)

SPSS

FREQUENCIES VARIABLES=CategVar1 CategVar2 CategVar3
   /ORDER=ANALYSIS. 
STATA

tab1 CategVar1 CategVar2 CategVar3
SAS

proc freq; tables CategVar1 CategVar2 CategVar3; 
R

library(descr) # install library if needed
freq(as.ordered(myData$CategVar1))
freq(as.ordered(myData$CategVar2))
freq(as.ordered(myData$CategVar3))

Categorical Variables (Plot)

Python
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x="CategVar", data=myData)
plt.xlabel("Label for CategVar")
plt.title("Descriptive Title")
SPSS
FREQUENCIES VARIABLES=CategVar1 CategVar2 CategVar3
/BARCHART FREQ
/ORDER=ANALYSIS.
STATA
graph bar, over(CategVar)
SAS
proc gchart;
VBAR CategVar/ Discrete type=PCT Width=30;
R
library(ggplot2)
ggplot(data=myData)+
   geom_bar(aes(x=CategVar))+
   ggtitle("Descriptive Title")

Quantitative Variables (mean, sd, etc)

Python
myData['QuantVar1'].dropna().describe()
SPSS

DESCRIPTIVES VARIABLES=QuantVar1 QuantVar2 QuantVar3
   /STATISTICS=MEAN STDDEV. 
STATA

summarize QuantVar1 QuantVar2 QuantVar3
SAS

proc means; var QuantVar1 QuantVar2 QuantVar3; 
R

# Repeat for each variable 
summary(myData$QuantVar1) 
mean(myData$QuantVar1, na.rm = TRUE) 
sd(myData$QuantVar1, na.rm = TRUE)

Quantitative Variables (Plot)

Python
import matplotlib.pyplot as plt
import seaborn as sns


sns.histplot(data=myData, x='QuantVar', bins=30, kde=True)
plt.title("Descriptive Title Here")
SPSS
FREQUENCIES VARIABLES=QuantVar1 QuantVar2 QuantVar3
/FORMAT=NOTABLE
/HISTOGRAM.
STATA
histogram QuantVar
SAS
proc GCHART; VBAR QuantVar;
R
ggplot(data=myData)+
 geom_histogram(aes(x=QuantVar))+
 ggtitle("Descriptive Title Here")

Bivariate Analysis
Categorical-Categorical (crosstabs)

Python
#Create a contingency table 
tab1 = pd.crosstab(myData['CategResponseVar'], myData['CategExplanatoryVar'])

tab1_colProp = tab1.div(tab1.sum(axis=0), axis=1) # Proportions for each column

tab1_rowProp = tab1.div(tab1.sum(axis=1), axis=0) # Proportions for each row

tab1_cellProp = tab1.div(tab1.sum().sum()) # Proportions for the entire table


#OR can do it this way


#import pandas as pd

print (pd.crosstab(myData['CategResponseVar'],myData['CategExplanatoryVar'],margins=True))

# get column proportions
print(pd.crosstab(myData['CategResponseVar'],myData['CategExplanatoryVar'],margins=True,normalize='columns'))

# get row proportions
print(pd.crosstab(myData['CategResponseVar'],myData['CategExplanatoryVar'],margins=True,normalize='index'))

# get cell proportions
print (pd.crosstab(myData['CategResponseVar'],myData['CategExplanatoryVar'],margins=True,normalize='all'))
SPSS

* numbers.
CROSSTABS
   /TABLES=CategResponseVar by CategExplanatoryVar 
   /CELLS COUNT ROW COLUMN TOTAL. 

* visualization: use GUI point-and-click. 
STATA

// numbers
tab CategResponseVar CategExplanatoryVar, row column cell

// visualization
graph bar (mean) CategResponseVar, over(CategExplanatoryVar)
SAS

* numbers;
proc freq; tables CategResponseVar*CategExplanatoryVar; 

* visualization;
proc GCHART; vbar CategExplanatoryVar /discrete type=mean sumvar=CategResponseVar; 
R
tab1 <- table(myData$CategResponseVar, myData$CategExplanatoryVar)
tab1_colProp <- prop.table(tab1, 2) # column proportions
tab1_rowProp <- prop.table(tab1, 1) # row proportions
tab1_cellProp <- prop.table(tab1) # cell proportions

Categorical-Categorical (Plot)

Python
import matplotlib.pyplot as plt
import seaborn as sns


sns.barplot(data=myData, x='CategExplanatoryVar', y='BinaryResponseVar', estimator='mean')
plt.xlabel('Label for CategExplanatoryVar')
plt.ylabel('Label for BinaryResponseVar')
plt.title('Descriptive Title Here')
SPSS
* visualization: use GUI point-and-click.
STATA

// install and use catplot

ssc install catplot 

// to show frequencies (conditional) from a  two-way table

catplot, over(CategResponseVar) over(CategExplanatoryVar) blabel(bar)

// to show proportions (conditional; col percent) from a  two-way table

catplot, over(CategResponseVar) over(CategExplanatoryVar)  percent(CategExplanatoryVar) blabel(bar)

// visualization to show percents within group – can only be used
// when response variable has 2 levels.
// Requires data management that has response variable coded as a binary 0/1

graph bar BinaryCategoricalResponseVar, over(CategExplanatoryVar)

SAS

/*Code below assumes your response variable is coded as 1 and 0*/

Proc SGPLOT; vbar ExplVar /response=RespVar stat=mean;

R


# visualization - Assumes response variable is coded as 0/1

ggplot(data=myData) +
    stat_summary(aes(x=CategExplanatoryVar, y=BinaryResponseVar), fun="mean", geom="bar") +
    ylab("Proportion of Subjects at each Response Level within each group") +
    ggtitle("Informative Title Here")

Quantitative-Categorical (means by group)

Python
myData['QuantResponseVar'].groupby(myData['CategExplanatoryVar']).describe()

# Finding the average of a quantitative variable by a categorical variable
average_by_group = myData.groupby('CategExplanatoryVar')['QuantResponseVar'].mean()  # missing values are skipped by default

print(average_by_group)

# Finding the standard deviation of a quantitative variable by a categorical variable

std_dev_by_group = myData.groupby('CategExplanatoryVar')['QuantResponseVar'].std()  # missing values are skipped by default

print(std_dev_by_group)

# Finding the sample size by group

sample_size_by_group = myData.groupby('CategExplanatoryVar')['QuantResponseVar'].count()

print(sample_size_by_group)
SPSS

* numbers.
MEANS TABLES= QuantResponseVar by CategExplanatoryVar 
   /CELLS MEAN COUNT STDDEV. 

* visualization: use GUI point-and-click.
STATA

// numbers
bys CategExplanatoryVar: sum QuantResponseVar

// visualization
graph box QuantResponseVar, over(CategExplanatoryVar)
SAS

* numbers;
proc sort; by CategExplanatoryVar; 
proc means; var QuantResponseVar; 
   by CategExplanatoryVar;

* visualization;
proc gchart; vbar CategExplanatoryVar /discrete type=mean sumvar=QuantResponseVar;

R

# To find the average of a quantitative variable by a categorical variable:
by(myData$QuantResponseVar, myData$CategExplanatoryVar, mean, na.rm = TRUE)

# To find the standard deviation of a quantitative variable by a categorical variable:
by(myData$QuantResponseVar, myData$CategExplanatoryVar, sd, na.rm = TRUE)

# To find the sample size by group:
by(myData$QuantResponseVar, myData$CategExplanatoryVar, length)

Categorical-Quantitative (Plot)

Python
import matplotlib.pyplot as plt
import seaborn as sns

#Option 1 bar plot

sns.barplot(data=myData, x='CategExplanatoryVar', y='QuantResponseVar', estimator='mean') 
plt.ylabel("Mean of QuantResponseVar")
plt.title("Mean of QuantResponseVar by CategExplanatoryVar")
plt.show()  # finish the bar plot before starting the box plot

#Option 2 : Box plot

sns.boxplot(data=myData, x='CategExplanatoryVar', y='QuantResponseVar') 
plt.ylabel("QuantResponseVar")
plt.title("Descriptive Title Here")

SPSS
* visualization: use GUI point-and-click.
STATA
// Option 1: Boxplot
graph box QuantResponseVar, over(CategExplanatoryVar)
// Option 2: Bar Chart to show means
graph bar QuantResponseVar, over(CategExplanatoryVar)

SAS
proc gchart; vbar CategExplanatoryVar /discrete type=mean sumvar=QuantResponseVar;


R

# Option 1: Bar plot
ggplot(data=myData)+
    stat_summary(aes(x=CategExplanatoryVar, y=QuantResponseVar),
      fun=mean, geom="bar")

# Option 2: Boxplot
ggplot(data=myData)+
   geom_boxplot(aes(x=CategExplanatoryVar, y=QuantResponseVar))+
   ggtitle("Descriptive Title Here")

Quantitative-Quantitative (plot)

Python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=myData, x='QuantExplanatoryVar', y='QuantResponseVar')
sns.regplot(data=myData, x='QuantExplanatoryVar', y='QuantResponseVar', ci=None, line_kws={"color": "red"})
plt.xlabel('QuantExplanatoryVar')
plt.ylabel('QuantResponseVar')
plt.title('Scatter plot with Linear Regression')

#OR can use
sns.regplot(x="QuantExplanatoryVar", y="QuantResponseVar", fit_reg=False, data=myData)
plt.xlabel('Label for QuantExplanatoryVar')
plt.ylabel('Label for QuantResponseVar')
plt.title('Descriptive Title Here')
SPSS

* visualization.
GRAPH 
   /scatterplot(bivar)=QuantExplanatoryVar with QuantResponseVar. 
STATA

// visualization
twoway (scatter QuantResponseVar QuantExplanatoryVar) (lfit QuantResponseVar QuantExplanatoryVar) 
SAS

* visualization;
proc gplot; plot QuantResponseVar*QuantExplanatoryVar; 
R

ggplot(data=myData)+
   geom_point(aes(x=QuantExplanatoryVar, y=QuantResponseVar))+
   geom_smooth(aes(x=QuantExplanatoryVar, y=QuantResponseVar), method="lm")

Multivariate (bivariate, by subpopulation (third variable – categorical))

Categorical-Categorical (crosstabs) with third var

Python
import matplotlib.pyplot as plt
import seaborn as sns


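# factorplot was renamed catplot in newer seaborn releases
# CategResponseVar must be coded 0/1 so that bar heights are proportions
sns.catplot(x="CategExplanatoryVar", y="CategResponseVar", hue="CategThirdVar",
            data=myData, kind="bar", ci=None)
plt.xlabel('Label for CategExplanatoryVar')
plt.ylabel('Label for CategResponseVar')
plt.title('Descriptive Title Here')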
SPSS

* numbers.
CROSSTABS
   /TABLES=CategResponseVar BY CategExplanatoryVar BY CategThirdVar. 

* visualization: use GUI point-and-click. 
STATA

// numbers
bys CategThirdVar: tab CategResponseVar CategExplanatoryVar, row column cell 

// visualization
bys CategThirdVar: graph bar (mean) CategResponseVar, over(CategExplanatoryVar)
SAS

* numbers;
proc sort; by CategThirdVar; 
proc freq; tables CategResponseVar*CategExplanatoryVar; 
   by CategThirdVar;

* visualization;
proc gchart; vbar CategExplanatoryVar /discrete type=mean sumvar=CategResponseVar; 
   by CategThirdVar;
R


# visualization
ggplot(data=myData)+    
stat_summary(aes(x=CategExplanatoryVar, y=BinaryResponseVar), fun="mean", geom="bar")+
facet_grid(. ~ CategThirdVar)+
ggtitle("Descriptive Title Here")
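
The Python snippet in this section only draws a plot. For the corresponding counts, one possible approach is to pass two row variables to `pd.crosstab`, or to loop over the levels of the third variable. This is a sketch on invented data, with the response assumed to be coded 0/1.

```python
import pandas as pd

# Hypothetical toy data; response coded 0/1
myData = pd.DataFrame({
    "CategResponseVar":    [1, 0, 1, 1, 0, 0, 1, 0],
    "CategExplanatoryVar": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "CategThirdVar":       ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
})

# One stacked table: rows are (third variable, response), columns are explanatory
tab3 = pd.crosstab(
    [myData["CategThirdVar"], myData["CategResponseVar"]],
    myData["CategExplanatoryVar"],
)
print(tab3)

# Or a separate table of column proportions for each level of the third variable
for level, group in myData.groupby("CategThirdVar"):
    print("CategThirdVar =", level)
    print(pd.crosstab(group["CategResponseVar"],
                      group["CategExplanatoryVar"],
                      normalize="columns"))
```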

Categorical-Categorical (Plot) with third var

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Plot setup
plt.figure(figsize=(10, 6))
sns.barplot(
    data=myData,
    x='CategExplanatoryVar',
    y='BinaryResponseVar',
    hue='CategThirdVar',  # Creates separate bars for each level of CategThirdVar within the same plot
    estimator='mean',     # Uses mean of BinaryResponseVar for each category
    ci=None               # Remove if you want confidence intervals
)

# Customize plot
plt.title("Descriptive Title Here")
plt.xlabel("Explanatory Variable")
plt.ylabel("Mean of Binary Response Variable")
plt.legend(title="Third Variable")

STATA
// visualization to show frequencies
ssc install catplot
catplot CategResponseVar CategExplanatoryVar

// visualization to show percents from overall total
catplot CategResponseVar CategExplanatoryVar, percent

// visualization to show percents within group – best to use when
// response variable is more than 2 levels
graph hbar (percent), over(CategResponseVar) over(CategExplanatoryVar) percent stack asyvars

// visualization to show percents within group – can only be used
// when response variable has 2 levels.
// Requires data management that has response variable coded as a binary 0/1
graph bar BinaryCategoricalResponseVar, over(CategExplanatoryVar)

SAS
proc GCHART; vbar CategExplanatoryVar / subgroup = CategResponseVar;
R


# visualization - Assumes response variable is coded as 0/1

ggplot(data=myData)+
stat_summary(aes(x=CategExplanatoryVar, y=BinaryResponseVar), fun="mean", geom="bar")+
facet_grid(. ~ CategThirdVar)+
ggtitle("Descriptive Title Here")

Quantitative-Categorical (means by group) with third var

Python
import pandas as pd 

table = myData.groupby(['CategExplanatoryVar', 'CategThirdVar'])['QuantResponseVar'].mean().reset_index()

print(table)
SPSS

MEANS TABLES= QuantResponseVar BY CategExplanatoryVar BY CategThirdVar
   /CELLS MEAN COUNT STDDEV. 
STATA

bys CategExplanatoryVar CategThirdVar: su QuantResponseVar 
SAS

proc sort; by CategExplanatoryVar CategThirdVar; 
proc means; var QuantResponseVar; 
   by CategExplanatoryVar CategThirdVar; 
R

ftable(by(myData$QuantResponseVar, list(myData$CategExplanatoryVar, myData$CategThirdVar), mean, na.rm = TRUE))

Categorical-Quantitative (Plot) by Third Variable

Python
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(x="CategExplanatoryVar", y="QuantResponseVar", hue="CategThirdVar", data=myData, ci=None)
plt.xlabel('Label for CategExplanatoryVar')
plt.ylabel('Label for QuantResponseVar')
plt.title('Descriptive Title Here')
SPSS
* visualization: use GUI point-and-click.
STATA
graph box QuantResponseVar, over(CategExplanatoryVar) over(CategThirdVar)
SAS
proc sort; by CategExplanatoryVar CategThirdVar;
Proc SGPLOT; vbar ExplVar /response=RespVar group=ThirdVar groupdisplay=cluster stat=mean;
xaxis label="Description of Category Variable";
keylegend / title="Description of Group Variable"; run;
R
ggplot(data=myData)+
   geom_boxplot(aes(x=CategExplanatoryVar, y=QuantResponseVar))+
   facet_grid(.~CategThirdVar)+
   ggtitle("Descriptive Title Here")

Quantitative-Quantitative (scatterplot) with third var

Python
import matplotlib.pyplot as plt
import seaborn as sns

# Create a FacetGrid to facet by the 'CategThirdVar' column

g = sns.FacetGrid(myData, col="CategThirdVar", height=4, aspect=1.2)

# Map a scatterplot and a fitted regression line to each facet

g.map_dataframe(sns.scatterplot, x="QuantExplanatoryVar", y="QuantResponseVar", color="blue")
g.map_dataframe(sns.regplot, x="QuantExplanatoryVar", y="QuantResponseVar", scatter=False, ci=None, color="orange", line_kws={"linestyle": "--"})

SPSS

* numbers.
SORT CASES  BY CategThirdVar. 
SPLIT FILE LAYERED BY CategThirdVar. 
CORRELATIONS 
  /VARIABLES=QuantResponseVar QuantExplanatoryVar 
  /PRINT=TWOTAIL NOSIG 
  /MISSING=PAIRWISE.
SPLIT FILE OFF.

* visualization.
SORT CASES  BY CategThirdVar. 
SPLIT FILE LAYERED BY CategThirdVar. 
GRAPH
  /SCATTERPLOT(BIVAR)=QuantExplanatoryVar WITH QuantResponseVar
  /MISSING=LISTWISE.
SPLIT FILE OFF.
STATA

// visualization
twoway (scatter QuantResponseVar QuantExplanatoryVar) (lfit QuantResponseVar QuantExplanatoryVar), by(CategThirdVar)
SAS

* visualization;
proc sort; by CategThirdVar;
proc gplot; plot QuantResponseVar*QuantExplanatoryVar;
   by CategThirdVar; 
R

ggplot(data=myData)+
geom_point(aes(x=QuantExplanatoryVar, y=QuantResponseVar))+
geom_smooth(aes(x=QuantExplanatoryVar, y=QuantResponseVar), method="lm")+
facet_grid(. ~ CategThirdVar)

Hypothesis Testing

Categorical-Categorical (chi-square)

Python
import pandas as pd
import scipy.stats as stats
ct1 = pd.crosstab(myData['CategResponseVar'], myData['CategExplanatoryVar'])
print('chi-square value, p value, degrees of freedom, expected counts')
cs1 = stats.chi2_contingency(ct1)
print(cs1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# Post-hoc
# for each Chi Sq pair data subset
# (code below compares group 1 to group 2)
recode1 = {1: 1, 2: 2}
myData['COMP1v2'] = myData['CategExplanatoryVar'].map(recode1)
ct1 = pd.crosstab(myData['CategResponseVar'], myData['COMP1v2'])
cs1 = stats.chi2_contingency(ct1)
print(cs1)
SPSS

CROSSTABS
   /TABLES= CategResponseVar by CategExplanatoryVar
   /STATISTICS=CHISQ. 
STATA

tab CategResponseVar CategExplanatoryVar, chi2 row col 

*If post-hoc necessary look at two levels of explanatory variable at a time*

tab CategResponseVar CategExplanatoryVar if (CategExplanatoryVar=="GroupA" | CategExplanatoryVar=="GroupB") , chi2 row col

tab CategResponseVar CategExplanatoryVar if (CategExplanatoryVar=="GroupA" | CategExplanatoryVar=="GroupC") , chi2 row col
SAS

proc freq; tables CategResponseVar*CategExplanatoryVar/ chisq; 
R

myChi <- chisq.test(myData$CategResponseVar, myData$CategExplanatoryVar) 
myChi 
myChi$observed # for actual, observed cell counts 
prop.table(myChi$observed, 2) # for column percentages 
prop.table(myChi$observed, 1) # for row percentages


## Post-hoc test of which explanatory levels vary.
source("https://raw.githubusercontent.com/PassionDrivenStatistics/R/master/ChiSquarePostHoc.R")
myChi<-chisq.test(myData$CategResponseVar, myData$CategExplanatoryVar)
Observed_table<-myChi$observed
chisq.post.hoc(Observed_table, popsInRows=FALSE, control="bonferroni")

## Or check Pearson Residuals
myChi$residuals
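
The Python post-hoc code in this section compares one pair of groups at a time. As a sketch of how that idea might be automated, the loop below runs a chi-square test for every pair of explanatory levels and compares each p-value to a Bonferroni-adjusted threshold. The data, the group labels A/B/C, and the 0.05 starting alpha are all assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd
import scipy.stats as stats

# Hypothetical toy data: three explanatory groups and a 0/1 response
myData = pd.DataFrame({
    "CategExplanatoryVar": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "CategResponseVar": [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8,
})

levels = sorted(myData["CategExplanatoryVar"].unique())
pairs = list(combinations(levels, 2))

# Bonferroni adjustment: divide alpha by the number of pairwise comparisons
adjusted_alpha = 0.05 / len(pairs)

posthoc_p = {}
for a, b in pairs:
    # Keep only the two groups being compared
    sub = myData[myData["CategExplanatoryVar"].isin([a, b])]
    ct = pd.crosstab(sub["CategResponseVar"], sub["CategExplanatoryVar"])
    chi2, p, dof, expected = stats.chi2_contingency(ct)
    posthoc_p[(a, b)] = p
    print(f"{a} vs {b}: chi2={chi2:.2f}, p={p:.4f}, "
          f"significant at adjusted alpha: {p < adjusted_alpha}")
```

With real data, any pair whose subset lacks one of the response levels would need to be handled separately, since the test cannot be computed on a table with an empty row or column.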

Quantitative-Categorical (anova)

Python

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

model1= smf.ols(formula='QuantResponseVar~ C(CategExplanatoryVar)', data=myData)
results1=model1.fit()
print (results1.summary())
# Post-hoc test
sub1=myData[['QuantResponseVar','CategExplanatoryVar']].dropna()
mc1=multi.MultiComparison(sub1['QuantResponseVar'],sub1['CategExplanatoryVar'])
res1 = mc1.tukeyhsd()
print(res1.summary())

SPSS

UNIANOVA QuantResponseVar BY CategExplanatoryVar.

* for post-hoc test add the following options to the UNIANOVA command.
UNIANOVA QuantResponseVar BY CategExplanatoryVar
   /POSTHOC=CategExplanatoryVar (TUKEY)
   /PRINT=ETASQ DESCRIPTIVE.
STATA

oneway QuantResponseVar CategExplanatoryVar, tabulate 

// for post-hoc test add the `sidak` option to oneway command
oneway QuantResponseVar CategExplanatoryVar, tabulate sidak
SAS

proc anova; class CategExplanatoryVar; 
model QuantResponseVar = CategExplanatoryVar; means CategExplanatoryVar; 

* for post-hoc test add the `duncan` option to proc anova command;
proc anova; class CategExplanatoryVar; 
model QuantResponseVar = CategExplanatoryVar; means CategExplanatoryVar /duncan; 
R

myAnovaResults <- aov(QuantResponseVar ~ CategExplanatoryVar, data = myData) 
summary(myAnovaResults)

# for post-hoc test
myAnovaResults <- aov(QuantResponseVar ~ CategExplanatoryVar, data = myData) 
TukeyHSD(myAnovaResults)

Quantitative-Quantitative (pearson correlation)

Python

import scipy.stats

sub1 = myData[['QuantResponseVar', 'QuantExplanatoryVar']].dropna()
print('Pearson correlation coefficient and p-value')
print(scipy.stats.pearsonr(sub1['QuantResponseVar'], sub1['QuantExplanatoryVar']))
SPSS

CORRELATIONS
   /VARIABLES= QuantResponseVar QuantExplanatoryVar
   /STATISTICS DESCRIPTIVES. 
STATA

corr QuantResponseVar QuantExplanatoryVar

//OR
pwcorr QuantResponseVar QuantExplanatoryVar, sig
SAS

proc corr; var QuantResponseVar QuantExplanatoryVar;
R

cor.test(myData$QuantResponseVar, myData$QuantExplanatoryVar)

Moderation by a third variable

Categorical-Categorical (chi-square)

Python

import pandas as pd
import scipy.stats as stats

# Function to apply to each group
def chi_sq_test(group):
    contingency_table = pd.crosstab(group['CategResponseVar'], group['CategExplanatoryVar'])
    chi2_result = stats.chi2_contingency(contingency_table)

    observed = contingency_table  # observed counts (chi2_result[3] holds the expected counts)
    proportions = observed / observed.sum(axis=0)  # column proportions

    return {'chi2_result': chi2_result, 'observed': observed, 'proportions': proportions}

print(myData.groupby('CategThirdVar').apply(chi_sq_test))
SPSS

CROSSTABS /TABLES = CategResponseVar by CategExplanatoryVar by CategThirdVar /CELLS = COUNT ROW /STATISTICS = CHISQ.
STATA

bys CategThirdVar: tab CategResponseVar CategExplanatoryVar, chi2 row 
SAS

proc sort; by CategThirdVar; 
proc freq; tables CategResponseVar*CategExplanatoryVar/chisq; 
   by CategThirdVar; 
R

by(myData, 
myData$CategThirdVar, 
function(x) list( 
   chisq.test(x$CategResponseVar, x$CategExplanatoryVar), 
   chisq.test(x$CategResponseVar, x$CategExplanatoryVar)$observed, 
   prop.table(chisq.test(x$CategResponseVar, x$CategExplanatoryVar)$observed, 2))) 
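
As a minimal sketch of what "chi-square by a third variable" means, the snippet below computes the Pearson chi-square statistic separately for one table of counts per level of the third variable. The counts, level names, and the helper `chi_square_statistic` are all invented for illustration. No continuity correction is applied, so for 2x2 tables the value can differ from packages that apply Yates' correction by default.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: one 2x2 table per level of the third variable
tables_by_level = {
    "level_A": [[30, 10], [10, 30]],
    "level_B": [[20, 20], [20, 20]],
}

chi2_by_level = {
    level: chi_square_statistic(table)
    for level, table in tables_by_level.items()
}
print(chi2_by_level)
```

In this made-up example the association is present in `level_A` and absent in `level_B`, which is the kind of pattern a moderation analysis is meant to reveal.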

Quantitative-Categorical (anova)
Note: the SPSS, STATA, and SAS snippets below have the post-hoc options built in

Python

import pandas as pd 
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Group data by 'CategThirdVar' and apply ANOVA within each group

results = myData.groupby('CategThirdVar').apply(
    lambda x: {
        'anova_model': ols('QuantResponseVar ~ CategExplanatoryVar', data=x).fit(),
        'summary': sm.stats.anova_lm(ols('QuantResponseVar ~ CategExplanatoryVar', data=x).fit(), typ=2) } )

# To access each group's result, you can loop through or use the results variable directly.
# Example: print results for each group
for group, result in results.items():
    print(f"\nGroup: {group}")
    print("ANOVA Summary:\n", result['summary'])
SPSS

SORT CASES BY CategThirdVar.
SPLIT FILE LAYERED BY CategThirdVar.

ONEWAY QuantResponseVar BY CategExplanatoryVar
/STATISTICS DESCRIPTIVES
/POSTHOC = BONFERRONI ALPHA (0.05).

SPLIT FILE OFF.
STATA

bys CategThirdVar: oneway QuantResponseVar CategExplanatoryVar, tab sidak
SAS

proc sort; by CategThirdVar;
proc anova; class CategExplanatoryVar; 
   model QuantResponseVar=CategExplanatoryVar; 
   means CategExplanatoryVar /duncan; 
   by CategThirdVar; 
R

by(myData, 
	myData$CategThirdVar, 
	function(x) list(aov(QuantResponseVar ~ CategExplanatoryVar, data = x), summary(aov( QuantResponseVar ~ CategExplanatoryVar, data = x))))

Quantitative-Quantitative (Pearson correlation)

Python
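from scipy.stats import pearsonr

# Pearson correlation within each level of CategThirdVar
sub1 = myData[['QuantResponseVar', 'QuantExplanatoryVar', 'CategThirdVar']].dropna()
for level, group in sub1.groupby('CategThirdVar'):
    correlation, p_value = pearsonr(group['QuantResponseVar'], group['QuantExplanatoryVar'])
    print(f"Group: {level}")
    print(f"Pearson Correlation Coefficient: {correlation}")
    print(f"P-value: {p_value}")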
SPSS

SORT CASES BY CategThirdVar.
SPLIT FILE LAYERED BY CategThirdVar.

CORRELATIONS
   /VARIABLES= QuantResponseVar QuantExplanatoryVar
   /STATISTICS DESCRIPTIVES.

SPLIT FILE OFF. 
STATA

bys CategThirdVar: corr QuantResponseVar QuantExplanatoryVar

//OR
bys CategThirdVar: pwcorr QuantResponseVar QuantExplanatoryVar, sig
SAS

proc sort; by CategThirdVar; 
proc corr; var QuantResponseVar QuantExplanatoryVar; 
   by CategThirdVar; 
R

by(myData,
	myData$CategThirdVar, 
	function(x) cor.test(x$QuantResponseVar, x$QuantExplanatoryVar))

Regression
Simple

Python
import statsmodels.formula.api as smf 
import pandas as pd
import numpy as np

#If the Explanatory Variable is Quantitative
my_lm_quant = smf.ols('QuantResponseVar ~ QuantExplanatoryVar', data=myData).fit()

print(my_lm_quant.summary())

#If the explanatory variable is Categorical
# Convert categorical variable to 'category' type if necessary
myData['CategExplanatoryVar'] = myData['CategExplanatoryVar'].astype('category')

my_lm_categ = smf.ols('QuantResponseVar ~ CategExplanatoryVar', data=myData).fit()

print(my_lm_categ.summary())
SPSS

* note if explanatory var is categorical, make sure that the variable is type `nominal`.
REGRESSION
	/DEPENDENT QuantResponseVar
	/METHOD ENTER ExplanatoryVar.
STATA

//if explanatory var is quantitative
reg QuantResponseVar c.QuantExplanatoryVar

//if explanatory var is categorical
reg QuantResponseVar i.CategExplanatoryVar

SAS

* if explanatory var is quantitative;
proc glm; 
	model QuantResponseVar=QuantExplanatoryVar  /solution;

* if explanatory var is categorical;
proc glm; class CategExplanatoryVar; 
	model QuantResponseVar=CategExplanatoryVar /solution;
R

# if explanatory var is quantitative
my.lm <- lm(QuantResponseVar ~ QuantExplanatoryVar, data = myData) 
summary(my.lm)

# if explanatory var is categorical
my.lm <- lm(QuantResponseVar ~ factor(CategExplanatoryVar), data = myData) 
summary(my.lm)
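
For a single quantitative explanatory variable, the intercept and slope that the commands above estimate have a closed form. The sketch below computes them in plain Python on invented data; the helper name `fit_simple_ols` is hypothetical, and standard errors and p-values are left to the packages above.

```python
# Toy data, invented for illustration only
quant_explanatory = [1.0, 2.0, 3.0, 4.0, 5.0]
quant_response = [2.0, 4.0, 5.0, 4.0, 5.0]

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_simple_ols(quant_explanatory, quant_response)
print(round(intercept, 4), round(slope, 4))  # 2.2 0.6
```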

Logistic

Python
import statsmodels.formula.api as smf 
import pandas as pd
import numpy as np

my_logreg = smf.logit('BinaryResponseVar ~ ExplanatoryVar + ExplanatoryVar2', data=myData).fit()

print(my_logreg.summary())

odds_ratios = np.exp(my_logreg.params)
print(odds_ratios)

#confidence intervals
conf = my_logreg.conf_int()

#Confidence intervals for the odds ratios
conf_odds_ratios = np.exp(conf)
print(conf_odds_ratios)

# Predicted probabilities for each observation
predicted_probabilities = my_logreg.predict(myData)
print("Predicted Probabilities:\n", predicted_probabilities)
SPSS

* note if explanatory var is categorical, make sure that the variable is type `nominal`.
LOGISTIC REGRESSION BinaryResponseVar with ExplanatoryVar ThirdVar1 ThirdVar2. 
STATA

// for all quantitative predictors, add `c.` before the variable name (e.g. c.height)
// for all categorical predictors, add `i.` before the variable name (e.g. i.race)

logistic BinaryResponseVar ExplanatoryVar ThirdVar1 ThirdVar2
SAS

* list all categorical variables in the model under the class subcommand (e.g. CategThirdVar);

proc logistic; 
	class BinaryResponseVar(ref="referenceGroup") CategThirdVar; 
	model BinaryResponseVar = ExplanatoryVar CategThirdVar QuantThirdVar;
R

# if a categorical variable is encoded as numeric, wrap it with the factor() function (e.g. factor(ExplanatoryVar) )

my.logreg <- glm(BinaryResponseVar ~ ExplanatoryVar, data = myData, family = "binomial") 
summary(my.logreg)  # for p-values 
exp(my.logreg$coefficients)  # for odds ratios 
exp(confint(my.logreg))  # for confidence intervals on the odds ratios

# If you have many explanatory variables, you can just continue to add them in
my.logreg <- glm(BinaryResponseVar ~ ExplanatoryVar + ExplanatoryVar2, data = myData, family = "binomial")
summary(my.logreg)  # for p-values
exp(my.logreg$coefficients)  # for odds ratios
exp(confint(my.logreg))  # for confidence intervals on the odds ratios
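
The logistic snippets above exponentiate coefficients to get odds ratios. As a small sketch of that conversion, the plain-Python code below turns a log-odds coefficient and its confidence limits into an odds ratio with its interval, and turns a log-odds value into a predicted probability. All numbers are made up, not output from a fitted model, and the helper `predicted_probability` is hypothetical.

```python
import math

# Hypothetical fitted values, not output from any real model
coefficient = 0.7               # change in log-odds per one-unit increase
ci_lower, ci_upper = 0.2, 1.2   # 95% confidence limits on the log-odds scale

# Exponentiating moves from the log-odds scale to the odds-ratio scale
odds_ratio = math.exp(coefficient)
odds_ratio_ci = (math.exp(ci_lower), math.exp(ci_upper))

def predicted_probability(intercept, coef, x):
    """Inverse-logit of the linear predictor intercept + coef * x."""
    log_odds = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-log_odds))

print(round(odds_ratio, 3), [round(v, 3) for v in odds_ratio_ci])
print(predicted_probability(0.0, coefficient, 0.0))
```

An odds ratio above 1 whose interval excludes 1 corresponds to a positive coefficient whose interval excludes 0.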

Multiple regression

Python
import statsmodels.formula.api as smf

my_lm = smf.ols('QuantResponseVar ~ QuantExplanatoryVar + CategExtraVar', data = myData).fit()

print(my_lm.summary())
SPSS

* note if explanatory var is categorical, make sure that the variable is type `nominal`.
REGRESSION
   /DEPENDENT QuantResponseVar
   /METHOD ENTER ExplanatoryVar ExtraVar1 ExtraVar2.
STATA

//if a predictor var is quantitative, add `c.`; if a predictor var is categorical, add `i.`.

reg QuantResponseVar i.CategExplanatoryVar i.CategExtraVar1 c.QuantExtraVar2

SAS

* if a predictor var is categorical, add to `class`;
proc glm; 
	class CategExplanatoryVar; 
	model QuantResponseVar=CategExplanatoryVar ExtraVar1 /solution;
R

# if a predictor var is categorical, wrap the var with factor() (e.g. factor(CategExtraVar) )

my.lm <- lm(QuantResponseVar ~ QuantExplanatoryVar + factor(CategExtraVar), data = myData) 
summary(my.lm)

Regression with Interaction Term

Incorporating interaction term when response is Quantitative (Multiple Linear Regression)

Python
import statsmodels.formula.api as smf

my_lm = smf.ols('QuantResponseVar ~ ExplanatoryVar + CategoricalModeratingVar + ExplanatoryVar:CategoricalModeratingVar', data=myData).fit()

print(my_lm.summary())
SPSS
* note if explanatory var is categorical, make sure that the variable is type `nominal`.
REGRESSION
/DEPENDENT QuantResponseVar
/METHOD ENTER ExplanatoryVar ExtraVar1 ExtraVar2.
STATA
//to incorporate a moderator (statistical interaction term) in your model add `#` between the two terms
// add `i.` for categorical terms in the interaction and `c.` for quantitative terms in the interaction.
reg QuantResponseVar QuantExplanatoryVar i.CategoricalModeratingVar i.CategoricalModeratingVar#c.QuantExplanatoryVar 
SAS
* if a predictor var is categorical, add to `class`;
proc glm;
class CategoricalModeratingVar;
model QuantResponseVar=ExplanatoryVar|CategoricalModeratingVar /solution;
R
# to incorporate a statistical interaction between two of your explanatory variables
my.lm <- lm(QuantResponseVar ~ ExplanatoryVar + CategoricalModeratingVar +
ExplanatoryVar*CategoricalModeratingVar, data = myData)
summary(my.lm)
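
To read the output of an interaction model, it can help to see how the interaction coefficient changes the slope across levels of the moderator. The sketch below assumes a 0/1 moderator and uses invented coefficient values (the names `b_intercept`, `b_explanatory`, `b_moderator`, `b_interaction` are hypothetical, not output from any package).

```python
# Hypothetical coefficients from a model of the form
# QuantResponseVar ~ ExplanatoryVar + Moderator + ExplanatoryVar:Moderator
# where Moderator is coded 0 (reference group) or 1
b_intercept = 1.0
b_explanatory = 2.0
b_moderator = 0.5
b_interaction = -1.5

def predicted_response(explanatory, moderator):
    """Predicted value from the interaction model for one observation."""
    return (b_intercept
            + b_explanatory * explanatory
            + b_moderator * moderator
            + b_interaction * explanatory * moderator)

# Slope of ExplanatoryVar within each moderator group:
# the change in prediction for a one-unit increase in ExplanatoryVar
slope_reference = predicted_response(1, 0) - predicted_response(0, 0)
slope_other = predicted_response(1, 1) - predicted_response(0, 1)
print(slope_reference, slope_other)  # 2.0 0.5
```

The slope in the reference group equals the main-effect coefficient, and the slope in the other group equals the main effect plus the interaction coefficient.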

Incorporating interaction term when response is Categorical (Logistic)

Python

#pip install statsmodels

import statsmodels.formula.api as smf

my_logreg = smf.logit('BinaryResponseVar ~ ExplanatoryVar + CategoricalModeratingVar + ExplanatoryVar:CategoricalModeratingVar', data=myData).fit()

print(my_logreg.summary())

SPSS
* note if explanatory var is categorical, make sure that the variable is type `nominal`.
LOGISTIC REGRESSION BinaryResponseVar with ExplanatoryVar ThirdVar1 ThirdVar2.
STATA
// for all categorical predictors, add `i.` before the variable name (e.g. i.race) and `c.` before quantitative variables
logistic BinaryResponseVar QuantExplanatoryVar i.CategoricalModeratingVar i.CategoricalModeratingVar#c.QuantExplanatoryVar
SAS
* list all categorical variables in the model under the class subcommand (e.g. CategThirdVar);
proc logistic;
class BinaryResponseVar(ref="referenceGroup") CategoricalModeratingVar;
model BinaryResponseVar = ExplanatoryVar|CategoricalModeratingVar;
R
# if a categorical variable is encoded as numeric, wrap it with the factor() function (e.g. factor(ExplanatoryVar3) )
my.logreg <- glm(BinaryResponseVar ~ ExplanatoryVar + CategoricalModeratingVar + ExplanatoryVar*CategoricalModeratingVar, data = myData, family = "binomial")
summary(my.logreg) # for p-values
exp(my.logreg$coefficients) # for odds ratios
exp(confint(my.logreg)) # for confidence intervals on the odds ratios