This page explains how to use my dynamic Bayesian latent variable model to estimate national public opinion on a certain topic across multiples countries and points in time.
This model is suitable when you want to estimate aggregate (i.e., national) public opinion across many countries and many time units (typically years) and where existing survey data are fragmented across many different survey questions with numerous gaps in each national time-series.
I first developed this model to measure democratic support (AKA democratic mood) using existing survey data from the World Values Survey, Global Barometer Studies, and other cross-national public opinion projects. This work has resulted in several publications. It has also been applied to estimate European opinion about immigration.
The statistical details of the model are described in this article. Here I focus on how you might go about actually running this model, and obtaining estimates, using R and Stan.
You should gather nationally aggregated survey data on your topic of interest and organise it in “long” format, where each row in the dataset indicates a unique country-year-survey item combination. The output below illustrates using the first few rows of the support for democracy dataset (available here)
## Country Year Item Sample Response Project
## 1 Afghanistan 2005 Army_asb 874 0.3306636 asb
## 2 Afghanistan 2005 EvDemoc_asb 874 0.7860412 asb
## 3 Afghanistan 2005 Strong_asb 874 0.2848970 asb
## 4 Albania 1998 Church_wvs 999 0.8870000 wvs
## 5 Albania 1998 EvDemoc_wvs 999 0.9430000 wvs
## 6 Albania 2002 Army_wvs 1000 0.8030000 wvs
The dataset has a column indicating the country unit
(Country
). There are also indicators for survey year
(Year
) and survey question (Item
). You should
come up with a list of Item
codes that uniquely identify
each unique survey item-survey project combination in your data. The
data variables are Sample
and Response
: the
latter indicates the (weighted) percentage of the national sample who
were asked the particular item, and offered a pro-democratic response;
the former shows the number of respondents who were asked that question.
The dataset also includes a variable showing the opinion project which
fielded the survey in question; this is not required as it is implicit
in the item coding.
Note, if a particular survey item is fielded by the same opinion project more than once in a particular country and year, then add these results together. If there are many such instances, consider using more finely-grained unit of time, such as quarter.
The model is coded in Stan’s statistical language. You do not need to
interact with the Stan code unless you would like to edit it somehow.
This Stan model is called from R using the cmdstanr
package, although the Rstan
package can also be used
(albeit with slightly different code). The R code is here and the
Stan code is here.
You should install Stan and cmdstanr
, if you haven’t
already. See https://mc-stan.org/cmdstanr/.
The R code requires the following R packages, which should be installed if they are not already.
# libraries
library(arm)
library(countrycode)
library(dplyr)
library(tidyr)
library(loo)
library(ggplot2)
library(bayesplot)
library(cmdstanr)
Next, we set some cmdstanr
options, set the local folder
(this will not work outside of RStudio) and load the survey data
# options
options("cmdstanr_verbose" = TRUE)
stanc_options = list("O1")
options(mc.cores = parallel::detectCores())
# folders
WD = rstudioapi::getActiveDocumentContext()$path
setwd(dirname(WD))
print( getwd() )
# read support for dem data
supdem = read.csv("Support for democracy raw 0522.csv")
We perform some edits: sorting the dataset, checking that there are
no 0% values for the Response
variable (which should be
removed), and creating the response count variable RespN
that will be required for the beta-binomial specification. (The data can
also be collected this way if desired, as long as weights are used to
create the response counts).
An unique item-country indicator, ItemCnt
, is created –
this is necessary to estimate the model. The initial year,
year0
, variable should also be set. Since the support for
democracy data starts in 1988 (for certain countries), we set this to
the year immediately before, 1987.
# remove NAs
supdem = supdem[!supdem$Response==0, ]
# order
supdem = arrange(supdem, Country, Year)
# create vector of response counts if they do not exist
supdem$RespN = round(supdem$Response*supdem$Sample)
# set first year
year0 = 1987 # = year before first year of available survey data
supdem = supdem[supdem$Year > year0,]
# create item by country indicators
supdem = unite(supdem, ItemCnt, c(Item, Country), sep = "_", remove = FALSE)
sddem1 = supdem
The dynamic Bayesian latent variable model adjusts for the bias induced by the usage of particular items in particular countries, i.e., item non-equivalence. It is therefore not desirable to include items which are only used in one country. No such items were included in the support for democracy dataset. A cautious approach would be also be to drop items which are fielded only once, even if in multiple countries. However, it is not always necessary to do so if information is available (e.g., from other items fielded in the same year) to identify the item-country parameters.
It may also be desirable to estimate mood only for countries which
have been surveyed a certain number of times. The following code
identifies and drops countries for which only one year of survey
measures are available. This can be edited by changing the number in the
cnt.obs.years > 1
statement.
# identify countries with few years of data
cnt.obs.years = rowSums(table(sddem1$Country, sddem1$Year) > 0)
sort(cnt.obs.years)
# run the next line to drop countries with less than 2 years of data
sddem2 = sddem1[sddem1$Country %in% levels(sddem1$Country)[cnt.obs.years > 1], ]
We then extract data vectors which will be used in the Stan model.
Adjust the year in the n.yrs = 2020-year0
statement to
specify the most recent year for which you require mood estimates.
## Prepare data for stan
# factorise
sddem2$Country = as.factor(as.character(sddem2$Country))
sddem2$Item = as.factor(as.character(sddem2$Item))
sddem2$ItemCnt = as.factor(as.character(sddem2$ItemCnt))
sddem2$Project = as.factor(as.character(sddem2$Project))
sddem2$Year = sddem2$Year-year0
# identify first year to estimate for each country
first.yr.m = by(sddem2$Year, sddem2$Country, min, simplify=TRUE)
cnt_start_yr = as.numeric(first.yr.m)
# extract data
n.items = length(unique(sddem2$Item))
n.cntrys = length(unique(sddem2$Country))
n.yrs = 2020-year0 # estimates up to 2020
n.resp = dim(sddem2)[1]
n.itm.cnt = length(unique(sddem2$ItemCnt))
cntrys = as.numeric(factor(sddem2$Country))
cnt.names = levels(sddem2$Country)
items = as.numeric(factor(sddem2$Item))
yrs = sddem2$Year
itm.cnts = as.numeric(factor(sddem2$ItemCnt))
mean.resp.log = logit(mean(sddem2$Response))
Create vectors mapping countries and years to estimates, and estimates to observed aggregate opinions. Also set up vectors indicating the number of replicates of each item and the length of each national time-series
# create item-country length indicator for items
item.ind.kp = rep(0, length(levels(sddem2$ItemCnt)))
for(i in 1:length(levels(sddem2$Item))) {
item.ind.kp[grepl(levels(sddem2$Item)[i], levels(sddem2$ItemCnt))] = i
}
item.ind.len = sapply(lapply(levels(sddem2$Item), function(x) grep(x, levels(sddem2$ItemCnt))), length)
# length of each estimated mood series for ragged array models
len_theta_ts = rep(0, n.cntrys)
for(i in 1:n.cntrys){
len_theta_ts[i] = length(cnt_start_yr[i]:n.yrs)
}
theta_n = sum(len_theta_ts)
# create r-length cntry vector
cntrys.r = rep(0, theta_n)
pos = 1
for(i in 1:n.cntrys) {
cntrys.r[pos:(len_theta_ts[i]+pos-1)] = (rep(i, len_theta_ts[i]))
pos = pos + len_theta_ts[i]
}
# create r-length year vector
year.r = rep(0, theta_n)
pos = 1
for(i in 1:n.cntrys) {
year.r[pos:(len_theta_ts[i]+pos-1)] = cnt_start_yr[i]:n.yrs
pos = pos + len_theta_ts[i]
}
# create R-vector mapping R support estimates to N opinions
n.map = data.frame(Obs=1:n.resp, Cntry=cntrys, Yr=yrs)
r.map = data.frame(Est=1:theta_n, Cntry=cntrys.r, Yr=year.r)
n.r.merg = merge(n.map, r.map, by=c("Cntry", "Yr"), all.x=TRUE)
n.r.merg = n.r.merg[order(n.r.merg$Obs),]
# create indicator vector for estimate start positions
est_pos = rep(1, n.cntrys)
est_pos[1] = 1
for(i in 2:n.cntrys) {
est_pos[i] = est_pos[i-1] + len_theta_ts[i-1]
}
Assemble the bits of data into the list format which is preferred by
cmdstanr
, checking that there are no missing values and
other errors:
# specify data for stan
dat.1 = list(N=n.resp, K=n.items, J=n.cntrys, P=n.itm.cnt, R=theta_n, jj=cntrys, pp=itm.cnts,
kk=items, rr=n.r.merg$Est, it_len=item.ind.len, est_pos=est_pos, len_theta_ts=len_theta_ts,
x=sddem2$RespN, samp=sddem2$Sample, mn_resp_log=mean.resp.log)
sapply(dat.1, summary)
Before fitting the model, we need to specify the parameters which are
to be tracked and saved in the posterior draws. The names of these
correspond with the notation used in the model description in Claassen (2019). The
latent mood estimates, which are probably the parameters of most
interest, are labeled theta
.
# parameters to save
pars.1 = c("Sigma","Omega","sigma_delta","sigma_theta","phi","mu_lambda","lambda","gamm","delta",
"theta","x_pred","log_lik")
The code specifies that 4 chains are to be run with 1000 iterations each, the first 500 of which are used to “warm up” the MCMC algorithms. This should be sufficient for convergence. However, I would suggest testing first with a small number of iterations, e.g., 10, because 1000 iterations may take 15 minutes or more (depending on your data and computer hardware).
# iterations for MCMC simulation
n.iter = 1000
n.warm = 500
n.samp = n.iter - n.warm
n.chn = 4
We then compile and run the model, using the lists of data and parameters we have just specified, as well as the Stan model (which is provided in a separate file).
The control parameters, adapt_delta
and
max_treedepth
are set higher than their default values of
0.80 and 10. You can adjust adapt_delta
: lower to speed up
the sampling; higher if divergences are reported. See here
for more detail.
# compile model
stan.mod = cmdstan_model('stan.mod6.7.stan')
# Stan fit
fit.mod= stan.mod$sample(
data = dat.1,
chains = n.chn,
init = 1,
parallel_chains = n.chn,
iter_warmup = n.warm,
iter_sampling = n.samp,
refresh = round(n.iter/20, 0),
adapt_delta = 0.90,
max_treedepth = 13,
save_warmup = FALSE
)
It is necessary to examine whether the MCMC simulation has converged and therefore whether the estimates of mood are reliable. The following code will report some basic diagnostics, a summary of the main parameters, a list of parameters with the highest (i.e., most concerning) Rhat values, and a list of the parameters with the lowest (i.e., most concerning) effective number of samples. Further guidance on how to interpret these diagnostics can be sought in the various vignettes and tutorials on the Stan website.
# Examine model fit
res = fit.mod
res$cmdstan_diagnose()
res.tab = res$print(pars.1, max_rows=80, digits=3)
sum = res$summary(pars.1)
print(sum[order(sum$rhat, decreasing=TRUE), ], n=50)
res_neff_ratio = neff_ratio(res)
res_neff_ratio[order(res_neff_ratio, decreasing=FALSE)][1:50]
A traceplot is also very helpful for diagnostic purposes. See https://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html for details on this and other “visual MCMC diagnostics”.
# traceplot
tp.pars = c("Sigma[1,1]","Sigma[2,2]","Omega[1,2]","sigma_theta","sigma_delta",
"phi","lambda[12]","delta[23]","gamm[12]","theta[132]","theta[589]")
tp = bayesplot::mcmc_trace(res$draws(tp.pars), size=0.3, np=nuts_params(res))
tp
If the model has converged, you are now able to extract and save your
country-year estimates of mood. The following code will standardize the
theta
estimates, extract the country-year mean values,
standard deviations, and upper and lower 95% credible intervals. These
are saved as a csv file in “long” country-year format. Estimates before
a country’s first year of survey data are dropped. These estimates can
then be used in further analyses.
## Extract and save mood estimates
theta.out = apply(res$draws("theta"), 3, as.vector)
(theta.mean = mean(as.vector(theta.out)))
(theta.sd = sd(as.vector(theta.out)))
theta.std = (theta.out - theta.mean) / theta.sd # standardize
theta.t = apply(theta.std, 1, function(x) t(x) )
theta.pe = apply(theta.t, 1, mean)
theta.u95 = apply(theta.t, 1, quantile, probs=c(0.975))
theta.l95 = apply(theta.t, 1, quantile, probs=c(0.025))
theta.sd = apply(theta.t, 1, sd)
theta.df = data.frame(Country=cnt.names[r.map$Cntry], Year=r.map$Yr+year0, SupDem=theta.pe,
SupDem_u95=theta.u95, SupDem_l95=theta.l95, SupDem_sd=theta.sd)
theta.df$ISO3c = countrycode(sourcevar=theta.df$Country, origin="country.name", destination="iso3c")
# save country-year point estimates
write.csv(theta.df, "mood_est.csv", row.names=FALSE)