Estimating mood from existing surveys

This page explains how to use my dynamic Bayesian latent variable model to estimate national public opinion on a certain topic across multiples countries and points in time.

This model is suitable when you want to estimate aggregate (i.e., national) public opinion across many countries and many time units (typically years) and where existing survey data are fragmented across many different survey questions with numerous gaps in each national time-series.

I first developed this model to measure democratic support (AKA democratic mood) using existing survey data from the World Values Survey, Global Barometer Studies, and other cross-national public opinion projects. This work has resulted in several publications. It has also been applied to estimate European opinion about immigration.

The statistical details of the model are described in this article. Here I focus on how you might go about actually running this model, and obtaining estimates, using R and Stan.

Gathering data

You should gather nationally aggregated survey data on your topic of interest and organise it in “long” format, where each row in the dataset indicates a unique country-year-survey item combination. The output below illustrates using the first few rows of the support for democracy dataset (available here)

##       Country Year        Item Sample  Response Project
## 1 Afghanistan 2005    Army_asb    874 0.3306636     asb
## 2 Afghanistan 2005 EvDemoc_asb    874 0.7860412     asb
## 3 Afghanistan 2005  Strong_asb    874 0.2848970     asb
## 4     Albania 1998  Church_wvs    999 0.8870000     wvs
## 5     Albania 1998 EvDemoc_wvs    999 0.9430000     wvs
## 6     Albania 2002    Army_wvs   1000 0.8030000     wvs

The dataset has a column indicating the country unit (Country). There are also indicators for survey year (Year) and survey question (Item). You should come up with a list of Item codes that uniquely identify each unique survey item-survey project combination in your data. The data variables are Sample and Response: the latter indicates the (weighted) percentage of the national sample who were asked the particular item, and offered a pro-democratic response; the former shows the number of respondents who were asked that question. The dataset also includes a variable showing the opinion project which fielded the survey in question; this is not required as it is implicit in the item coding.

Note, if a particular survey item is fielded by the same opinion project more than once in a particular country and year, then add these results together. If there are many such instances, consider using more finely-grained unit of time, such as quarter.

Setting things up in R

The model is coded in Stan’s statistical language. You do not need to interact with the Stan code unless you would like to edit it somehow. This Stan model is called from R using the cmdstanr package, although the Rstan package can also be used (albeit with slightly different code). The R code is here and the Stan code is here.

You should install Stan and cmdstanr, if you haven’t already. See https://mc-stan.org/cmdstanr/.

The R code requires the following R packages, which should be installed if they are not already.

# libraries
library(arm)
library(dplyr)
library(tidyr)
library(loo)
library(ggplot2)
library(bayesplot)
library(cmdstanr)

Next, we set some cmdstanr options, set the local folder (this will not work outside of RStudio) and load the survey data

# options
options("cmdstanr_verbose" = TRUE)
options(mc.cores = parallel::detectCores())

# folders
WD = rstudioapi::getActiveDocumentContext()$path 
setwd(dirname(WD))
print( getwd() )

# read support for dem data
supdem = read.csv("Support for democracy raw 0522.csv")

We perform some edits: sorting the dataset, checking that there are no 0% values for the Response variable (which should be removed), and creating the response count variable RespN that will be required for the beta-binomial specification. (The data can also be collected this way if desired, as long as weights are used to create the response counts).

An unique item-country indicator, ItemCnt, is created – this is necessary to estimate the model. The initial year, year0, variable should also be set. Since the support for democracy data starts in 1988 (for certain countries), we set this to the year immediately before, 1987.

# remove NAs
supdem = supdem[!supdem$Response==0, ]

# order
supdem = arrange(supdem, Country, Year)

# create vector of response counts if they do not exist
supdem$RespN = round(supdem$Response*supdem$Sample)

# set first year
year0 = 1987 # = year before first year of available survey data
supdem = supdem[supdem$Year > year0,]

# create item by country indicators
supdem = unite(supdem, ItemCnt, c(Item, Country), sep = "_", remove = FALSE)
sddem1 = supdem

The dynamic Bayesian latent variable model adjusts for the bias induced by the usage of particular items in particular countries, i.e., item non-equivalence. It is therefore not desirable to include items which are only used in one country. No such items were included in the support for democracy dataset. A cautious approach would be also be to drop items which are fielded only once, even if in multiple countries. However, it is not always necessary to do so if information is available (e.g., from other items fielded in the same year) to identify the item-country parameters.

It may also be desirable to estimate mood only for countries which have been surveyed a certain number of times. The following code identifies and drops countries for which only one year of survey measures are available. This can be edited by changing the number in the cnt.obs.years > 1 statement.

# identify countries with few years of data
cnt.obs.years = rowSums(table(sddem1$Country, sddem1$Year) > 0)
sort(cnt.obs.years)

# run the next line to drop countries with less than 2 years of data
sddem2 = sddem1[sddem1$Country %in% levels(sddem1$Country)[cnt.obs.years > 1], ]

We then extract data vectors which will be used in the Stan model. Adjust the year in the n.yrs = 2020-year0 statement to specify the most recent year for which you require mood estimates.

## Prepare data for stan

# factorise
sddem2$Country = as.factor(as.character(sddem2$Country))
sddem2$Item = as.factor(as.character(sddem2$Item))
sddem2$ItemCnt = as.factor(as.character(sddem2$ItemCnt))
sddem2$Project = as.factor(as.character(sddem2$Project))
sddem2$Year = sddem2$Year-year0

# extract data
n.items = length(unique(sddem2$Item))
n.cntrys = length(unique(sddem2$Country))
n.yrs = 2020-year0 # estimates up to 2020
n.resp = dim(sddem2)[1]
n.itm.cnt = length(unique(sddem2$ItemCnt))
cntrys = as.numeric(factor(sddem2$Country))
cnt.names = levels(sddem2$Country)
items = as.numeric(factor(sddem2$Item))
yrs = sddem2$Year
itm.cnts = as.numeric(factor(sddem2$ItemCnt))
mean.resp.log = logit(mean(sddem2$Response))

# create item-country length indicator for items
item.ind.kp = rep(0, length(levels(sddem2$ItemCnt)))
for(i in 1:length(levels(sddem2$Item))) {
  item.ind.kp[grepl(levels(sddem2$Item)[i], levels(sddem2$ItemCnt))] = i
}
item.ind.len = sapply(lapply(levels(sddem2$Item), function(x) grep(x, levels(sddem2$ItemCnt))), length)

Assemble the bits of data into the list format which is preferred by cmdstanr, checking that there are no missing values and other errors:

# specify data for stan
dat.1 = list(N=n.resp, K=n.items, T=n.yrs, J=n.cntrys, P=n.itm.cnt, jj=cntrys, tt=yrs, 
             pp=itm.cnts, kk=items, it_len=item.ind.len, 
             x=sddem2$RespN, samp=sddem2$Sample, mn_resp_log=mean.resp.log)
sapply(dat.1, summary)

Fitting the model

Before fitting the model, we need to specify the parameters which are to be tracked and saved in the posterior draws. The names of these correspond with the notation used in the model description in Claassen (2019). The latent mood estimates, which are probably the parameters of most interest, are labeled theta.

# parameters to save 
pars.1 = c("Sigma","Omega","sigma_delta","sigma_theta","phi","mu_lambda","lambda","gamm","delta",
           "theta","x_pred","log_lik")

The code specifies that 4 chains are to be run with 1000 iterations each, the first 500 of which are used to “warm up” the MCMC algorithms. This should be sufficient for convergence. However, I would suggest testing first with a small number of interations, e.g., 10, because 1000 iterations may take 15 minutes or more (depending on your data and computer hardware).

# iterations for MCMC simulation
n.iter = 1000
n.warm = 500
n.samp = n.iter - n.warm
n.chn = 4

We then compile and run the model, using the lists of data and parameters we have just specified, as well as the Stan model (which is provided in a separate file, stan.mod6.7.stan).

The control parameters, adapt_delta and max_treedepth are set higher than their default values of 0.80 and 10. You can adjust adapt_delta: lower to speed up the sampling; higher if divergences are reported. See here for more detail.

# compile model
stan.mod = cmdstan_model('stan.mod6.7.stan')

# Stan fit
fit.mod= stan.mod$sample(
  data = dat.1,
  chains = n.chn,
  init = 1,
  parallel_chains = n.chn,
  iter_warmup = n.warm,
  iter_sampling = n.samp,
  refresh = round(n.iter/20, 0),
  adapt_delta = 0.90, 
  max_treedepth = 13,
  save_warmup = FALSE
)

Checking convergence

It is necessary to examine whether the MCMC simulation has converged and therefore whether the estimates of mood are reliable. The following code will report some basic diagnostics, a summary of the main parameters, a list of parameters with the highest (i.e., most concerning) Rhat values, and a list of the parameters with the lowest (i.e., most concerning) effective number of samples. Further guidance on how to interpret these diagnostics can be sought in the various vignettes and tutorials on the Stan website.

# Examine model fit
res = fit.mod
res$cmdstan_diagnose()
res.tab = res$print(pars.1, max_rows=80, digits=3)
sum = res$summary(pars.1)
print(sum[order(sum$rhat, decreasing=TRUE), ], n=50)
res_neff_ratio = neff_ratio(res)
res_neff_ratio[order(res_neff_ratio, decreasing=FALSE)][1:50]

A traceplot is also very helpful for diagnostic purposes. See https://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html for details on this and other “visual MCMC diagnostics”.

# traceplot
tp.pars = c("Sigma[1,1]","Sigma[2,2]","Omega[1,2]","sigma_theta","sigma_delta",
            "phi","lambda[31]","delta[23]","gamm[31]","theta[26,132]","theta[31,121]")
tp = bayesplot::mcmc_trace(res$draws(tp.pars), size=0.3, np=nuts_params(res))
tp

If the model has converged, you are now able to extract and save your country-year estimates of mood. The following code will standardize the theta estimates, extract the country-year mean values, standard deviations, and upper and lower 95% credible intervals. These are saved as a csv file in “long” country-year format. Estimates before a country’s first year of survey data are dropped. These estimates can then be used in further analyses.

## Extract and save mood estimates

theta.m.out = apply(res$draws("theta"), 3, as.vector)
(theta.m.mean = mean(as.vector(theta.m.out)))
(theta.m.sd = sd(as.vector(theta.m.out)))
theta.m.std = (theta.m.out - theta.m.mean) / theta.m.sd # standardize
theta.m.t = apply(theta.m.std, 1, function(x) t(x) )
theta.pe = apply(theta.m.t, 1, mean)
theta.u95 = apply(theta.m.t, 1, quantile, probs=c(0.975))
theta.l95 = apply(theta.m.t, 1, quantile, probs=c(0.025))
theta.sd = apply(theta.m.t, 1, sd)
theta.m.df = data.frame(Country=rep(cnt.names, each=n.yrs), 
                        Year=rep(1988:2020, times=n.cntrys), SupDem=theta.pe, 
                        SupDem_u95=theta.u95, SupDem_l95=theta.l95, SupDem_sd=theta.sd)

# save country-year point estimates
write.csv(theta.m.df, "mood_est.csv", row.names=FALSE)

Estimating mood from existing surveys

Chris Claassen

Gathering data

Setting things up in R

Fitting the model

Checking convergence