The rationale of variable construction in Bueno de Mesquita and Smith (2009)

Working as a student assistant in statistics, I have received many questions about variable operations that students encounter when reading a research paper or conducting a replication. Sometimes it is difficult to understand why an author takes the mean of a variable instead of its log. Many decisions are made during the long and tedious process of collecting, cleaning, classifying, imputing missing values, checking the distribution of the data, re-classifying, and so on. Some of these choices are clear to readers, but others are not. In this blog, I use the example of selectorate theory and how Bueno de Mesquita and Smith (2010) measure the public threat using Banks’s (2007) data. This blog should be especially useful for students who are new to quantitative analysis and for student assistants who help their professors collect and clean data.

If we had the phone number of every person in the world, we could call the original author directly and ask why he or she made one choice rather than another. In most cases, however, we do not have that number. Even if we manage to reach the original author, he or she may well not remember the choices made in that paper.

Some background on Bueno de Mesquita and Smith’s (2009) research: they argue that leaders allocate limited budgets across various kinds of policies in order to keep their supporters in line and avoid betrayal. Simply put, leaders face two kinds of threats: from elites and from the public. To test this argument, one step is to quantify the degree of threat from the elites and from the public.

In this blog, I describe only how they construct the measure of the threat from the public. The data are time-series cross-sectional, and the unit of analysis is the country-year.

On p.940, they describe their process of generating an index of the severity of public threats for each country. 

“In particular, we construct an index on the level of mass political events based on the Banks (2007) data coding of anti-government demonstrations, riots, general strikes, and revolutions. We create an index of mass political movements as follows. First, for each of the measures (x = demonstrations, riots, strikes, revolutions), we created a standardized version of the variable: = (ln(1 + x) - mean(ln(1 + x))) / (standard deviation(ln(1 + x))). Each of these standardized variables has mean zero and variance one. We then create an index, mass, by summing the four standardized variables and dividing by four.” (Bueno de Mesquita et al. 2005:940)

To make the statement clearer, the construction can be written as follows, where x_k is the count for measure k (demonstrations, riots, strikes, revolutions) in a given country-year and z_k is my own notation for its standardized version:

\[
z_k = \frac{\ln(1 + x_k) - \operatorname{mean}(\ln(1 + x_k))}{\operatorname{sd}(\ln(1 + x_k))}, \qquad \text{mass} = \frac{1}{4} \sum_{k=1}^{4} z_k
\]
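The construction described in the quotation can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' replication code: the counts are invented, the function name `standardize_log_counts` is hypothetical, and the use of the sample standard deviation is my assumption, since the quotation does not say which version was used.

```python
import math
import statistics


def standardize_log_counts(counts):
    """Return z-scores of ln(1 + x) for a list of non-negative event counts."""
    logged = [math.log1p(x) for x in counts]  # ln(1 + x)
    mean = statistics.fmean(logged)
    # Sample standard deviation: an assumption, the quoted text does not specify.
    sd = statistics.stdev(logged)
    return [(value - mean) / sd for value in logged]


# Hypothetical country-year counts, invented for illustration only.
events = {
    "demonstrations": [0, 2, 5, 1, 12],
    "riots": [0, 0, 3, 1, 7],
    "strikes": [1, 0, 0, 2, 4],
    "revolutions": [0, 0, 1, 0, 2],
}

# One standardized series per measure.
z = {name: standardize_log_counts(xs) for name, xs in events.items()}

# The index: sum the four standardized variables and divide by four.
n_obs = len(events["demonstrations"])
mass = [sum(z[name][i] for name in z) / len(z) for i in range(n_obs)]

for i, value in enumerate(mass):
    print(f"observation {i}: mass = {value:.3f}")
```

Each standardized series has mean zero and variance one by construction, so the index is an equally weighted average of the four measures.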

I had two questions when I first read the above description. The first: what is the difference between using the raw Banks (2007) counts and using z-scores of those counts? Without digging into the actual data, I suspect from this description that the distributions of demonstrations, riots, strikes, and revolutions differ considerably. For instance, the number of strikes may increase sharply in a single year. Standardizing puts the four measures on a common scale before they are averaged.
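To see why the scale of each measure matters, here is a small sketch with invented numbers. It leaves out the log step to keep the comparison simple, and the helper `zscores` is hypothetical. When raw counts are averaged, the measure with the larger spread dominates the index; after standardization, both measures contribute on the same scale.

```python
import statistics

# Hypothetical counts: demonstrations are frequent, revolutions are rare.
demonstrations = [0, 3, 8, 15, 40]
revolutions = [0, 0, 1, 0, 2]


def zscores(values):
    """Standardize a list of numbers using the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Averaging raw counts: the index mostly tracks demonstrations.
raw_index = [(d + r) / 2 for d, r in zip(demonstrations, revolutions)]

# Averaging z-scores: each measure has the same spread before averaging.
z_dem = zscores(demonstrations)
z_rev = zscores(revolutions)
std_index = [(zd + zr) / 2 for zd, zr in zip(z_dem, z_rev)]

print("spread of raw demonstrations:", statistics.stdev(demonstrations))
print("spread of raw revolutions:", statistics.stdev(revolutions))
print("spread after standardizing:", statistics.stdev(z_dem), statistics.stdev(z_rev))
```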

However, it is important to note that standardization may not always be the best solution, especially when these four types of contentious activity differ in the cost of organizing and the risk of being arrested. Demonstrations, riots, strikes, and revolutions are different methods that protestors choose to press their policy demands. The cost and expected effectiveness of these methods are quite different. For instance, riots are much more aggressive than demonstrations. Participants bear a higher cost in riots than in demonstrations. Police are more likely to use force in riots than in demonstrations, so the risk of being arrested is higher for participants in riots. You can question whether it is reasonable to standardize and then equally weight four contentious activities that seem to have inherent differences.

The second question: what is the purpose of using ln(1+x) instead of using x?

Log transformation is widely used in quantitative studies because it has several useful features. One is that it compresses large values: the log keeps growing as x grows, so there is no fixed ceiling, but it grows ever more slowly. This reduces the pull of an outlier on the shape of your data.
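A quick numerical check, using invented counts, shows how ln(1 + x) treats an extreme value. The transformed values never hit a ceiling, but the gap between the outlier and the rest of the data shrinks considerably.

```python
import math

# Hypothetical event counts; the last value is an outlier.
counts = [0, 1, 2, 3, 150]
logged = [math.log1p(x) for x in counts]  # ln(1 + x)

# How far the outlier sits from the next-largest value, before and after.
raw_ratio = counts[-1] / counts[-2]
log_ratio = logged[-1] / logged[-2]

print("raw values:   ", counts)
print("logged values:", [round(v, 3) for v in logged])
print("outlier / next-largest, raw:   ", raw_ratio)
print("outlier / next-largest, logged:", round(log_ratio, 2))
```

On the raw scale the outlier is 50 times the next-largest value; on the log scale it is less than 4 times as large.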

The second is that the log turns products into sums: for returns r_1, ..., r_n that compound to a total return R, ln(1 + R) = ln(1 + r_1) + ... + ln(1 + r_n). This feature can be useful if you have missing data. For instance, suppose you would like to know the daily stock-market return for May 20 but that one value is missing. In this hypothetical scenario, if you have the monthly return and the other daily returns, you can deduce the May 20 return by subtracting the known daily log returns from the monthly log return.
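One property that fits the stock-return example is that log returns add up across periods. The sketch below uses invented daily returns for a five-day "month" and assumes the monthly return and every other daily return are known; under those assumptions, the one missing day can be recovered by subtraction.

```python
import math

# Hypothetical simple daily returns for a short five-day "month".
daily = [0.010, -0.004, 0.007, 0.002, -0.003]

# The monthly simple return compounds the daily ones.
monthly = math.prod(1 + r for r in daily) - 1

# On the log scale, compounding becomes addition.
log_daily = [math.log1p(r) for r in daily]
log_monthly = math.log1p(monthly)
print("sum of daily log returns:", sum(log_daily))
print("monthly log return:      ", log_monthly)

# Pretend the third day is missing and recover it from the rest.
missing_idx = 2
known_sum = sum(v for i, v in enumerate(log_daily) if i != missing_idx)
recovered = math.expm1(log_monthly - known_sum)  # back to a simple return
print("recovered return for the missing day:", recovered)
```

Note that this only works when a single value is missing; the monthly return alone does not pin down any individual day.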

 

Why do they add 1 instead of using x? The likely reason for using log(1 + x) instead of log(x) is that x equals 0 in some country-years, for example when no riot was recorded. Since log(0) is undefined, the authors use log(1 + x), which maps a count of 0 to 0.
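The difference is easy to check in Python's standard `math` module, where `math.log(0)` raises an error while `math.log1p(0)` returns zero. This is just a small illustration of the point, not anything taken from the original paper.

```python
import math

# ln(0) is undefined; Python's math.log raises ValueError for it.
try:
    math.log(0)
    log_zero_defined = True
except ValueError:
    log_zero_defined = False

# ln(1 + 0) = ln(1) = 0, so a country-year with no events stays at zero.
zero_count = math.log1p(0)

print("math.log(0) is defined:", log_zero_defined)
print("math.log1p(0) =", zero_count)
```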

Scholars make many choices in the course of their research, but not all of those decisions are well explained in their appendices and replication files. As students, or as people passionate about our studies, we need to bear in mind how the data that scholars choose, and the way they transform and manipulate it, affect the results they get. There is no single correct way to transform data. The choice is determined partly by the scholar's knowledge of the features of that type of data and partly by the distribution of the data. It is important to understand how other scholars transform their data, but the most important thing is to know your own data. What are the distribution and the shape of your data? What is the data-generating process? Answering these questions is the very first step before you start working on your research.

Some definitions (Morrow et al. 2008):

Support coalition: the set of those selectors who support the current leader. 

Winning coalition: the quantity of selectors whose support the leader must retain to remain in office.

Selectorate: the set of people in the polity who can take part in choosing a leader. 


References:

Bueno de Mesquita, B., & Smith, A. (2009). Political survival and endogenous institutional change. Comparative Political Studies, 42, 167-97.

Morrow, J. D., Bueno de Mesquita, B., Siverson, R. M., & Smith, A. (2008). Retesting selectorate theory: Separating the effects of W from other elements of democracy. American Political Science Review, 102(3), 393-400.
