Data transformation or not?

Working as a student assistant in statistics, I have encountered many questions about data manipulation. For instance, why do researchers use a variable directly, exactly as documented in the dataset, without any transformation? Or why do they log a variable?

This blog post explains the rationale behind the choices scholars make when transforming data. For instance, suppose Sama wants to know whether a president blames other countries when the economy is doing poorly. To answer this question, she decides to collect presidents’ speeches from news reports, searching for phrases such as “blame (other countries) for the economy or for exports/imports” and “it is (another country) that causes (some problem).” Say that after a couple of months, Sama has finally assembled a dataset of these news reports. She cannot use this raw text directly. Instead, the variables she needs may look something like:

1) The sentiment embedded in the text: is the statement neutral or negative?

2) The frequency or ratio with which a single foreign country is mentioned each month

Generating these variables requires a series of decisions, and the rationale for some of them may never be disclosed in the published article.
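To make this concrete, here is a minimal Python sketch of variable 2, the monthly count of mentions of each foreign country. The country list, the toy speeches, and the keyword-matching approach are all invented for illustration; a real project would use a proper text-analysis pipeline, and variable 1 (sentiment) would require an actual sentiment classifier.

```python
import re
from collections import Counter

# Hypothetical country list and toy speech snippets, invented for illustration.
COUNTRIES = ["China", "Mexico", "Germany"]
speeches = [
    {"month": "2020-01", "text": "China is to blame for our export problems."},
    {"month": "2020-01", "text": "Our economy grew steadily this quarter."},
    {"month": "2020-02", "text": "It is Mexico that causes the trade deficit."},
]

def mentions_by_month(speeches, countries):
    """Count how often each country is mentioned, per month."""
    counts = Counter()
    for speech in speeches:
        for country in countries:
            # Crude keyword match; real work would handle aliases, casing, etc.
            if re.search(rf"\b{country}\b", speech["text"]):
                counts[(speech["month"], country)] += 1
    return counts

print(mentions_by_month(speeches, COUNTRIES))
# Counter({('2020-01', 'China'): 1, ('2020-02', 'Mexico'): 1})
```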

Replicating research is difficult, even as more and more scholars provide replication files and lengthy appendices. Many choices are made during data collection, cleaning, re-classification, filling in missing data, checking for duplicates, testing, fitting regression models, drawing graphs, and so on. Sometimes we still find the analysis confusing even after reading the replication file. For instance, why choose log(x) instead of x? What is the purpose of standardizing a variable, and is it necessary? Why re-classify the income variable from ten categories into three? Some choices are made for mathematical reasons, such as using log(1+x) so that zero values remain defined; others are made because of the variable’s distribution.
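As a rough illustration of the transformations just mentioned, here is a short Python sketch. The data and variable names are invented for illustration, not taken from any real study:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data, invented for illustration.
df = pd.DataFrame({
    "income_cat": rng.integers(1, 11, size=100),        # ten ordinal categories
    "donation": rng.lognormal(mean=3.0, sigma=1.0, size=100),
})

# log(1 + x) instead of log(x): it stays defined when x = 0.
df["log_donation"] = np.log1p(df["donation"])

# Standardizing (z-scores) puts variables on a common scale; it changes the
# interpretation of coefficients, not the underlying relationship.
df["donation_std"] = (df["donation"] - df["donation"].mean()) / df["donation"].std()

# Re-classifying ten income categories into three coarser bins.
df["income_3cat"] = pd.cut(df["income_cat"], bins=[0, 3, 7, 10],
                           labels=["low", "middle", "high"])
```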

For instance, suppose we want to know how severe protests are in each state: which states are more likely to experience protests? The easiest approach is to take the mean of each state’s monthly protest counts. However, Table 1 immediately shows that the mean is not a reasonable choice here. The data for NY look strange: the state does not have many protests in a typical month, but there is a sudden surge in June (see Figure 1). The first thing to do is double-check that your data collection process is sound. Did you miscount protests that belong to other states, or did you record the same incident twice?
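Using the numbers from Table 1, a quick check in Python (with pandas) shows how one extreme month distorts the mean while the median stays informative:

```python
import pandas as pd

# Monthly protest counts from Table 1 (a fictional dataset).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
protests = pd.DataFrame({
    "NY":    [1, 3, 2, 5, 1, 1000, 7, 2, 4, 20, 1, 5],
    "Texas": [333, 66, 242, 53, 123, 42, 12, 222, 456, 56, 88, 96],
    "Ohio":  [444, 11, 66, 55, 87, 44, 163, 155, 22, 50, 41, 52],
}, index=months)

print(protests.mean())    # NY ~87.6: inflated by the single June spike
print(protests.median())  # NY 3.5: a very different picture of a typical month
```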


Figure 1. The distribution of monthly protest counts across states

If no mistakes were made during the collection process, the next step is to decide how to handle this outlier, which differs vastly from the rest of NY’s observations. There are two common options:

1) Delete the outlier: this clearly solves the problem. But there are some odd numbers in Ohio as well, such as January’s count of 444. Do you want to delete that one too? Remember, in this example we only have 12 observations per state, and deleting observations from an already limited sample is not a good choice.

2) Take log(x): by logging the monthly protest count, which we refer to as x, the difference between having 10 protests and having 1,000 protests suddenly shrinks. This is a better choice than deleting the observation (see the sketch after the note below).

 

Note: There are other ways to deal with this problem; the choices are not limited to these two options.
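As a minimal sketch of option 2, note how the log compresses the gap between small and huge counts (log1p, i.e. log(1+x), is used so that a month with zero protests would remain defined):

```python
import numpy as np

counts = np.array([0, 1, 10, 1000])
# log(1 + x): 0 -> 0.0, 1 -> 0.69, 10 -> 2.40, 1000 -> 6.91.
# The June outlier no longer dominates the scale.
print(np.log1p(counts))
```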

Table 1. A fictional dataset of monthly protest counts by state

Month    NY's protests    Texas's protests    Ohio's protests
Jan              1                 333                444
Feb              3                  66                 11
Mar              2                 242                 66
Apr              5                  53                 55
May              1                 123                 87
Jun           1000                  42                 44
Jul              7                  12                163
Aug              2                 222                155
Sep              4                 456                 22
Oct             20                  56                 50
Nov              1                  88                 41
Dec              5                  96                 52
Mean      87.58333            149.0833           99.16667

