Data transformation or not?
Working as a student assistant in statistics, I have encountered many questions about data manipulation. For instance, why do researchers use a variable exactly as it is documented in the dataset, without any manipulation, while in other cases they take the log of a variable?
This blog explains the rationale behind the choices scholars make when transforming data. For instance, Sama is interested in knowing whether a president blames other countries during a poor economy. To answer this research question, she decides to collect presidents’ speeches from news reports. The phrases Sama searches for are “accuse (other countries) for the economy or exports/imports” and “it is (other countries) that cause (some problem).” Let’s say that after a couple of months, Sama has finally collected a dataset of these news reports. She can’t use this data directly. Instead, the expected variables may be something like:
1) The emotion embedded in the text: is it a neutral or a negative statement?
2) The frequency or ratio of a single foreign country being mentioned, on a monthly basis (see the sketch below)
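As a rough illustration of variable (2), here is a sketch of how the monthly-mention count could be built with pandas; the country name, column names, and example reports are all hypothetical:

import pandas as pd

# Hypothetical raw data: one row per collected news report
reports = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11"]),
    "text": [
        "The president accused Country A of hurting our exports.",
        "It is Country A's trade policy that caused the slowdown.",
        "A speech about domestic policy with no foreign mention.",
    ],
})

# Flag whether a specific foreign country is mentioned in each report
reports["mentions_country_a"] = reports["text"].str.contains("Country A", case=False)

# Variable (2): how often Country A is mentioned per month
monthly_mentions = reports.groupby(reports["date"].dt.to_period("M"))["mentions_country_a"].sum()
print(monthly_mentions)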
To generate these variables, a number of decisions have to be made along the way, and the rationale behind some of them may never be disclosed in the published article.
Replicating research is difficult, even though more and more scholars provide replication files and lengthier appendices. Many choices are made during data collection, cleaning, re-classification, filling in missing data, checking for duplicates, testing, fitting regression models, drawing graphs, and so on. Sometimes we still find the work confusing even when reading the replication file. For instance, why choose log(x) instead of x? What is the purpose of standardizing a variable? Is it necessary to standardize it? Why re-classify the income variable from ten categories into three? Some choices are made because of the properties of the transformation, such as log(1+x), which remains defined when x = 0; others are made because of the distribution of the data.
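To make two of these transformations concrete, here is a minimal sketch (the numbers are made up): log(1+x) stays defined when x = 0, and standardization puts a variable on a mean-zero, unit-standard-deviation scale:

import numpy as np

x = np.array([0, 2, 5, 10, 50, 1000])   # hypothetical raw values

# log(1 + x): unlike log(x), it is defined at x = 0,
# which is one reason it is popular for count variables
log_x = np.log1p(x)

# Standardization: subtract the mean and divide by the standard deviation,
# so the variable is measured in standard-deviation units
z = (x - x.mean()) / x.std()

print(log_x)
print(z)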
For instance, suppose we want to know how severe protests are in each state. Which states are more likely to encounter protests? The easiest approach is to take the mean of each state’s monthly protest counts. However, Table 1 immediately shows that the mean is not a reasonable choice. The data for NY look odd: the state does not have many protests in most months, but there is a sudden surge in June (see Figure 1). The first thing to do is to double-check that your data collection process is flawless. Did you miscount protests from other states, or did you count the same incident twice?
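A quick way to run those checks is sketched below; the file name and column names (state, date, incident_id) are hypothetical:

import pandas as pd

# Hypothetical incident-level data: one row per recorded protest
incidents = pd.read_csv("protests.csv")          # assumed columns: state, date, incident_id
incidents["date"] = pd.to_datetime(incidents["date"])

# Was the same incident recorded twice?
double_counted = incidents[incidents.duplicated(subset=["incident_id"], keep=False)]
print(double_counted)

# Monthly counts per state, to spot surges like NY's in June
monthly_counts = incidents.groupby(["state", incidents["date"].dt.to_period("M")]).size()
print(monthly_counts)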
Figure 1. The distribution of the annual count of protests across states
If no mistakes were made during the collection process, the next step is to think about how to deal with this outlier, which differs vastly from the rest of NY’s observations. There are normally two options:
1) Delete the outlier: this clearly solves the problem. But there are some odd numbers in Ohio as well, such as January’s count. Do you also want to delete that one? Remember that in this example we only have 12 observations per state, and deleting an observation from an already limited sample is not a good choice.
2) Take log(x): by logging the monthly count of each state’s protests, which we refer to as x, the difference between having 10 protests and having 1000 protests suddenly shrinks. This is a better choice than deleting the observation.
Note: There are other ways to deal with this problem; it is not limited to these two options.
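A short sketch of what the log transformation does to these counts (natural logs; NY’s column is taken from Table 1 below):

import numpy as np

# The gap between 10 and 1000 protests on the log scale
print(np.log(10), np.log(1000))    # ≈ 2.30 and ≈ 6.91, instead of 10 vs 1000

# NY's monthly counts from Table 1: the mean of the raw counts is
# dominated by June's surge, while the mean of the logged counts is not
ny = np.array([1, 3, 2, 5, 1, 1000, 7, 2, 4, 20, 1, 5])
print(ny.mean())           # ≈ 87.6
print(np.log(ny).mean())   # ≈ 1.6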
Table 1. A fictional dataset of monthly counts of state protests

Month    NY's protests    Texas's protests    Ohio's protests
Jan              1               333                 444
Feb              3                66                  11
Mar              2               242                  66
Apr              5                53                  55
May              1               123                  87
Jun           1000                42                  44
Jul              7                12                 163
Aug              2               222                 155
Sep              4               456                  22
Oct             20                56                  50
Nov              1                88                  41
Dec              5                96                  52
Mean      87.58333          149.0833            99.16667