Outliers and their Causes
Outliers are data points or sets of observations that fall far outside the norm for a variable or population (Osborne & Overbay, 2004). Such data are inconsistent with the majority of the intended population or with the expected range of the variable. An outlier can result from experimental error, or it can arise from a heavily skewed distribution, in which case the assumption of a normal distribution may not hold. In other cases, it occurs purely by chance (Hawkins, 2014). Some statistical estimators and calculations are robust to outliers, while others are not.
Causes of Outliers
According to Osborne & Overbay (2004), outliers can arise from errors made in data collection, recording, or entry. Such errors can be corrected by returning to the original documents or subjects and obtaining the correct values. For example, in a study of how self-esteem varies with age among university students, a data-entry error might record one student as being 3 years old. This is an obvious outlier that can be corrected by checking with the original subject. Other outliers can be due to intentional misreporting meant to sabotage or influence the results, sampling error, standardization failure, or faulty distributional assumptions (Rosenfeld & Penrod, 2013). Still others are legitimate values drawn from the correct population (Osborne & Overbay, 2004). For instance, suppose the students' ages at the same university range from 14 to 40 years, with most falling between 18 and 24. A participant who is legitimately aged 75 is then an outlier sampled from the right population.
Effects of Outliers on Statistical Analysis
Outliers play a major role in statistical analysis. They increase error variance and weaken the power of statistical tests. They can significantly distort the magnitude of correlations, making the estimates inaccurate. If they are not randomly distributed, they decrease the normality of the distribution (Osborne & Overbay, 2004). In the example above, for instance, the mean increases or decreases depending on which side of the range the age outlier falls. As for the shape of the distribution, an outlier far below the range skews the distribution to the left, whereas an outlier far above the range skews it to the right (Hawkins, 2014).
Impact on Statistical Measures
The mean, and sometimes the median, are affected by outliers, while the mode is rarely affected. Taking the age example above, the 75-year-old student will increase the mean age of the students, since the mean is computed from the values of all subjects (Rosenfeld & Penrod, 2013). In some cases, an outlier also shifts the position of the median. Take the ages of the university students to be 14, 15, 18, 18, 18, 19, 20, 21, 22, 23, 24, 26, 30, 32, and 75. With the outlier 75 present, the median is 21; if the outlier were removed, the median would be the average of 20 and 21, that is, 20.5. This shift occurs only because removing the outlier changes the number of observations: the median sits at the center of the ordered values, so if the number of subjects were held fixed, for example by replacing 75 with an in-range value, the median would remain 21. The mean, by contrast, is not resistant and moves toward the outlier. If the age 75 were replaced with 33, which falls within the range, the mean would be 22.2; with the outlier, the mean rises to 25.
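The arithmetic in this example can be checked with a short Python sketch. It uses only the standard library's statistics module and the ages listed above; the variable names are illustrative and not part of the cited sources.

```python
from statistics import mean, median

# Ages from the example, including the 75-year-old outlier (15 values).
ages = [14, 15, 18, 18, 18, 19, 20, 21, 22, 23, 24, 26, 30, 32, 75]

# Case 1: the outlier is dropped, so the sample shrinks to 14 values.
ages_without_outlier = [a for a in ages if a != 75]

# Case 2: the sample size stays fixed and 75 is replaced by the in-range value 33.
ages_replaced = [33 if a == 75 else a for a in ages]

mean_with_outlier = mean(ages)                        # 25
median_with_outlier = median(ages)                    # 21
median_without_outlier = median(ages_without_outlier) # 20.5
mean_replaced = mean(ages_replaced)                   # 22.2
median_replaced = median(ages_replaced)               # 21

print("with outlier:   mean", mean_with_outlier, "median", median_with_outlier)
print("outlier dropped: median", median_without_outlier)
print("75 replaced:    mean", mean_replaced, "median", median_replaced)
```

The median moves only when the outlier is dropped and the number of observations changes; when 75 is replaced by 33, the median stays at 21 while the mean falls from 25 to 22.2.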
Identifying and Handling Outliers
To identify outliers, a researcher needs to examine the data for extreme data points that are influential. Researchers are at liberty to alter, remove, or retain outliers, especially legitimate ones. According to Osborne & Overbay (2004), removing outliers is beneficial: they found that it significantly improved accuracy and reduced error rates in correlations and t-tests. However, some researchers prefer to accommodate outliers using “robust” methods in order to preserve the real picture of the study. For instance, in the case of univariate distributions, researchers can use a trimmed mean or truncation. In our case, truncation might involve assuming that there cannot be a 75-year-old person at the university and recoding this age to a reasonable highest value of perhaps 40.
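The two robust approaches mentioned here can be illustrated with a minimal Python sketch applied to the age data from the earlier example. The helper names truncate and trimmed_mean are hypothetical, the cap of 40 is the value suggested in the text, and dropping one value from each end of the sample is an assumption made purely for illustration.

```python
from statistics import mean

# Ages from the earlier example, including the 75-year-old outlier.
ages = [14, 15, 18, 18, 18, 19, 20, 21, 22, 23, 24, 26, 30, 32, 75]

def truncate(values, upper):
    """Hypothetical helper: recode every value above `upper` to `upper`."""
    return [min(v, upper) for v in values]

def trimmed_mean(values, k):
    """Hypothetical helper: mean after dropping the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value to average")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

# Truncation: the age of 75 is recoded to the assumed ceiling of 40.
ages_truncated = truncate(ages, 40)
mean_truncated = mean(ages_truncated)   # 340 / 15, about 22.67

# Trimmed mean: drop one value from each end (14 and 75), an assumed trim level.
mean_trimmed = trimmed_mean(ages, 1)    # 286 / 13 = 22

print("raw mean:      ", mean(ages))
print("truncated mean:", round(mean_truncated, 2))
print("trimmed mean:  ", mean_trimmed)
```

Both approaches pull the mean back from 25 toward the bulk of the data. Truncation keeps all 15 observations, whereas the trimmed mean discards values from both ends of the sample.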
References
Hawkins, D. (2014). Identification of outliers. Amsterdam: Springer.
Osborne, J. W., & Overbay, A. (2004). The power of outliers (and why researchers should always check for them). Practical Assessment, Research & Evaluation. North Carolina: North Carolina State University.
Rosenfeld, B., & Penrod, S. D. (2013). Research methods in forensic psychology. Hoboken: Wiley.