Determine Outliers in a non-normal distribution

In summary: Try to include the median value on the boxplot, and also the IQR value. The one problem with the median-based idea is this: the method that places the outlier cutoffs a fixed distance above the third quartile and below the first quartile "protects" the central 50% of the data from being identified as outliers (as does the median-based idea), but it also works on the assumption that outliers begin a specific distance from those quartiles: the cutoffs for large and small outliers are equally distant from the upper and lower quartiles. The median-based procedure would not necessarily set the distance equally far from each quartile, only equally distant from the center of the data.
  • #1
ExNihilo

Homework Statement



I would like to determine the upper outliers in a dataset whose distribution is NOT normal. The dataset records the number of pages viewed by each IP address. Basically, when a web page is viewed by human users, the IP address has very few hits (1 to 3), while the IP addresses of web crawlers make a lot of page hits.

- IP 1 = 700 views
- IP 2 = 650 views
...
- IP n1 = 50 views
- IP n2 = 45 views
...
- IP n3 = 3 views
- IP n4 = 2 views
- IP n5 = 1 view

The sample contains a few thousand unique IP addresses, each with its page-view count. It is very possible that different IP addresses viewed the same number of pages. I would like to use a statistical method to determine a threshold value which separates the crawlers from normal users.


Thank you very much in advance for any advice.


Homework Equations


(none)


The Attempt at a Solution


The data distribution is not normal, so I am not sure whether a method based on the standard deviation would apply. Searching on the net, a possible solution could be the "Interquartile Range" (IQR) method: http://krishnadagli.blogspot.com/2008/05/learning-statistics-using-rdetecting.html

I am not sure whether this applies well to my scenario. Can you please confirm, or add anything I am missing?
 
  • #2
The method you mention will work, but looking at the vast differences in sizes of the measurements you may find a huge number of outliers.
If you have access to a program that creates boxplots of numerical data (Minitab, R) the outliers will appear as asterisks at the upper end of the boxplot: the upper whisker will extend only as high as it can go without locating an outlier.

Just a comment: for data that are normally distributed there is a link between this method and the mean and standard deviation. The first and third quartiles of a normal distribution are roughly 0.67448 standard deviations above and below the mean, so the IQR for the normal distribution is roughly 1.34896 times the standard deviation.
Saying an outlier is any value more than 1.5 IQR below the first quartile or above the third quartile is the same as saying the cutoff lies about 2.02 standard deviations beyond that quartile, or roughly 2.7 standard deviations from the mean. So for normally distributed data, at least, the IQR method and the usual standard deviation methods are comparable.

I know your data is not normal, but the comparison in the normal case can be helpful in seeing what the motivation is.
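
The arithmetic behind that comparison can be checked numerically. This is a small sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative, and it applies only to the idealized normal case, not to the page-view data.

```python
from statistics import NormalDist

std_normal = NormalDist()        # mean 0, standard deviation 1

q3 = std_normal.inv_cdf(0.75)    # third quartile, in standard deviations
iqr = 2 * q3                     # Q1 = -Q3 by symmetry, so IQR = 2 * Q3
fence_from_q3 = 1.5 * iqr        # distance of the cutoff beyond Q3
fence_from_mean = q3 + fence_from_q3

# Share of a normal population falling outside both cutoffs combined.
share_flagged = 2 * (1 - std_normal.cdf(fence_from_mean))

print(round(q3, 4), round(iqr, 4))                        # 0.6745 1.349
print(round(fence_from_q3, 4), round(fence_from_mean, 4))
print(round(100 * share_flagged, 2), "% flagged")
```

Under these assumptions the 1.5 IQR rule flags well under 1% of truly normal data, which is the sense in which it is comparable to a cutoff of a few standard deviations.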
 
  • #3
Hi,

Thank you for your advice. The data sample I am dealing with is closer to a "cliff shape" than a bell curve. I haven't started any serious work yet, so I cannot confirm how accurate the IQR method would be.

However, I plan to use the median (not the mean) as a control value, just as a way to give the data frequency some influence. Something like: if 1.5 IQR is way above the median value, then it is a reliable threshold. Do you think the median could usefully arbitrate in making a decision? If so, can you suggest some directions I could develop further?

Thanks in advance.
 
  • #4
ExNihilo said:
Hi,

Thank you for your advice. The data sample I am dealing with is closer to a "cliff shape" than a bell curve. I haven't started any serious work yet, so I cannot confirm how accurate the IQR method would be.

However, I plan to use the median (not the mean) as a control value, just as a way to give the data frequency some influence. Something like: if 1.5 IQR is way above the median value, then it is a reliable threshold. Do you think the median could usefully arbitrate in making a decision? If so, can you suggest some directions I could develop further?

Thanks in advance.

The one problem I have with that is this: the method I outlined, where you determine the location of outliers by using a fixed distance above the third quartile and below the first quartile, "protects" the central 50% of the data from being identified as outliers (as does your idea), but it also works on the assumption that outliers begin a specific distance from those quartiles: the cutoffs for large and small outliers are equally distant from the upper and lower quartiles. Your procedure would not necessarily set the distance equally far from each quartile, only equally distant from the center of the data. In short, my fear is that you would identify too many values as outliers.
Have you constructed a boxplot? (It would be a good choice, since most programs that provide them show outliers on the plot.)
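
To make the concern concrete, here is a rough sketch comparing the two kinds of cutoff. It assumes one possible reading of the median idea, namely cutoffs at median ± 1.5 IQR; the proposal above does not spell out its exact rule, so that reading, the helper name `cutoffs`, and the sample values are all hypothetical.

```python
import statistics

def cutoffs(values, k=1.5):
    """Return (quartile-based, median-based) pairs of lower/upper cutoffs."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = k * (q3 - q1)
    quartile_based = (q1 - spread, q3 + spread)        # usual boxplot rule
    median_based = (median - spread, median + spread)  # assumed reading of the median idea
    return quartile_based, median_based

# Made-up, right-skewed page-view counts (illustrative only).
views = [1, 1, 2, 2, 3, 3, 4, 6, 9, 12, 20, 30, 700]

quartile_based, median_based = cutoffs(views)
flagged_quartile = [v for v in views if v > quartile_based[1]]
flagged_median = [v for v in views if v > median_based[1]]

print(quartile_based, flagged_quartile)  # (-13.0, 27.0) [30, 700]
print(median_based, flagged_median)      # (-11.0, 19.0) [20, 30, 700]
```

Since the median can never exceed the third quartile, an upper cutoff measured from the median is never higher than the one measured from Q3, so under this reading it flags at least as many large values.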
 

Related to Determine Outliers in a non-normal distribution

1. What is the definition of an outlier in a non-normal distribution?

An outlier in a non-normal distribution is a data point that is significantly different from the rest of the data points in the distribution. It is an observation that falls outside of the expected range and can potentially skew the results of the analysis.

2. How do you determine outliers in a non-normal distribution?

There are various methods for determining outliers in a non-normal distribution, including the use of z-scores, box plots, and the interquartile range (IQR) method. These methods involve identifying data points that fall beyond a certain threshold or outside of a specific range.

3. Can outliers be removed from a non-normal distribution?

Yes, outliers can be removed from a non-normal distribution. However, this should be done with caution and only after careful consideration of the potential impact on the data and the analysis results. Removing outliers may alter the overall distribution and potentially change the interpretation of the data.

4. Are outliers always bad or should they be kept in the analysis?

Outliers are not necessarily always bad and should not always be removed from the analysis. In some cases, outliers can provide valuable insights and information about the data. It is important to carefully consider the context and potential impact of outliers before deciding whether to keep or remove them from the analysis.

5. Can a normal distribution have outliers?

Yes, a normal distribution can have outliers. However, the presence of outliers in a normal distribution may indicate that the data is not truly normally distributed or that there are underlying factors affecting the data. It is important to investigate and understand the cause of the outliers in a normal distribution.
