- #1
ExNihilo
Homework Statement
I would like to determine the upper outliers in a dataset whose distribution is NOT normal. The dataset gives the number of pages viewed by each IP address. Basically, when a web page is viewed by human users, the IP address has very few hits (1 to 3), while the IP addresses of web crawlers make a lot of page hits.
- IP 1 = 700 views
- IP 2 = 650 views
...
- IP n1 = 50 views
- IP n2 = 45 views
...
- IP n3 = 3 views
- IP n4 = 2 views
- IP n5 = 1 view
The sample contains a few thousand unique IP addresses, each with its page-view count. It is quite possible that different IP addresses have the same number of page views. I would like to use a statistical method to determine a threshold value that separates the crawlers from normal users.
Thank you very much in advance for any advice.
Homework Equations
(none)
The Attempt at a Solution
The data distribution is not normal, so I am not sure whether a method based on the standard deviation would apply. Searching on the net, a possible solution could be the "interquartile range" (IQR) method: http://krishnadagli.blogspot.com/2008/05/learning-statistics-using-rdetecting.html
I am not sure whether this would apply well to my scenario. Can you please confirm, or suggest a better approach?
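To make the question concrete, here is a minimal sketch of the IQR rule as I understand it: flag any value above Q3 + 1.5 × IQR, where IQR = Q3 - Q1. It is written in Python only for illustration. The function names, the multiplier k = 1.5 (the conventional choice), and the sample IPs and counts are my own assumptions, not real data.

```python
from statistics import quantiles


def iqr_upper_threshold(views, k=1.5):
    """Return the upper fence Q3 + k * IQR for a list of page-view counts."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = quantiles(views, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def flag_crawlers(views_by_ip, k=1.5):
    """Return the IPs whose page-view count lies above the upper fence."""
    threshold = iqr_upper_threshold(list(views_by_ip.values()), k)
    return {ip: v for ip, v in views_by_ip.items() if v > threshold}


if __name__ == "__main__":
    # Hypothetical sample: mostly low counts plus one heavy hitter.
    sample = {"ip_a": 1, "ip_b": 2, "ip_c": 3, "ip_d": 4, "ip_e": 5,
              "ip_f": 6, "ip_g": 7, "ip_h": 8, "ip_i": 700}
    print(iqr_upper_threshold(list(sample.values())))
    print(flag_crawlers(sample))
```

One thing I am unsure about: if most IPs have only 1 to 3 views, the IQR will be very small, so the fence may sit quite low and flag many IPs. The multiplier k might then need to be larger than 1.5.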