“Are you a good witch or a bad witch?” – Glinda, to Dorothy in The Wizard of Oz.
At bepress we ask a very similar question thousands of times each day as we handle requests for content downloads and determine whether each one comes from a bot or from a human reader. Some bots are good bots that clearly identify themselves, such as the Google crawler. These are easy to identify and remove from readership reports. Other bots are bad bots because they don't identify themselves as bots. In fact, many go the other way and attempt to pass themselves off as human readers. To maintain accurate download counts we must identify the downloads performed by bots and exclude them from the counts we report to authors and administrators. Identifying and excluding bad bots creates a regular source of work for us as we strive to provide the most accurate download counts in the industry.
For those unfamiliar with the term, a bot (short for "robot") is a program designed to automate tasks on the web. As mentioned above, the web crawler that Google uses to find and index content is a bot. The programs written and deployed to wreak havoc on websites and networks in protest (or for nefarious hacker fun) are also bots. Bots aren't necessarily bad, but their activity should never be counted as human readership.
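Well-behaved bots announce themselves in the `User-Agent` header of each request (Google's crawler, for instance, includes "Googlebot" in its string). A minimal sketch of that first, easy filtering step might look like the following; the pattern list is purely illustrative and is not our actual filter:

```python
import re

# Illustrative User-Agent substrings that well-behaved crawlers send.
# This list is a sketch, not bepress's production rule set.
KNOWN_BOT_PATTERNS = [
    r"Googlebot",
    r"bingbot",
    r"Slurp",      # Yahoo's crawler
    r"\bbot\b",    # many crawlers simply include the word "bot"
    r"crawler",
    r"spider",
]

BOT_RE = re.compile("|".join(KNOWN_BOT_PATTERNS), re.IGNORECASE)

def is_declared_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string announces itself as a bot."""
    return bool(BOT_RE.search(user_agent))
```

The hard part, of course, is the bad bots: traffic like this check will never catch, because the sender deliberately supplies a browser-like `User-Agent`.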
Starting in April of this year, we noticed a dramatic increase in the number of downloads of content across all of the Digital Commons sites. With access to data from hundreds of sites, we could quickly determine that this was not just a blip but a doubling of download activity over a six-week period. We dropped what we were doing and focused the development team on identifying what this traffic was and where it came from. We quickly found that we were being crawled by a number of new, sophisticated bots originating from Ukraine, Poland, and China that outsmarted our existing bot-filtering algorithms.
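Spotting that kind of sustained surge, as opposed to a one-day blip, can be as simple as comparing recent weekly totals against a trailing baseline. This is a hedged sketch of the idea, not our monitoring system; the six-week window and the 2x ratio are assumptions chosen to mirror the numbers above:

```python
from statistics import mean

def is_sustained_surge(weekly_totals, recent_weeks=6, ratio=2.0):
    """weekly_totals: chronological list of site-wide weekly download counts.

    Returns True when the average of the last `recent_weeks` weeks is at
    least `ratio` times the average of all preceding weeks -- i.e. a
    doubling sustained over the whole recent window, not a one-off spike.
    Both thresholds are illustrative assumptions.
    """
    baseline = weekly_totals[:-recent_weeks]
    recent = weekly_totals[-recent_weeks:]
    if not baseline:
        return False  # not enough history to compare against
    return mean(recent) >= ratio * mean(baseline)
```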
For the past six weeks we have been working on a two-pronged solution to this problem. First, we developed rules to catch these bots in real time going forward. This was an iterative process: we processed large volumes of historical data to find trends across sites, built filters to catch that activity, and then tested the filters across multiple Digital Commons instances to evaluate their effectiveness. Second, we applied those rules to the previously recorded download counts for April and May to eliminate any that were erroneously recorded.
This is not the first time we've done work like this. Occasionally we find, or a subscriber reports, a sharp increase in download activity, and we have run similar projects in response. Those incidents, however, have generally been site-specific. This latest attack is the largest and most widespread we've seen. Because of its invasiveness, we wanted to take extra care in developing a thoughtful response so that we are better able to prevent this activity going forward.
You may be asking why this is so important. We, and many of our subscribers, believe that download counts are an important measure of the effectiveness and reach of Digital Commons collections. As the measurement of research impact expands from traditional citation counts to alternative metrics, we believe that downloads are an integral part of that story.