A Real-Life Instance of Plagiarism Detection by SCAM

This note describes how SCAM was used in a real-life plagiarism case. Our involvement started when we saw a message on DBWORLD, a large mailing list, about a possible case of plagiarism by an author. We felt this was a good opportunity to test SCAM in a real-life scenario.

We downloaded abstracts of other papers written by the same author (henceforth referred to as X) from INSPEC (a database of conference and journal abstracts) and registered them into SCAM. Since we did not know the source of the plagiarism (that is, which paper X's paper was copied from), we had to "poll" INSPEC for abstracts in the same field as each of X's abstracts.

We chose several keywords from each of X's abstracts (for instance, we chose "routing" and "VLSI" for one of the VLSI CAD papers) and downloaded all the abstracts returned by the keyword search (usually between about 1,000 and 10,000 abstracts). We deliberately avoided using a single keyword per abstract, since that could discard abstracts that overlapped with X's but did not happen to contain the chosen keyword. For instance, X may have added a word like "VLSI" even though the original abstract did not use it. We then formed the union of all abstracts downloaded for the several keywords and used SCAM to compare this union against the registered abstracts. SCAM returned between 1 and 20 abstracts for each of X's abstracts (at least one, because X's abstract was itself already in INSPEC and clearly had 100% overlap). Finally, we manually examined the flagged abstracts to see whether any had a close relation to X's abstracts.
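The workflow above can be sketched in a few lines of Python. This is a hypothetical illustration only: the search function stands in for an INSPEC keyword query, and the crude word-overlap fraction stands in for SCAM's actual (unspecified here) overlap measure.

```python
# Illustrative sketch of the keyword-union + overlap-flagging workflow.
# search_abstracts, word_overlap, and flag_candidates are stand-ins,
# not SCAM's or INSPEC's actual interfaces.

def search_abstracts(corpus, keyword):
    """Return ids of abstracts in `corpus` containing `keyword`
    (a stand-in for an INSPEC keyword search)."""
    return {doc_id for doc_id, text in corpus.items()
            if keyword.lower() in text.lower()}

def word_overlap(a, b):
    """Fraction of the distinct words of `a` that also appear in `b`
    (a crude stand-in for SCAM's overlap measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa) if wa else 0.0

def flag_candidates(x_abstract, keywords, corpus, threshold=0.5):
    """Flag abstracts whose overlap with X's abstract exceeds `threshold`."""
    # Union of results across all keywords, so an abstract missing
    # one particular keyword is still considered.
    candidates = set()
    for kw in keywords:
        candidates |= search_abstracts(corpus, kw)
    # Compare each candidate against X's abstract; high overlaps
    # go to a human for manual examination.
    return [(doc_id, word_overlap(corpus[doc_id], x_abstract))
            for doc_id in candidates
            if word_overlap(corpus[doc_id], x_abstract) >= threshold]
```

Taking the union before comparing is the key step: it trades a larger candidate set (and more SCAM comparisons) for a lower chance of missing an overlapping abstract that lacks one of the chosen keywords.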

Overall we manually downloaded about 35,000 abstracts from INSPEC. Downloading from INSPEC to our local workstation took a few hours because of quirks in INSPEC's mailing system (we could not mail more than about 50 citations at a time). Once the abstracts were downloaded, it took SCAM about 5 minutes to produce the list of possible overlaps. A task that would have taken a human days or months without SCAM was reduced to the 15 to 20 minutes needed to manually examine all the flagged abstracts (about 50) for the initial set of 4 abstracts by X.

We were then in constant touch with Prof. Halatsis (University of Athens, Greece), who sent us more abstracts of papers that X had either listed on his CV or had submitted. SCAM found more instances of plagiarism, along with the original source from which each of X's abstracts was copied. In addition to our INSPEC searches, we also expanded our search space to the CS-TR database, a database of computer science technical reports from a few universities. The entire incident was summarized in a message posted on several newsgroups and mailing lists on the Internet.