COSC 419: Learning Analytics

A6: Plagiarism Detection [15 pts]

Due date: Apr 05, 2020, 11:59pm

In this assignment, you will implement simple plagiarism detection algorithms as discussed in class. Specific data sets will be provided to you.

What to submit:

Submit the following on Connect:


Specific Instructions for Exercise 1: Implementing containment

Use any programming language you want. (My solution uses Ruby.)

  1. Generate a list of word n-grams for each document in your data set, for n = 1..10 (inclusive).
  2. Select a pair of documents (you can download them from "Resources" below). Call them texts A and B respectively. The specific pairs you will use are:
  3. For the specific text pairs above, compute containment C_n(A,B) for n = 1..10 (inclusive). That is, for n=1, create the set of unigrams in text A (S_A) and the set of unigrams in text B (S_B). Then determine the size of the intersection of the two sets and divide it by the size of S_B to obtain C_1(A,B) = |S_A ∩ S_B| / |S_B|. Next, for n=2, create the bigram sets S_A and S_B and use them to compute C_2(A,B) in the same way. Repeat this for each of n=3, ..., 10.
    Note: The underscore in containment means subscript.
  4. Report your results in a table with the following heading:
    Filename 1 Filename 2 C_1(A,B) C_2(A,B) C_3(A,B) C_4(A,B) C_5(A,B) C_6(A,B) C_7(A,B) C_8(A,B) C_9(A,B) C_10(A,B)
    ...
  5. Interpret the results reported in the table and make sure they make sense to you based on the definitions of word n-grams and containment. Specifically, answer the following questions:
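
As a starting point for steps 1 to 3, here is a minimal sketch in Python (any language is acceptable). It is an illustration under stated assumptions, not the reference solution: the function names tokenize, ngram_set and containment are hypothetical, the tokenization (lowercasing and keeping only letters, digits and apostrophes) is one possible choice rather than a requirement, returning 0.0 when S_B is empty is an assumed convention, and the two sample sentences are made up and are not part of the provided data set.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (an assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_set(tokens, n):
    """Return the set of word n-grams, each stored as a tuple of n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(text_a, text_b, n):
    """Compute C_n(A,B) = |S_A ∩ S_B| / |S_B|.

    Returns 0.0 when text B has no n-grams of size n (assumed convention).
    """
    s_a = ngram_set(tokenize(text_a), n)
    s_b = ngram_set(tokenize(text_b), n)
    if not s_b:
        return 0.0
    return len(s_a & s_b) / len(s_b)


# Made-up sample texts, used only to exercise the functions.
text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the quick brown cat jumps over the lazy dog"

# One table row: C_1(A,B) through C_10(A,B).
row = [round(containment(text_a, text_b, n), 3) for n in range(1, 11)]
print(row)
```

For these two sample sentences, S_B has 8 unigrams of which 7 also occur in S_A, so C_1(A,B) = 7/8; for bigrams, 6 of the 8 in S_B are shared, so C_2(A,B) = 6/8. Because each text has only 9 words, there are no 10-grams, and the sketch reports 0.0 for n=10.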
Grading Criteria


Resources: