COSC 419: Learning Analytics

A6: Plagiarism Detection [15 pts]

Due date: Apr 05, 2020, 11:59pm

In this assignment, you will implement simple plagiarism detection algorithms as discussed in class. Specific data sets will be provided to you.

What to submit:

Submit the following on Connect:

A PDF report documenting:
- Output of all n-grams for each text file for n=1...10 (inclusive)
- The table of results
- Written answers for interpreting your results
- If your code doesn't fully work: document which part of the assignment you were able to finish and which part(s) did not work. In your explanation, be sure to reference precisely the code files that you got working.
All the code you wrote: be sure it's well documented so we know which file or which part of the file does what

Specific Instructions for Exercise 1: Implementing containment

Use any programming language you want. (My solution uses Ruby.)

Generate a list of word n-grams for each document in your data set, for n = 1..10 (inclusive)
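As an illustration of what a list of word n-grams looks like, here is a minimal Python sketch. The tokenization rule (lowercase, keep runs of letters, digits, and apostrophes) and the names tokenize and word_ngrams are assumptions made for this example only; the assignment does not prescribe them, so follow whatever convention was discussed in class.

```python
import re


def tokenize(text):
    """Lowercase the text and return its words.

    Assumption: a word is a run of letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy sentence, not one of the assignment's text files.
tokens = tokenize("The quick brown fox jumps.")
for n in range(1, 4):
    print(n, word_ngrams(tokens, n))
```

A document with fewer than n words simply yields an empty list for that n.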
Select a pair of documents (you can download them from "Resources" below). Call them texts A and B, respectively. The specific pairs you will use are:
- 1.txt and 2.txt
- 1.txt and 3.txt
- 2.txt and 6.txt
- 3.txt and 4.txt
- 4.txt and 5.txt
- 3.txt and 5.txt
- 6.txt and 3.txt
- 6.txt and 4.txt
- 6.txt and 5.txt
For each of the text pairs above, compute the containment C_n(A,B) for n = 1..10 (inclusive). That is, for n=1, create the set of unigrams in text A (S_A) and the set of unigrams in text B (S_B). Then determine the size of the intersection of the two sets and the size of S_B, and use them to calculate C_1(A,B). Next, for n=2, create the sets of bigrams S_A and S_B and use them to compute C_2(A,B). Repeat this for each of n=3, ..., 10.
Note: The underscore in C_n(A,B) denotes a subscript.
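The containment steps above can be sketched on a toy example. This sketch assumes that containment is the ratio of the intersection size to the size of S_B, which is how the description above reads; if the definition given in class differs, that definition takes precedence. The function names and the choice to return 0.0 when S_B is empty are also assumptions for illustration.

```python
def ngram_set(tokens, n):
    """Return the set of distinct word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(tokens_a, tokens_b, n):
    """Assumed form: |S_A intersect S_B| / |S_B| for word n-grams of size n."""
    s_a = ngram_set(tokens_a, n)
    s_b = ngram_set(tokens_b, n)
    if not s_b:
        # Assumption: report 0.0 when text B is too short to have any n-grams.
        return 0.0
    return len(s_a & s_b) / len(s_b)


# Toy texts, not the assignment's data files.
a = "the cat sat on the mat".split()
b = "the cat lay on the mat".split()

for n in range(1, 4):
    print(n, containment(a, b, n))
```

Here the two texts share 4 of the 5 distinct unigrams in B, 3 of its 5 bigrams, and 1 of its 4 trigrams.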

Report your results in a table with the following heading:

Filename 1	Filename 2	C_1(A,B)	C_2(A,B)	C_3(A,B)	C_4(A,B)	C_5(A,B)	C_6(A,B)	C_7(A,B)	C_8(A,B)	C_9(A,B)	C_10(A,B)
...
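One possible way to produce rows in this layout is sketched below. It repeats the assumed containment ratio (intersection size over the size of S_B), uses hypothetical in-memory texts instead of the assignment files, and rounds to three decimals; the helper name table_rows, the file names, and the rounding are all illustrative choices, not requirements.

```python
def ngram_set(tokens, n):
    """Return the set of distinct word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(tokens_a, tokens_b, n):
    """Assumed form: |S_A intersect S_B| / |S_B|; 0.0 if S_B is empty."""
    s_a, s_b = ngram_set(tokens_a, n), ngram_set(tokens_b, n)
    return len(s_a & s_b) / len(s_b) if s_b else 0.0


def table_rows(docs, pairs, max_n=10):
    """Build tab-separated rows: a header line, then one line per text pair."""
    header = ["Filename 1", "Filename 2"]
    header += [f"C_{n}(A,B)" for n in range(1, max_n + 1)]
    rows = ["\t".join(header)]
    for name_a, name_b in pairs:
        values = [containment(docs[name_a], docs[name_b], n)
                  for n in range(1, max_n + 1)]
        rows.append("\t".join([name_a, name_b] + [f"{v:.3f}" for v in values]))
    return rows


# Hypothetical toy documents standing in for the real text files.
docs = {
    "x.txt": "the cat sat on the mat".split(),
    "y.txt": "the cat lay on the mat".split(),
}
rows = table_rows(docs, [("x.txt", "y.txt")], max_n=3)
print("\n".join(rows))
```

Tab-separated output pastes cleanly into a spreadsheet or word processor table for the PDF report.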

Interpret the results reported in the table and make sure they make sense to you based on the definitions of word n-grams and containment. Specifically, answer the following questions:
- Which text pairs are definitely plagiarized pairs? You can read the paragraphs in the original files to verify this.
- Which text pairs are similar enough to be worth looking into further by a human?
- What happens to C_n(A,B) as n increases?
- For paragraphs this short, what is a good value of n to use? Justify your choice based on the data in the table.
- For the n you've chosen in the previous question, what is a good threshold value of C_n(A,B) to use (for your text sample)? Justify your choice based on the data in the table.

Grading Criteria

[7 pts] code and output for extracting n-grams for n=1...10 (inclusive)
[5 pts] correctly computing C_n(A,B) for all text pairs in table format
[3 pts] for written answers

Resources:

Set of documents to use for exercise 1: