COSC 419: Learning Analytics
A6: Plagiarism Detection [15 pts]
Due date: Apr 05, 2020, 11:59pm
In this assignment, you will implement simple plagiarism detection algorithms
as discussed in class. Specific data sets will be provided to you.
What to submit:
Submit the following on Connect:
- A PDF report documenting:
- Output of all n-grams for each text file for n=1...10 (inclusive)
- The table of results
- Written answers for interpreting your results
- If your code doesn't fully work: document which part of the assignment
you were able to finish and which part(s) did not work. In your explanation,
be sure to reference precisely the code files that you got working.
- All the code you wrote: be sure it's well documented so we know which
file or which part of the file does what
Specific Instructions for Exercise 1: Implementing containment
Use any programming language you want. (My solution uses Ruby.)
- Generate a list of word n-grams for each document in your data set, for
n = 1..10 (inclusive)
- Select a pair of documents (you can download them from "Resources"
below). Call them texts A and B respectively. The specific pairs you will
use are:
- 1.txt and 2.txt
- 1.txt and 3.txt
- 2.txt and 6.txt
- 3.txt and 4.txt
- 4.txt and 5.txt
- 3.txt and 5.txt
- 6.txt and 3.txt
- 6.txt and 4.txt
- 6.txt and 5.txt
- For the specific text pairs above, compute containment C_n(A,B)
for n = 1..10 (inclusive). That is, for n=1, create the set of unigrams in
text A (S_A) and the set of unigrams in text B (S_B). Then determine the
size of the intersection of the two sets and the size of S_B, and divide
the former by the latter to calculate C_1(A,B) = |S_A ∩ S_B| / |S_B|.
Next, for n=2, create the sets of bigrams S_A and S_B and use them to
compute C_2(A,B) in the same way. Repeat this for each of n=3, ..., 10.
Note: The underscore in containment means subscript.
- Report your results in a table with the following heading:
Filename 1 | Filename 2 | C_1(A,B) | C_2(A,B) | C_3(A,B) | C_4(A,B) | C_5(A,B) | C_6(A,B) | C_7(A,B) | C_8(A,B) | C_9(A,B) | C_10(A,B) |
... |
- Interpret the results reported in the table and make sure they make
sense to you based on the definitions of word n-grams and containment.
Specifically, answer the following questions:
- Which text pairs are definitely plagiarized pairs? You can read the
paragraphs in the original files to verify this.
- Which text pairs are similar enough that they are worth a closer look
by a human?
- What happens to C_n(A,B) as n increases?
- For paragraphs this short, what is a good value of n to use?
Justify your choice based on the data in the table.
- For the n you've chosen in the previous question, what is a good
value of C_n(A,B) to use (for your text sample)?
Justify your choice based on the data in the table.
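To make the steps of Exercise 1 concrete, here is a minimal sketch in Python of n-gram extraction and containment. It is one possible approach, not the reference solution. It rests on several assumptions that the assignment does not state: words are lowercased and split on anything other than letters, digits, and apostrophes; containment is taken as |S_A ∩ S_B| / |S_B|, following the description above; and C_n is reported as 0 when text B has fewer than n words. The function names (tokenize, word_ngrams, containment, table_row) are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a "word" is a run of letters, digits, or apostrophes.
    A different normalization will change the resulting n-grams.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(tokens, n):
    """Return all word n-grams as tuples, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(text_a, text_b, n):
    """Compute C_n(A,B) = |S_A & S_B| / |S_B| over sets of word n-grams."""
    s_a = set(word_ngrams(tokenize(text_a), n))
    s_b = set(word_ngrams(tokenize(text_b), n))
    if not s_b:
        # Assumption: report 0 when text B is too short to have any n-grams.
        return 0.0
    return len(s_a & s_b) / len(s_b)


def table_row(name_a, text_a, name_b, text_b, max_n=10):
    """Format one row of the results table for n = 1..max_n."""
    values = [containment(text_a, text_b, n) for n in range(1, max_n + 1)]
    return " | ".join([name_a, name_b] + [f"{v:.3f}" for v in values])


if __name__ == "__main__":
    # Toy strings for illustration only; these are not the assignment data.
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown cat jumps over the lazy dog"
    print(table_row("A", a, "B", b, max_n=3))
```

With the real data, the two strings would be replaced by the contents of the files in each pair (for example 1.txt and 2.txt), and table_row would be called once per pair to produce one line of the table.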
Grading Criteria
- [7 pts] code and output for extracting n-grams for n=1...10 (inclusive)
- [5 pts] correctly computing C_n(A,B) for all text pairs in table format
- [3 pts] for written answers
Resources:
- Set of documents to use for exercise 1: