COSC 419: Learning Analytics
A4: Project Proposal and Bayes Net Exercise [28 pts]
Due date: Mar 08, 2020, 11:59pm
What to submit:
Submit the following on Connect:
- A PDF report documenting:
  - The project proposal
  - The Bayes net model definition and CPTs (with rationale)
  - The Bayes net structure produced from your Matlab code
  - The answers for the inference questions
  - If your code doesn't fully work: document which part of the assignment
    you were able to finish and which part(s) did not work. In your
    explanation, be sure to reference precisely the code files that you got
    working.
- All the code you wrote: be sure it's well documented so we know which
  file or which part of the file does what (a README file would be helpful)
Specific Instructions for Exercise 1: Project Proposal [8 pts]
Provide a short proposal of what you want to do for the individual course
project. Specifically:
- You may pick from one of the project suggestions (see bottom of this
  page) or think of something on your own. To ensure your topic makes sense,
  fill in the details as requested below.
- If you pick from a suggested topic:
  - Indicate which topic it is
  - Specify the output of the project
  - Specify the steps needed to complete it
  - For each step, provide a milestone deadline for completing it (in a table)
- If you pick your own topic:
  - In an overview paragraph, explain what you want to do in the project.
    Then mention what specific aspect of it interests you and why you want to
    do it.
  - Specify the output of the project
  - Specify the steps needed to complete it
  - For each step, provide a milestone deadline for completing it (in a table)
  - Ensure you have reviewed this with the instructor before submitting
    it, so that the topic you pick is within the scope of this course and of
    a reasonable size (similar to the ones in the list of suggested topics).
- If you are unsure about any of the details, contact the instructor well
  in advance to discuss.
Grading Criteria
- [1 pt] Statement of interest
- [4 pts] Steps are detailed, clear, and make sense
- [1 pt] Within topic scope of the course
- [2 pts] Reasonable size
Specific Instructions for Exercise 2: Learning Bayes Net Parameters [20 pts]
In A1, you ran a couple of Java programs and generated some data. Here, we are
going to use that data to create a simple Bayes net that is commonly used in
intelligent tutoring systems. The idea of an intelligent tutoring system is
that an individual user works on some practice questions (in this case, math
questions), and the system observes how the user is doing, and potentially
suggests hints to help the user get the right answer if the system believes
the user needs help.
We have created a simple model for such a system. In this model, a variable
Difficulty = {easy, hard} represents whether a question is easy or hard. It
influences whether the user gets the answer right or wrong (Accuracy) and the
time it takes to answer the question (TaskTime). Both Accuracy and TaskTime
influence whether the user needs help (NeedHelp). In our intelligent tutoring
system, providing help to the user means displaying a hint. We can estimate
how much help a user needs based on the time that a hint stays on the screen
(DisplayTime). If the user needs help and a hint is displayed, then the user
will spend a "reasonable amount of time" reading the hint. If the hint is given
but the user didn't actually need help, the time spent is much less or much
more than what we would normally expect. (Less, because the hint is up and the
user just closes it without reading; more, because the hint is up and the user
is ignoring it altogether.) The graphical model is drawn in the diagram below.
Its edges are Difficulty -> Accuracy, Difficulty -> TaskTime, Accuracy ->
NeedHelp, TaskTime -> NeedHelp, and NeedHelp -> DisplayTime.
To make things more consistent, I am defining the remaining random variables
Accuracy, TaskTime, NeedHelp, and DisplayTime with the following values:
- Accuracy = { wrong, right }
- TaskTime = { slow, fast }
- NeedHelp = { false, true }
- DisplayTime = { short, average, long }
At this point, you can define the structure of the Bayes net in Matlab using
the BNT package. You may want to do this now and check the structure before
moving on. To check it, view bnet.dag, where bnet is the Bayes net you defined,
and manually inspect the array to make sure the only elements equal to 1 are
those where the row variable is a parent of the column variable.
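As a hedged illustration of the structure check described above, the following sketch builds the adjacency matrix for the five variables and lists its edges. It assumes BNT is installed and on the Matlab path. The index names Diff, Acc, TTime, Help, and DTime and the node numbering are choices made here for illustration, not a required layout.

```matlab
% Sketch only: assumes the BNT toolbox is on the Matlab path.
% Node indices are illustrative; BNT expects parents to be numbered
% before their children (topological order).
N = 5;
Diff = 1; Acc = 2; TTime = 3; Help = 4; DTime = 5;

dag = zeros(N, N);             % dag(i, j) = 1 means i is a parent of j
dag(Diff, [Acc TTime]) = 1;    % Difficulty -> Accuracy, TaskTime
dag([Acc TTime], Help) = 1;    % Accuracy, TaskTime -> NeedHelp
dag(Help, DTime) = 1;          % NeedHelp -> DisplayTime

node_sizes = [2 2 2 2 3];      % number of values per variable
bnet = mk_bnet(dag, node_sizes, 'discrete', 1:N);

% List every edge as a (parent, child) pair for manual inspection.
[parent, child] = find(bnet.dag);
edges = sortrows([parent child]);
disp(edges);
```

Each row of edges is one parent-to-child relation, so the list should contain exactly five rows for this model.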
Remember that a Bayes net also has a quantitative component: the CPTs
(conditional probability tables). Define the CPT for Pr(Difficulty) with a
uniform prior distribution. Next, define the CPTs for Pr(Accuracy|Difficulty)
and Pr(TaskTime|Difficulty) using the data collected in A1 (download
all-hard.txt and all-easy.txt). Use the data to compute the average accuracy
rate and the average task time for each of the easy and hard conditions. The
times are in nanoseconds, so you'll want to convert them to seconds. (I got an
average time of 6.3859 secs with 96% accuracy for the easy condition, and
30.0690 secs with 89% accuracy for the hard condition.) For Accuracy, just use
the average accuracy to define Pr(Accuracy=right|Difficulty). For TaskTime,
compute the fraction of trials whose TaskTime is slower (longer) than the
average, then use that fraction to define Pr(TaskTime=slow|Difficulty). Using
the data, when Difficulty is easy, I found TaskTime to be slow 37% of the
time. Do the same calculation for the other condition, when Difficulty is hard.
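One way to carry out this calculation is sketched here. The vectors time_ns and correct hold small placeholder values, not the A1 data, because the layout of all-easy.txt and all-hard.txt is not described in this handout; load the real columns in their place. The sketch also assumes that a trial counts as slow when its time is longer than the condition's average.

```matlab
% Placeholder inputs (NOT the A1 data): one entry per answered question.
time_ns = [4e9 5e9 6e9 13e9];   % task times in nanoseconds
correct = [1 1 0 1];            % 1 = right answer, 0 = wrong answer

time_s   = time_ns / 1e9;            % nanoseconds -> seconds
p_right  = mean(correct);            % estimate of Pr(Accuracy=right | Difficulty)
avg_time = mean(time_s);             % average task time in seconds
p_slow   = mean(time_s > avg_time);  % fraction of trials slower than average

fprintf('accuracy = %.2f, mean time = %.4f s, Pr(slow) = %.2f\n', ...
        p_right, avg_time, p_slow);
```

Running the same lines once per condition gives the two rows needed for each of the Accuracy and TaskTime CPTs.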
After you have entered the Bayes net into Matlab, you may want to use the
command "get_field( bnet.CPD{Acc}, 'cpt' )" to display the probabilities that
were entered into your model (here named bnet) at the index Acc (which I
defined to be 2, since Accuracy is the second variable).
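The sketch below shows, under stated assumptions, how the CPT for Accuracy might be entered and then read back with get_field. It assumes BNT is on the path, uses illustrative node indices (Diff = 1, Acc = 2, and so on), and fills in only the uniform prior and the accuracy rates quoted above (96% easy, 89% hard). BNT stores a CPT with the parent index varying fastest, so the flat vector lists both Accuracy=wrong entries before both Accuracy=right entries.

```matlab
% Sketch only: assumes BNT is on the path; node indices are illustrative.
N = 5; Diff = 1; Acc = 2; TTime = 3; Help = 4; DTime = 5;
dag = zeros(N, N);
dag(Diff, [Acc TTime]) = 1;
dag([Acc TTime], Help) = 1;
dag(Help, DTime) = 1;
bnet = mk_bnet(dag, [2 2 2 2 3], 'discrete', 1:N);

% Value order: Difficulty = {easy, hard}, Accuracy = {wrong, right}.
bnet.CPD{Diff} = tabular_CPD(bnet, Diff, 'CPT', [0.5 0.5]);   % uniform prior

% Order: Pr(wrong|easy), Pr(wrong|hard), Pr(right|easy), Pr(right|hard)
bnet.CPD{Acc} = tabular_CPD(bnet, Acc, 'CPT', [0.04 0.11 0.96 0.89]);

% Read the table back: rows = Difficulty, columns = Accuracy.
acc_cpt = get_field(bnet.CPD{Acc}, 'cpt');
disp(acc_cpt);
disp(sum(acc_cpt, 2));   % each row should sum to 1
```

Checking that every row sums to 1 is a quick way to catch entries that were typed in the wrong order.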
As we don't have data involving NeedHelp, the last step in creating the model
is to handcraft Pr(NeedHelp|Accuracy,TaskTime) and Pr(DisplayTime|NeedHelp)
with reasonable parameters and explain in English the rationale you used to
define these parameters.
Although we don't have data for DisplayTime, we have some idea of how long
people take to read. Download all-read.txt to figure out the average time it
takes for someone to read a word. (I got 0.2772 sec per word using the data.)
Assuming a hint is about 25 words, come up with the average reading time you
would expect for reading a hint. Around that average, define a lower bound
below which the person is reading too fast (so perhaps just closing the hint
without reading it) and an upper bound above which the person is taking too
long (so perhaps the hint is just ignored). Use these bounds to define
DisplayTime. Make sure you state clearly what bounds you use in your
definition because you will need them for the query below.
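As a hedged sketch of this discretization step, the code below turns the 0.2772 sec per word figure into an expected reading time for a 25-word hint and maps a display time onto the three DisplayTime values. The bound factors 0.5 and 1.4 are arbitrary placeholders chosen for illustration, not values given in this assignment; choose and justify your own.

```matlab
sec_per_word   = 0.2772;   % average reading time per word, from the text
words_per_hint = 25;
avg_read = sec_per_word * words_per_hint;   % about 6.93 seconds

% Placeholder bounds (assumptions for illustration only).
lower_bound = 0.5 * avg_read;   % below this: hint closed without reading
upper_bound = 1.4 * avg_read;   % above this: hint probably ignored

% Map a display time in seconds to 1 = short, 2 = average, 3 = long.
discretize_time = @(t) 1 + (t >= lower_bound) + (t > upper_bound);

labels = {'short', 'average', 'long'};
t_obs = 10;   % example observation in seconds
fprintf('%.2f s -> DisplayTime = %s\n', t_obs, labels{discretize_time(t_obs)});
```

With these placeholder bounds a 10-second observation falls in the long category; different bounds may place it elsewhere.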
Once you have encoded the above network:
Let's say the system displays a hint and we observe that the DisplayTime for
that hint was 10 seconds. What is Pr(NeedHelp=true|DisplayTime=long)?
In my model, 10 seconds corresponds to a value that is too long for an average
person to spend reading a hint. Figure out what 10 seconds means in your
model based on your variable definition and adjust your query accordingly.
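A query of this kind could be run along the following lines. This is a sketch that assumes BNT is on the path and uses illustrative node indices. Only the uniform prior, the 96% and 89% accuracies, and the 37% slow rate for easy questions come from this handout; every value marked PLACEHOLDER is made up so that the script runs, and must be replaced with your own estimates and rationale.

```matlab
% Sketch only: assumes BNT is on the path. Values marked PLACEHOLDER are
% made up for illustration and must be replaced with your own.
N = 5; Diff = 1; Acc = 2; TTime = 3; Help = 4; DTime = 5;
dag = zeros(N, N);
dag(Diff, [Acc TTime]) = 1;
dag([Acc TTime], Help) = 1;
dag(Help, DTime) = 1;
bnet = mk_bnet(dag, [2 2 2 2 3], 'discrete', 1:N);

bnet.CPD{Diff} = tabular_CPD(bnet, Diff, 'CPT', [0.5 0.5]);
bnet.CPD{Acc}  = tabular_CPD(bnet, Acc,  'CPT', [0.04 0.11 0.96 0.89]);

% TaskTime = {slow, fast}; rows = Difficulty (easy, hard).
p_slow_hard = 0.40;                             % PLACEHOLDER
time_cpt = [0.37 0.63; p_slow_hard 1 - p_slow_hard];
bnet.CPD{TTime} = tabular_CPD(bnet, TTime, 'CPT', time_cpt(:)');

% NeedHelp = {false, true}; dimensions are (Accuracy, TaskTime, NeedHelp).
help_cpt = zeros(2, 2, 2);
help_cpt(:, :, 2) = [0.9 0.6; 0.3 0.1];         % PLACEHOLDER Pr(NeedHelp=true)
help_cpt(:, :, 1) = 1 - help_cpt(:, :, 2);
bnet.CPD{Help} = tabular_CPD(bnet, Help, 'CPT', help_cpt(:)');

% DisplayTime = {short, average, long}; rows = NeedHelp (false, true).
disp_cpt = [0.45 0.10 0.45; 0.10 0.80 0.10];    % PLACEHOLDER
bnet.CPD{DTime} = tabular_CPD(bnet, DTime, 'CPT', disp_cpt(:)');

% Enter the evidence DisplayTime = long (value 3) and query NeedHelp.
engine = jtree_inf_engine(bnet);
evidence = cell(1, N);
evidence{DTime} = 3;
engine = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, Help);
fprintf('Pr(NeedHelp=true | DisplayTime=long) = %.4f\n', marg.T(2));
```

Building each table as an array indexed by (parents, child) and flattening it with (:)' keeps the entry order consistent with how BNT reads a CPT. If 10 seconds maps to a different DisplayTime value in your model, change the evidence value accordingly.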
Grading Criteria
- [1 pt] CPT for Pr(Difficulty)
- [4 pts, 2 pts each] scripts to compute accuracy and task completion time averages
- [2 pts] script to estimate the average reading time for a hint
- [4 pts, 2 pts each] CPTs for Pr(Accuracy|Difficulty) and Pr(TaskTime|Difficulty)
- [2 pts] reasonable definition for Pr(NeedHelp|Accuracy,TaskTime) with rationale
- [2 pts] reasonable definition for Pr(DisplayTime|NeedHelp) with rationale
- [2 pts] script to compute the average reading time for a word, and
clear definition of the bounds of what constitutes DisplayTime being too
short, average, and too long in Matlab
- [2 pts] script to encode the Bayes net in Matlab
- [1 pt] Matlab script and value for the probability that the user needs
help, given that a hint was displayed for 10 seconds
Note: You may have just one Matlab file with all your code in it, rather than
separate Matlab scripts.
Resources:
Project Suggestions
- Implement the junction tree inference algorithm according to the steps
defined in "Inference in Belief Networks: A Procedural Guide" (Huang and
Darwiche, 1996). Test it out on a sample Bayes net. Note that this option is
quite involved, so building the secondary structure successfully will get you
an A- in the project. Getting the propagation part working correctly will get
you more marks. Also: don't use a statically typed language (e.g., Java); it
will cause you a lot of pain with the potentials.
[Group of 2 version:] The above steps and completing the propagation part.
- [Group of 2] Implement a simple version of GPLAG that automatically creates
a dependency graph from input source code. Test your code appropriately and
show clearly that it works. For each pair of input programs, compute the
distance between their corresponding graphs. Show your results for the
following test cases and provide the associated Java code:
(i) when 2 programs are obviously different,
(ii) when 2 programs are obviously similar/plagiarized from each other,
(iii) when it's unclear (by human eye) whether 2 programs are similar but
their dependencies show that they actually are similar, and
(iv) when it's unclear (in any way) whether 2 programs are similar.
- Use an API (such as LinkedIn) to learn skill sets from users and job
postings. Narrow your focus to specific types of users (e.g., ages 20-25) and
job postings in one geography. Do simple data analyses to visualize user
skills (e.g., most common) and skills needed in job postings (e.g., most
sought-after skills).
[Group of 2 version:] Identify users who are similar to each other and
jobs that are similar to each other. Finally, conduct a simple analysis
between the two sets of skills to identify whether users have the skills
required by today's jobs. (Be sure to clearly explain your analysis methods,
why you chose those methods, and the results you found.)
- Explore a social media API of your choice to gather actual data from
specific user accounts to interpret learning information (e.g. identify
topics students have trouble with, identify the major complaint topics,
identify major and positive contributors to discussions, identify types of
contributions made in discussions). Create a visual dashboard for an
administrator to see these analyses.