Our Methods
We
intend to complete the following two tasks in the project:
1.
|
First, we are
going to study the frequent sequence pattern mining problem for frequent
webpage access patterns. As mentioned earlier, each webpage access history
can be viewed as a sequence of URLs. In this problem, we are given a
database of webpage access history and a minimum support β. There are two
ways described in the data mining literature to find frequent subsequence
patterns which appear more that β times in the database. The first approach
is called candidate-generation-and-test. Basically, it generates a candidate
sequence, and then it tests whether this candidate sequence appears more
than β times in the database. So the problem is reduced to testing whether
the candidate sequence is a subsequence of the sequences in the database.
This can be done using dynamic programming, which is very useful yet
normally very challenging for undergraduate students. The second approach is
called prefix-projected pattern growth. The idea is to grow the frequent
patterns from its sub-patterns. The main idea behind this is the following:
if a sequence is a frequent pattern in the database, then any of its
subsequences is also a frequent pattern in the database. Therefore, every
frequent pattern can be grown from its sub-patterns. In this approach, we
will use the technique of projection, which decomposes and projects the
database according to the sub-patterns. More simply, we must decompose each
sequence into two subsequences if the sub-pattern happens to be a
subsequence of the original sequence. |
2.
|
Our second task is more
general, we use trees to model webpage access history, and we will find all
frequent webpage access patterns. In order to use this method for frequent
sequence pattern mining, we first try to encode a tree into a
sequence, which can also be decoded to the original tree. If we
can do this, then we can reduce this problem to our first task. We intend to
use the preorder-string encoding scheme to encode our tree [1]. The
pre-order string of a rooted tree T is defined recursively as the following:
(1) for a rooted tree T with a single node r, the pro-ordering string of T
is S(T)=r-1, where r is the label for the single node r and “-1” is called
end flag; and (2) for a rooted tree T with more than one node, assuming the
root of T is r and the children of r are r1,
r2,
…, rk,
then the pre-order string for T is S(T)= r S(T1)…S(Tk)-1,
in which S(Tk)
is the pre-order string for the subtree rooted at rk.
This definition uses recursion and it is proved to be asymptotically
optimal. For example, for the following tree, its pre-order string is
AB-1C-1-1. Then, we will use techniques similar to
candidate-generation-and-test and prefix-projected pattern [2, 3, 4] growth
to tackle this problem. |
Specific Hypothesis
Given a sequence database and a
tree database for webpage access history, and a minimum support β>0, output all
the frequent subsequence patterns which appear more than β time in the database.
We assume the input database is text-based.
|