UAH
Computer Science
Research Project

 

 

 

Finding Frequent Webpage Access Patterns

Upcoming Events: TESTING PHASE....

 

Our Methods

We intend to complete the following two tasks in the project:

    1.

    First, we are going to study the frequent sequence pattern mining problem for frequent webpage access patterns. As mentioned earlier, each webpage access history can be viewed as a sequence of URLs. In this problem, we are given a database of webpage access history and a minimum support β. There are two ways described in the data mining literature to find frequent subsequence patterns which appear more that β times in the database. The first approach is called candidate-generation-and-test. Basically, it generates a candidate sequence, and then it tests whether this candidate sequence appears more than β times in the database. So the problem is reduced to testing whether the candidate sequence is a subsequence of the sequences in the database. This can be done using dynamic programming, which is very useful yet normally very challenging for undergraduate students. The second approach is called prefix-projected pattern growth. The idea is to grow the frequent patterns from its sub-patterns. The main idea behind this is the following: if a sequence is a frequent pattern in the database, then any of its subsequences is also a frequent pattern in the database. Therefore, every frequent pattern can be grown from its sub-patterns. In this approach, we will use the technique of projection, which decomposes and projects the database according to the sub-patterns. More simply, we must decompose each sequence into two subsequences if the sub-pattern happens to be a subsequence of the original sequence.

     

    2.

    Our second task is more general, we use trees to model webpage access history, and we will find all frequent webpage access patterns. In order to use this method for frequent sequence pattern mining, we first try to encode a tree into a sequence, which can also be decoded to the original tree. If we can do this, then we can reduce this problem to our first task. We intend to use the preorder-string encoding scheme to encode our tree [1]. The pre-order string of a rooted tree T is defined recursively as the following: (1) for a rooted tree T with a single node r, the pro-ordering string of T is S(T)=r-1, where r is the label for the single node r and “-1” is called end flag; and (2) for a rooted tree T with more than one node, assuming the root of T is r and the children of r are r1, r2, …, rk, then the pre-order string for T is S(T)= r S(T1)…S(Tk)-1, in which S(Tk) is the pre-order string for the subtree rooted at rk. This definition uses recursion and it is proved to be asymptotically optimal.  For example, for the following tree, its pre-order string is AB-1C-1-1. Then, we will use techniques similar to candidate-generation-and-test and prefix-projected pattern [2, 3, 4] growth to tackle this problem.

 

Specific Hypothesis

Given a sequence database and a tree database for webpage access history, and a minimum support β>0, output all the frequent subsequence patterns which appear more than β time in the database. We assume the input database is text-based.

 

Comment Book | Add a Comment | Search | References | Implementation_Code.htm

Copyright © 2006 CSRP
All Rights Reserved