|
Frequent pattern mining, which
discovers frequent patterns in a database, is an important data mining
problem with broad applications, including the analyses of customer
purchase behavior, web access patterns, scientific experiments, disease
treatments, natural disasters, DNA sequences, and so on [5]. For a
specific example, suppose we are given a database of DNA sequences for
certain species, and a minimum support β, which is a number greater than
0, we want to determine the subsequences which appear more than β times
in the DNA sequence database. In biology, these frequent DNA
subsequences often correspond to a common ancestor or a common virus.
There are many different kinds of
frequent pattern mining problems. In particular, we have frequent
sequential pattern mining problems and frequent subtree pattern mining
problems. Frequent sequential pattern mining discovers frequent
subsequence patterns in a sequence database, while frequent subtree
pattern mining discovers frequent subtree patterns within
semi-structured tree databases. In this project, we are going to study
these two frequent pattern mining problems through an example study,
searching for frequent webpage access patterns.
In order to maximize webpage
access efficiency (such as posting advertisements in certain frequently
accessed webpages), the designers often want to know which webpages are
accessed more often than others. First, we will study a simpler version
of this problem. Note that we can identify each webpage by its unique
URL. If we assume that webpage users never use the “back button” in
their browser when they surf webpages, then each webpage access history
can be viewed as a sequence of URLs. Therefore, the frequent webpage
access pattern problem reduces to the problem of mining frequent
sequential pattern within a sequence database.
Our second approach is more general, we assume that webpage users do use
the back button when they browse webpages, hence each webpage access
history can be viewed as a labeled tree. In this tree, the labels are
the URLs for the webpages, the root is the webpage where the user starts
to surf, and the internal nodes of the tree correspond to the webpages
where users use the “back button” to return to other links in that page.
Applying this model, the frequent webpage access pattern problem is
reduced to the frequent subtree pattern mining problem. Our goal in this
project is to design and implement efficient algorithms to find frequent
webpage access patterns from a given database of access history.
|