Mining Information Search Pattern on Website: A Case Study of Educational Institution

Chapter 1

Introduction

Nowadays almost every organization, whether small or large, has its own website. According to Wikipedia, a website is a collection of related web resources, such as web pages and multimedia content, which are typically identified with a common domain name and published on at least one web server. Websites are among the largest online services for distributing information globally across many fields. Currently, various organizations, companies, and institutions use websites to publish information about their programs, activities, and other relevant matters so that the public can access it easily. Growing internet usage encourages these organizations to engage with the public through various online media. The website is one such online information channel, enabling visitors to search for detailed information.

One example is the use of a website in an educational institution's selection process as a medium to communicate, share, and disseminate important information required by students and other related parties. Information published on the official website of the selection process is more likely to be trusted by visitors, because the website is supervised directly by the management of the institution, which supports its accuracy and credibility. Extracting patterns of visitor behavior is important for identifying trends in information searching and page access, so that the website can be improved to offer better accessibility and user performance. However, such pattern extraction is rarely carried out on educational websites, even though these websites are considered useful. Visitors visit the web pages that match their needs and interests, and from these differences, patterns in how visitors access information can be identified. The large volume of web data makes pattern discovery complicated if it is done manually. Therefore, web usage mining can be used to extract hidden information from web data.

Chapter 2

Web Mining

[Figure: Data Mining on Web]

Web mining is the process of using data mining techniques and algorithms to extract information directly from the Web, drawing on Web documents and services, Web content, hyperlinks, and server logs. The goal of Web mining is to look for patterns in Web data by collecting and analyzing information in order to gain insight into trends, the industry, and users in general. According to Wikipedia, Web mining is the application of data mining techniques to discover patterns from the World Wide Web. As the name suggests, this is information gathered by mining the web. It uses automated tools to discover and extract data from servers and web2 reports, and it gives organizations access to both structured and unstructured data from browser activities, server logs, website and link structure, page content, and other sources. Web mining is an umbrella term that covers three different types of mining techniques:

  • Web content mining
  • Web structure mining
  • Web usage mining

2.1 Techniques of Mining

2.1.1 Web content mining

Data is extracted from web pages in order to discover patterns that give significant insight. There are many techniques for extracting the data, such as web scraping; Scrapy and Octoparse, for instance, are well-known tools that support the web content mining process.

As an example, before holding an event or program, an organization first analyzes candidate locations to decide which one is best suited for full attendance. To perform this analysis, one has to gather location-specific information about the city and state and about how far the venue is from the invitees. Any such location-specific data can be extracted from the web, and that is where web content mining comes into the picture.

2.1.2 Web structure mining

Data from hyperlinks that lead to different pages is gathered and prepared in order to discover a pattern. For example, a person's public profile on a blog or other web page is likely to embed links to their social media accounts. The data is therefore extracted not only from a single source but also from the nested pages reached through the hyperlinks on each page. Various algorithms exist for this task.

2.1.3 Web usage mining

When a web application is hosted, the web server generates a large number of logs about users' activity in the application. These logs are treated as raw data from which meaningful data is extracted and patterns are identified.
For instance, when an e-commerce business wants to expand or to improve the customer experience, it monitors users' web activity through the application logs and applies data mining to it. Web usage mining is covered in more detail in the next chapter.

Chapter 3

Methodology

3.1    Web Usage Mining

[Figure: Web Usage Mining]

The World Wide Web is a growing collection of large amounts of information, and finding the appropriate information usually takes a great deal of time, so various techniques are needed to analyze the data. One of these techniques is Web mining, which lets us analyze and discover useful information from the web. Web Usage Mining (WUM) extracts useful information about users' needs and preferences from web server logs. To extract and process the information, web usage mining follows two main steps [1] [2]: data preprocessing and pattern discovery. The data on the web is a huge collection of raw data, so it must be preprocessed before the information users need can be obtained. The phases of web usage mining include data cleaning, data preparation, user identification, session identification, data integration, data transformation, pattern discovery, and pattern analysis. Data preprocessing is the most critical phase in WUM. It can be done on the original data or on data integrated from multiple sources. The purpose of web usage mining is to discover hidden information from weblog data, so the data has to be mined from log files. Log files record the activity of users, such as which websites they visit and with whom they exchange e-mail. These files are maintained by the system administrator.

3.2    Pattern Mining

Pattern mining is a data mining method used for discovering interesting patterns in a dataset. There are several pattern mining methods; one of the most commonly used is association rules.

3.2.1 Association Rules

Association rules are if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases. Association rule mining has a number of applications and is widely used to help discover sales correlations in transactional data.

Association rule mining, at a basic level, involves the use of machine learning models to analyze data for patterns, or co-occurrence, in a database. It identifies frequent if-then associations, which are called association rules. An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found within the data. A consequent is an item found in combination with the antecedent.

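Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships. Support indicates how frequently the items appear together in the data. Confidence indicates how often the if-then statement is found to be true, that is, the proportion of records containing the antecedent that also contain the consequent. A third metric, called lift, compares the confidence with the expected confidence, which is the support of the consequent alone; a lift above 1 means the antecedent and consequent occur together more often than they would if they were independent.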

[Figure: Association Rules]

Association rules are calculated from itemsets, which are made up of two or more items. If rules were built from all possible itemsets, there could be so many rules that they would hold little meaning. For that reason, association rules are typically created only from itemsets that are well represented in the data.
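To make the three metrics concrete, here is a minimal Python sketch that computes support, confidence, and lift for a single rule. The session data and the function names are invented for illustration and are not taken from the study's weblogs.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Share of transactions with the antecedent that also hold the consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent: Iterable[str], consequent: Iterable[str],
         transactions: List[FrozenSet[str]]) -> float:
    """Confidence divided by the overall support of the consequent."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


# Hypothetical sessions, invented for illustration only.
sessions = [
    frozenset({"Home", "News"}),
    frozenset({"Home", "Info", "News"}),
    frozenset({"Home", "Registration"}),
    frozenset({"Info", "News"}),
    frozenset({"Home"}),
]

# Metrics for the rule {Info} => {News}.
rule_support = support({"Info", "News"}, sessions)       # 2 of the 5 sessions
rule_confidence = confidence({"Info"}, {"News"}, sessions)
rule_lift = lift({"Info"}, {"News"}, sessions)
```

In this toy data every session containing Info also contains News, so the confidence is 1.0, and the lift is above 1 because News appears in only three of the five sessions overall.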

3.2.2  FP-Growth Algorithm

The Frequent Pattern (FP)-Growth method is used with databases and not with streams. The Apriori algorithm needs n+1 scans of the database, where n is the length of the longest pattern. With the FP-Growth method, the number of scans of the entire database can be reduced to two. The algorithm extracts frequent itemsets, identified by their support, which can then be used to derive association rules. The main idea of the algorithm is to use a divide-and-conquer strategy.

The construction of an FP-tree is subdivided into three major steps.

  • Scan the data set to determine the support count of each item, discard the infrequent items, and sort the frequent items in decreasing order of support.
  • Scan the data set one transaction at a time to create the FP-tree. For each transaction:
    • If it is a unique transaction, form a new path and set the counter for each node to 1.
    • If it shares a common prefix itemset with an existing path, increment the counters of the common nodes and create new nodes if needed.
  • Continue this until each transaction has been mapped onto the tree.

[Figure: FP-Growth Algorithm]
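The tree construction steps listed above can be sketched in Python as follows. This is a simplified illustration only: it builds the tree but omits the header table and the recursive mining of conditional trees that a full FP-Growth implementation needs. The class name, function name, and sample transactions are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


class FPNode:
    """One node of the FP-tree: an item, its counter, and its child nodes."""

    def __init__(self, item: Optional[str], parent: Optional["FPNode"]) -> None:
        self.item = item
        self.count = 0
        self.parent = parent
        self.children: Dict[str, "FPNode"] = {}


def build_fp_tree(transactions: List[List[str]], min_count: int) -> FPNode:
    # Scan 1: count the support of each item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    # Scan 2: map every transaction onto the tree, one path at a time.
    for t in transactions:
        # Frequent items in decreasing support order (ties broken by name).
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                # No shared prefix from here on: start a new branch.
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1  # A new node goes to 1, a shared node is incremented.
            node = child
    return root


# Hypothetical transactions, invented for illustration only.
tree = build_fp_tree(
    [
        ["Home", "News"],
        ["Home", "News", "Info"],
        ["Home", "Registration"],
        ["News", "Info"],
    ],
    min_count=2,
)
```

With a minimum count of 2, Registration is discarded as infrequent, the first three transactions share the Home prefix, and the fourth transaction starts a separate branch at News.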

Chapter 4

IMPLEMENTATION

The main idea of this study is to discover the hidden patterns of information searching on a website owned by an educational institution by implementing a frequent pattern mining algorithm. There were three main steps in this study. The first was data pre-processing, which cleaned the weblog data by eliminating irrelevant elements. Next, pattern discovery was done by applying the FP-Growth algorithm to find frequent itemsets, which here are the sets of pages frequently visited together on the website. The last step was to analyze the patterns and give recommendations to the educational institution for improving the quality and effectiveness of the website.

4.1    Preprocessing Weblog Data

Data pre-processing was done on the web log data that had been collected at the beginning. There were five pre-processing steps: data cleaning, user identification, session identification, page view categorization, and path completion. The information available on the web is heterogeneous and unstructured, so the preprocessing phase is a prerequisite for discovering patterns. The goal of preprocessing is to transform the raw clickstream data into a set of user profiles. Data preprocessing presents a number of unique challenges, which have led to a variety of algorithms and heuristic techniques for tasks such as merging and cleaning, user identification, and session identification. Various research works have been carried out in this area on grouping sessions and transactions, which are then used to discover user behavior patterns.

4.1.1 Data Cleaning

Data cleaning is the process of removing irrelevant items, such as requests for JPEG, GIF, or sound files and references created by spider navigation. Better data quality improves the analysis that follows. The HTTP protocol requires a separate connection for every request made to the web server, so when a user requests a particular page, the graphics and scripts downloaded in addition to the HTML file also appear as entries in the server log.
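As a rough illustration of this cleaning step, the sketch below parses log lines and drops requests for media, script, and style files as well as requests from crawlers. It assumes the logs are in the common Combined Log Format; the suffix list, the crawler markers, and the sample lines are assumptions made for this example, not details from the study.

```python
import re
from typing import Dict, Iterable, List, Optional

# Assumes the Combined Log Format; the study does not state the log layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
# Illustrative lists only; a real project would tune both.
IGNORED_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js", ".ico", ".mp3", ".wav")
BOT_MARKERS = ("bot", "crawler", "spider")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_relevant(entry: Dict[str, str]) -> bool:
    """Keep page requests; drop embedded resources and crawler traffic."""
    path = entry["url"].split("?", 1)[0].lower()
    agent = entry["agent"].lower()
    if path.endswith(IGNORED_SUFFIXES):
        return False
    return not any(marker in agent for marker in BOT_MARKERS)


def clean_log(lines: Iterable[str]) -> List[Dict[str, str]]:
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and is_relevant(e)]


# Hypothetical log lines, invented for illustration only.
sample_lines = [
    '203.0.113.5 - - [01/Feb/2018:10:00:00 +0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [01/Feb/2018:10:00:01 +0700] "GET /images/logo.gif HTTP/1.1" 200 900 "http://example.edu/index.html" "Mozilla/5.0"',
    '198.51.100.7 - - [01/Feb/2018:10:00:02 +0700] "GET /news HTTP/1.1" 200 3000 "-" "ExampleBot/1.0 (crawler)"',
]
cleaned = clean_log(sample_lines)
```

Of the three sample lines, only the request for /index.html survives: the second is an embedded image and the third comes from a crawler.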

4.1.2  User Identification

Identifying the individual users who access a website is an important step in web usage mining, and various methods can be used. The simplest method is to assign a different user ID to each IP address. However, many users behind a proxy server share the same address, and the same user may use several browsers. The Extended Log Format helps to overcome this problem by recording the referrer and the user agent. If the IP address of an entry is the same as that of the previous entry but the user agent is different, the entry is assumed to belong to a new user. If both the IP address and the user agent are the same, the referrer URL and the site topology are checked: if the requested page is not directly reachable from any of the pages already visited by the user, the entry is attributed to a new user at the same address. The caching problem, in which pages served from the browser cache never reach the server log, can be reduced by assigning a short expiration time to HTML pages, forcing the browser to retrieve every page from the server.
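The first part of this heuristic, treating each distinct combination of IP address and user agent as a separate user, can be sketched as below. The referrer and site topology check is deliberately left out, and the field names and sample entries are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_user_ids(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Label each entry with a user ID based on its (IP, user agent) pair.

    Simplification: the referrer and site topology check described in the
    text is not implemented here.
    """
    known: Dict[Tuple[str, str], int] = {}
    labelled = []
    for entry in entries:
        key = (entry["ip"], entry["agent"])
        if key not in known:
            known[key] = len(known) + 1  # First time this pair is seen.
        labelled.append({**entry, "user_id": f"user-{known[key]}"})
    return labelled


# Hypothetical entries: one shared IP address, two different browsers.
entries = [
    {"ip": "203.0.113.5", "agent": "Mozilla/5.0 (Windows)", "url": "/index.html"},
    {"ip": "203.0.113.5", "agent": "Mozilla/5.0 (Android)", "url": "/news"},
    {"ip": "203.0.113.5", "agent": "Mozilla/5.0 (Windows)", "url": "/info"},
]
labelled = assign_user_ids(entries)
```

Here the first and third entries are attributed to the same user, while the second, which comes from the same address but a different user agent, is treated as a second user.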

4.1.3  Session Identification

A user session can be defined as the set of pages visited by the same user within the duration of one particular visit to a website. A user may have one or several sessions during a period. Once a user has been identified, that user's clickstream is partitioned into logical clusters. This partitioning into sessions is called sessionization or session reconstruction. In this study, 30 minutes was used as the default session timeout.
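A minimal sketch of time-based sessionization is shown below. It assumes the 30-minute timeout refers to the gap between two consecutive requests from the same user, which is one common reading; the text does not say whether the study used this rule or a cap on total session length. The function name and sample clicks are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumed to be an inactivity gap between consecutive requests.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(clicks: List[Tuple[str, datetime, str]]) -> List[List[str]]:
    """Split (user_id, timestamp, page) clicks into per-user sessions."""
    sessions: List[List[str]] = []
    open_pages: Dict[str, List[str]] = {}
    last_seen: Dict[str, datetime] = {}
    for user, stamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        if user in last_seen and stamp - last_seen[user] > SESSION_TIMEOUT:
            # The gap is too long: close the user's current session.
            sessions.append(open_pages.pop(user))
        open_pages.setdefault(user, []).append(page)
        last_seen[user] = stamp
    sessions.extend(open_pages.values())  # Sessions still open at the end.
    return sessions


# Hypothetical clicks, invented for illustration only.
clicks = [
    ("user-1", datetime(2018, 2, 1, 10, 0), "Home"),
    ("user-1", datetime(2018, 2, 1, 10, 10), "News"),
    ("user-1", datetime(2018, 2, 1, 11, 0), "Registration"),
    ("user-2", datetime(2018, 2, 1, 10, 5), "Home"),
]
example_sessions = sessionize(clicks)
```

In this example the 50-minute pause before the Registration click splits the first user's activity into two sessions, and the second user contributes one more.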

4.1.4  Page View Categorization

Page views are categorized according to the tab of the website they belong to, so that all pages under the same tab are treated as one tab view.
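One simple way to implement such a categorization is a lookup from URL prefix to tab name, sketched below. The URL layout of the studied website is not given in the text, so every prefix in the mapping is hypothetical; only the tab names come from the results tables.

```python
from typing import Dict

# Hypothetical URL prefixes; the real site's URL layout is not given in the text.
TAB_PREFIXES: Dict[str, str] = {
    "/news": "News",
    "/info": "Info",
    "/registration": "Registration",
    "/program": "Educational Program",
}


def categorize(url: str) -> str:
    """Map a requested URL to the tab it belongs to."""
    path = url.split("?", 1)[0].lower()
    if path in ("/", "/index.html"):
        return "Home"
    for prefix, tab in TAB_PREFIXES.items():
        if path.startswith(prefix):
            return tab
    return "Other"  # Pages that match no known tab.
```

For example, under this assumed layout `categorize("/news/2018/open-day.html")` yields "News", so every article page counts toward the same tab view.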

4.1.5 Path Completion

This is the final step, which recovers the entire path of user access. After it, the data is ready to be processed by the FP-Growth algorithm to find the hidden patterns.

4.2 Pattern Discovery

The first step of the pattern discovery process was to build a relation matrix between tab views and sessions. Before the matrix was built, the data were transformed into a nominal type that shows “click” (1) or “no-click” (0). This transformation makes it quicker to scan the database and calculate the support and confidence values. The matrix makes it easier to recognize the relationships between tab views on the website, and it helps to track visitors’ behavior based on their search paths. The letters A to L identify the page views on the website. The relation matrix between page views and sessions is shown in the table below.

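Table I: Tab Views and Session Identification Matrix
Session    | A | B | C | D | E | F | G | H | I | J | K | L
Session 1  | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1
Session 2  | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1
Session 3  | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0
Session 4  | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
Session 5  | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0
Session 6  | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0
Session 7  | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
Session 8  | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0
Session 9  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0
Session 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1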

The FP-Growth algorithm was then applied to the relation matrix to discover patterns. The minimum support value was set first, and the associations between pages viewed by visitors were discovered afterward.
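The click/no-click transformation behind the relation matrix can be sketched in a few lines of Python. The example below rebuilds the first three rows of Table I from the sets of tabs visited in each session; the function names are hypothetical and the code is an illustration, not the study's implementation.

```python
from typing import Dict, List, Set

TABS = list("ABCDEFGHIJKL")  # Tab view labels A to L, as in Table I.


def to_click_matrix(sessions: Dict[str, Set[str]]) -> Dict[str, List[int]]:
    """Encode each session as a row of 1 (click) and 0 (no-click) per tab."""
    return {
        name: [1 if tab in visited else 0 for tab in TABS]
        for name, visited in sessions.items()
    }


def column_support(matrix: Dict[str, List[int]], tab: str) -> float:
    """Fraction of sessions in which the given tab was clicked."""
    col = TABS.index(tab)
    return sum(row[col] for row in matrix.values()) / len(matrix)


# The tabs clicked in the first three sessions of Table I.
sessions = {
    "Session 1": {"F", "H", "L"},
    "Session 2": {"B", "F", "H", "L"},
    "Session 3": {"F", "H"},
}
matrix = to_click_matrix(sessions)
```

Once the data is in this form, the support of a tab is simply the mean of its column, which is why the binary encoding makes the support and confidence calculations quick.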

Chapter 5

RESULTS AND DISCUSSION

Weblog data of the educational institution from February, March, and April were compared to extract hidden patterns in information search. The minimum support and minimum confidence, set early in the process, are 10% and 40% respectively. Various rules were found, as shown in Table II, Table III, and Table IV. Comparing the three months, some rules follow similar patterns, but many different patterns are also found each month. In February, there are three distinctive patterns, involving the three frequently visited tabs Home, Info, and News. The rules {Home, Info} => {News}, {News} => {Info}, and {News, Info} => {Home} mean that visitors tend to visit these tabs together in February.

Table II: Association Rules Using FP-Growth Algorithm in February
No | Premises | Conclusion | Support | Confidence | Lift
1 | Home, Info | News | 0.11 | 0.44 | 0.91
2 | Educational Program | Registration | 0.20 | 0.45 | 1.18
3 | News | Info | 0.23 | 0.48 | 1.11
4 | News, Info | Home | 0.11 | 0.48 | 0.77
5 | Registration | Home | 0.19 | 0.49 | 0.78
6 | Registration | Educational Program | 0.20 | 0.51 | 1.18
7 | Info | News | 0.23 | 0.52 | 1.08
8 | News | Home | 0.26 | 0.55 | 0.87
9 | Info | Home | 0.26 | 0.58 | 0.92
10 | Educational Program | Home | 0.26 | 0.60 | 0.95

These results suggest that visitors to the educational institution website are looking for information and news about the institution itself, because registration for new student enrollment opens in the following month. In March, visitors are still interested in finding updated news about the educational institution, as shown by the rule {Home} => {News}. In addition, the rule {Home, Registration} => {Educational Program} means that visitors who are interested in registration tend to view the Educational Program tab as well.

Table III: Association Rules Using FP-Growth Algorithm in March
No | Premises | Conclusion | Support | Confidence | Lift
1 | Home | News | 0.27 | 0.46 | 0.89
2 | Registration | Home | 0.21 | 0.49 | 0.82
3 | Educational Program | Registration | 0.21 | 0.49 | 1.15
4 | Registration | Educational Program | 0.21 | 0.49 | 1.15
5 | Educational Program, Registration | Home | 0.10 | 0.49 | 0.83
6 | Home, Registration | Educational Program | 0.10 | 0.50 | 1.17
7 | Info | Home | 0.20 | 0.51 | 0.86
8 | News | Home | 0.27 | 0.53 | 0.89
9 | Info | News | 0.22 | 0.56 | 1.10
10 | Educational Program | Home | 0.25 | 0.58 | 0.97

In April, the most searched information is about registration. The rules {Home} => {Registration} and {University Selection Process} => {Registration} show that visitors tend to proceed to registration after visiting the Home and University Selection Process tabs.

Table IV: Association Rules Using FP-Growth Algorithm in April
No | Premises | Conclusion | Support | Confidence | Lift
1 | Home | Registration | 0.33 | 0.50 | 0.95
2 | Info | Home | 0.12 | 0.51 | 0.77
3 | University Selection Process | Registration | 0.10 | 0.51 | 0.96
4 | Educational Program | Registration | 0.19 | 0.53 | 1.00
5 | News | Home | 0.24 | 0.54 | 0.83
6 | Info | News | 0.13 | 0.56 | 1.26
7 | Registration, Educational Program | Home | 0.11 | 0.56 | 0.86
8 | Registration | Home | 0.33 | 0.62 | 0.95
9 | Educational Program | Home | 0.23 | 0.63 | 0.96

The distinctive patterns for each month are shown in Table V.

Table V: The Distinctive Association Rules for Three Months
Month | Premises | Conclusion | Support | Confidence | Lift
February | Home, Info | News | 0.11 | 0.44 | 0.91
February | News | Info | 0.23 | 0.48 | 1.11
February | News, Info | Home | 0.11 | 0.48 | 0.77
March | Home | News | 0.27 | 0.46 | 0.89
March | Home, Registration | Educational Program | 0.10 | 0.50 | 1.17
April | Home | Registration | 0.33 | 0.50 | 0.95
April | University Selection Process | Registration | 0.10 | 0.51 | 0.96

From these results, the page visit trend can be identified. The educational institution can use the results to reveal hidden patterns and understand visitors’ behavior in accessing information through the website. It may be able to connect the most frequently accessed tab views as link recommendations, so that visitors can reach information on the website more easily. The same approach can be applied to any event or daily activity that should be published on the website. Ease of use reflects the quality and effectiveness of a website, because it helps visitors find information easily.

Chapter 6

CONCLUSION

This research applies pattern mining to the website of an educational institution's selection process, using association rules to discover interesting patterns in information searching. The FP-Growth algorithm was applied because of its efficiency in pattern discovery. Visitors’ weblog data from February, March, and April were processed and compared, as those three months have the highest visitor activity of the year. The results show that seven similar patterns occur across the three months. However, there are also some distinctive patterns of visitor behavior in each month, which may be due to the selection process agenda. Based on the resulting patterns, the educational institution can create connecting links or link recommendations between tab views that are strongly correlated, which will help visitors access and find information more easily on the selection process website.

REFERENCES

[1]. Rahmi Azitha, Isti Surjandari, and Enrico Laoh (Industrial Engineering Department), Mining Information Search Pattern on Website: A Case Study of Educational Institution, 2018 5th International Conference on Information Science and Control Engineering.

[2]. Yuanyuan Liao, The Application of Web Mining in Distance Education Platform, Proceedings of the 2nd International Symposium on Computer, Communication, Control and Automation (ISCCCA-13).

[3]. G. Neelima and Sireesha Rodda, An Overview on Web Usage Mining.

[4]. Wikipedia, URL: https://www.wikipedia.org/

[5]. IEEE Xplore, URL: https://ieeexplore.ieee.org/Xplore/home.jsp/

[6]. Towards Data Science, URL: https://towardsdatascience.com/

[7]. Google, URL: https://www.google.com/


Rabins Sharma Lamichhane
