MINING THE WEB FOR PLAGIARIZED WEB CONTENT USING SIMILARITY MEASURE TECHNIQUE

MATHIYALAGAN M

Abstract

Data mining and the World Wide Web are two important and active areas of current research. Their natural combination, sometimes referred to as Web mining, has been the focus of several recent research projects and papers. The heterogeneity and lack of structure that permeate many of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, and management of Web-based information difficult. As with any emerging research area, there is no established vocabulary, which leads to confusion when comparing research efforts: different terms for the same concept, and different definitions attached to the same word, are commonplace. The term Web mining has been used in two distinct ways. One of these, Web content mining, is the process of discovering and extracting knowledge and information from sources across the World Wide Web. Mirrored web pages are very common on the internet. Sometimes the contents of a web page are duplicated on another page without the knowledge or permission of the original page's owner. Finding such plagiarism across the vast internet is a challenging task. In this research we explore Web mining technology and a plagiarism detection paradigm for Web mining. A working prototype of the proposed system will be developed partly in C and partly in MATLAB. The C code will be integrated with the MATLAB code through the MATLAB MEX DLL interface. The performance of the system will be evaluated using suitable metrics.

Section
Articles