PARALLEL DECISION TREE ALGORITHM FOR MULTITEXT CLASSIFICATION BASED ON SPARK
Abstract
One of the most challenging issues in big data research is processing a large volume of information in a reasonable time. Hadoop and Spark are frameworks for distributed information processing. Hadoop is a well-known, standard platform for massive data processing. Because of its in-memory programming model, Spark, an open-source framework, is well suited to iterative algorithms. With the rapid growth of data volume and feature-space dimensionality in the big data setting, parallelizing traditional multitext classification algorithms can significantly improve their running efficiency. In this paper, the Hadoop and Spark frameworks, two distributed big data processing platforms, are evaluated and compared in terms of precision, accuracy, and recall. To this end, a parallel J48 pruned decision tree classification algorithm is implemented on datasets of different sizes within Spark. The results show that the parallel J48 pruned decision tree classification algorithm runs faster on Spark than on Hadoop. Evaluations also show that Hadoop uses more resources, such as CPU and network. It is concluded that Spark is more effective than Hadoop.