As the volume of data on the web grows, the web structure graph, which represents the web as a graph, evolves with it. The structure of this graph has shifted from content-based to non-content-based. In addition, spam data in the web structure graph, such as noisy hyperlinks, can degrade the speed and efficiency of information retrieval and link mining algorithms. Previous research in this field has concentrated on eliminating noisy hyperlinks with structural and string-based methods. However, these methods can mistakenly remove valuable links or fail to identify noisy hyperlinks in certain situations. In this paper, we first build a collection of hyperlinks using an interactive crawler. We then examine the semantic and relatedness structure of the hyperlinks using semantic web tools such as the DBpedia ontology. Noisy hyperlinks are then removed by running a reasoner over the DBpedia ontology. Our experiments demonstrate the accuracy and effectiveness of semantic web technologies in eliminating noisy hyperlinks.
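To make the idea of relatedness-based link filtering concrete, the following is a minimal sketch and not the system described in the paper. It assumes that the source and target page of each hyperlink have already been mapped to a class in an is-a hierarchy. The hierarchy `PARENT`, the path-based `relatedness` score, the `split_links` helper, and the threshold of 0.3 are all hypothetical choices made for illustration; the paper itself relies on a reasoner over the DBpedia ontology, which this toy example only stands in for.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical fragment of an is-a hierarchy (child -> parent).
# Illustrative only; it is not loaded from DBpedia and not taken from the paper.
PARENT: Dict[str, Optional[str]] = {
    "Thing": None,
    "Agent": "Thing",
    "Organisation": "Agent",
    "University": "Organisation",
    "Company": "Organisation",
    "Place": "Thing",
    "City": "Place",
    "Work": "Thing",
    "Software": "Work",
}

Link = Tuple[str, str]  # (class of source page, class of target page)


def ancestors(cls: str) -> List[str]:
    """Return the chain from cls up to the root, including cls itself."""
    chain: List[str] = []
    current: Optional[str] = cls
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def relatedness(a: str, b: str) -> float:
    """Path-based score: 1 / (1 + number of edges between a and b)."""
    up_a = ancestors(a)
    steps_from_b = {cls: i for i, cls in enumerate(ancestors(b))}
    # The first shared ancestor met while climbing from a is the lowest one.
    for steps_from_a, cls in enumerate(up_a):
        if cls in steps_from_b:
            return 1.0 / (1 + steps_from_a + steps_from_b[cls])
    return 0.0  # no shared ancestor (cannot happen with a single root)


def split_links(links: List[Link], threshold: float = 0.3) -> Tuple[List[Link], List[Link]]:
    """Separate links into (kept, noisy) using an assumed relatedness threshold."""
    kept: List[Link] = []
    noisy: List[Link] = []
    for link in links:
        target = kept if relatedness(*link) >= threshold else noisy
        target.append(link)
    return kept, noisy


links = [("University", "Company"), ("University", "Software"), ("City", "Place")]
kept, noisy = split_links(links)
print("kept:", kept)
print("noisy:", noisy)
```

In this sketch a link is flagged as noisy when the classes of its two endpoints lie far apart in the hierarchy. A real pipeline would replace the hand-written dictionary with classes and subclass relations obtained from the ontology, and the fixed threshold with whatever criterion the reasoner applies.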
Taghandiki, K., & Rezaei Ehsan, E. (2022). Implementation of a System for Removing Noisy Hyperlinks: A Semantic and Relatedness-Based Approach. Transactions on Machine Intelligence, 5(2), 122–138. doi:10.47176/TMI.2022.122