Finding the Potential Accepted Answer on Stack Overflow: a Text Mining Approach

Jamshidiyan Tehrani, M.; Arjomand, P.; Haghighat, S.

doi:10.47176/TMI.2021.238

Document Type : Original Article

Authors

M. Jamshidiyan Tehrani ¹

P. Arjomand ²

S. Haghighat ²

¹ Faculty of Informatics, Università della Svizzera Italiana, Lugano, Switzerland

² Department of Computer Engineering, Salman Farsi University of Kazerun, Taleghani, Kazerun, 73175-457, Fars, Iran

Abstract

Stack Overflow is a widely used, community-driven platform where developers seek help with programming problems. Users can post questions and receive multiple answers, but a significant portion of questions never receive an accepted answer. Without a clearly identified best answer, the original poster and future visitors can be left confused and must spend more time reading through numerous responses. To address this problem, we present a method for automatically identifying the most promising answer to a question that has no accepted answer. Our approach applies text mining techniques to extract 13 informative features from a dataset of 15,464 questions, 37,275 answers, and 72,025 comments. These features capture textual, structural, and user-related aspects of the posts. The extracted features are then used to train machine learning models that predict which answer is most likely to be accepted. The study considers only English-language content on Stack Overflow. The proposed method achieves an overall accuracy of 71% and an F1 score of 70%. These results suggest that automated answer recommendation can improve the user experience by reducing ambiguity and making information retrieval on Q&A platforms more efficient.
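As a rough illustration of the pipeline the abstract describes, the sketch below computes a handful of textual, structural, and user-related features for one answer and shows how accuracy and F1 are obtained from predicted labels. It is only a sketch under stated assumptions: the abstract does not name the 13 features, so the `Answer` type, the four features, and both function names are hypothetical, and the F1 shown is the plain binary F1 for the "accepted" class, which may differ from the averaging used in the paper.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Answer:
    """Hypothetical container for one Stack Overflow answer."""
    body: str               # answer text, code wrapped in <code>...</code>
    comments: list[str]     # comments posted under the answer
    author_reputation: int  # reputation of the answering user


def extract_features(answer: Answer) -> dict[str, float]:
    """Return a few illustrative features (NOT the paper's 13 features)."""
    # Strip code spans so that only natural-language words are counted.
    prose = re.sub(r"<code>.*?</code>", " ", answer.body, flags=re.S)
    words = re.findall(r"[A-Za-z']+", prose)
    return {
        "word_count": float(len(words)),                       # textual
        "code_blocks": float(answer.body.count("<code>")),     # structural
        "comment_count": float(len(answer.comments)),          # structural
        "author_reputation": float(answer.author_reputation),  # user-related
    }


def accuracy_and_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Accuracy and binary F1, where label 1 means 'accepted answer'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1
```

In a full pipeline, feature dictionaries like these would be turned into vectors and passed to a classifier, with the candidate answer receiving the highest predicted acceptance probability being recommended.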

Keywords

Stack Overflow

Data Mining

Text Mining

Machine Learning

Sentiment Analysis

Volume 4, Issue 4
Autumn 2021
Pages 238-244

Receive Date 17 June 2021
Revise Date 28 August 2021
Accept Date 23 December 2021

Transactions on Machine Intelligence
