Sometimes Being Wrong is the Right Answer
I just finished my second NLP project. Even though this project went much more smoothly than the first (if you follow the link, please wait until ‘go’ appears in the ‘The next word is predicted to be:’ box before entering text: LINK), it reminded me that I still have a lot to learn about this fascinating topic. My goal for this second project was to use the text of Stack Overflow questions to predict how many responses each question would receive. Needless to say, my predictions did not perform as well as planned. In spite of my efforts to clean my corpus, tweak the parameters of sklearn’s TF-IDF objects, and switch between multinomial naive Bayes and one-vs-rest logistic regression, the model never achieved more than 57% accuracy, with precision and recall scores below 50%.
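For readers who want to see the general shape of the setup, here is a minimal sketch of the modeling approach described above: TF-IDF features built from question text, fed to either multinomial naive Bayes or one-vs-rest logistic regression. The tiny toy corpus and answer counts below are hypothetical stand-ins for the real Stack Overflow data, and the parameter choices are illustrative rather than the ones I actually tuned.

```python
# Sketch: TF-IDF features from question text -> two candidate classifiers.
# The questions and answer_counts lists are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

questions = [
    "How do I reverse a list in Python?",
    "Why does my pointer dereference segfault?",
    "What is the difference between let and var in JavaScript?",
    "How can I merge two pandas DataFrames on multiple keys?",
]
answer_counts = [3, 1, 2, 1]  # target: number of answers each question received

# Multinomial naive Bayes on TF-IDF features.
nb_model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

# One-vs-rest logistic regression on the same features.
ovr_model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

new_question = ["How do I parse JSON in Python?"]
for name, model in [("multinomial NB", nb_model), ("one-vs-rest logistic", ovr_model)]:
    model.fit(questions, answer_counts)
    print(name, "->", model.predict(new_question))
```

In the real project the targets were of course far noisier than this toy example, which is part of why accuracy stalled where it did.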
So what went wrong? After examining the text corpus, there appear to be a lot of garbage words and character strings, which suggests that my efforts to clean the text may have enriched code fragments at the expense of plain English. There is also the possibility that 57% accuracy, with precision and recall below 50%, is the right answer.
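One way to test the code-fragment hypothesis would be to strip code-bearing tags from the raw HTML question bodies before tokenizing, so the TF-IDF vocabulary is built from plain English. This is a rough sketch of that idea, not the cleaning I actually ran; the sample HTML string is hypothetical.

```python
# Sketch: remove <code> and <pre> blocks from a question body before vectorizing.
import re
from bs4 import BeautifulSoup

def strip_code(html_body: str) -> str:
    soup = BeautifulSoup(html_body, "html.parser")
    # Remove code-bearing tags entirely rather than just unwrapping them.
    for tag in soup.find_all(["code", "pre"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()

sample = "<p>Why does this loop never end?</p><pre><code>while(1){i++;}</code></pre>"
print(strip_code(sample))  # -> "Why does this loop never end?"
```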
To paraphrase Curt Carlson, “always avoid BIG ‘A’”: you must always be open to the possibility that your approach (the BIG A) is wrong, or is not the best way to solve the problem or answer the question posed. The question I posed was: “Is it possible to predict the number of responses to a Stack Overflow question based on the question text?” Maybe the correct answer is that you can’t.