When life gives you lemons: Analyzing negative reviews to improve your mobile app
August 21, 2019 • 6 min read
Your product is good. Your mobile app has great features, but your app store rating is low. What happened?
In a previous blog post we demonstrated that high crash rates can cause low app ratings. In this blog, we explain how we analyzed negative reviews to determine what motivated customers to write them.
We collected the most relevant negative app reviews (2 stars or fewer) from six major US retailers on one of the major mobile app platforms. Then we employed natural language processing to analyze the reviews and distill the specific topics behind the complaints. We carried out this topic modeling phase with Latent Dirichlet Allocation (LDA) to understand how negative reviews are distributed across the common topics users tended to bemoan.
So let’s dive into the details on how we did our negative review analysis with NLP.
Break down sentences and lemmatize everything
The first step in topic modeling is to convert sentences to individual words and phrases. These words and phrases will then be fed to the model to generate common topics. Typically, sentence breakdown consists of two parts: lemmatization, and removal of stop words and punctuation.
One key feature of human language (English in particular) is that a single word can have a number of derived forms. For instance, “I ran five miles yesterday”, “I will run five miles tomorrow”, “She is running”, and “He runs” all convey the same action. English changes the form of the word “run” to indicate when the event takes place and who performs the action. To a computer, all variations of “run” should point to the same action, i.e. “run”. In this case, “run” is the lemma of “runs”, “ran”, “running”, etc., and the process of converting derived words to their lemma is called lemmatization. Fortunately, Python’s NLTK provides a WordNetLemmatizer that uses the WordNet corpus to look up the lemma of each word.
Having lemmatized the reviews, we next remove stop words: words that occur frequently but contribute little meaning to the sentence, such as “and”, “but”, “as”, “whom”, and “at”. We generally omit punctuation as well before model fitting. What remains is mostly verbs and adjectives, which carry proportionally far more meaning than the raw text did.
Below is a code snippet that demonstrates lemmatization and removal of stop words and punctuation:
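The sketch below is illustrative rather than production code: to stay self-contained (no corpus downloads), it substitutes a tiny hand-rolled lemma table and stop-word set for NLTK’s WordNetLemmatizer and stop-word corpus, which you would use in practice as described above.

```python
import string

# Stand-ins for NLTK's WordNetLemmatizer lookups and stop-word corpus;
# these tiny tables just make the sketch runnable on its own.
LEMMAS = {"ran": "run", "runs": "run", "running": "run", "miles": "mile"}
STOP_WORDS = {"i", "will", "is", "he", "she", "and", "but", "as", "whom", "at", "the"}

def preprocess(review: str) -> list[str]:
    """Lowercase, strip punctuation, lemmatize, and drop stop words."""
    # Remove punctuation characters, then split on whitespace.
    cleaned = review.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    # Map each token to its lemma, falling back to the token itself.
    lemmatized = [LEMMAS.get(tok, tok) for tok in tokens]
    # Discard stop words, keeping the content-bearing terms.
    return [tok for tok in lemmatized if tok not in STOP_WORDS]

print(preprocess("I ran five miles yesterday!"))
# -> ['run', 'five', 'mile', 'yesterday']
```

With the real WordNetLemmatizer in place of the `LEMMAS` table, the same function shape scales to the full review corpus.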
Time to fit the model
Now, we are ready to conduct topic modeling with Latent Dirichlet Allocation. This can be done with the help of the Gensim library available in Python. There are several crucial steps in LDA as follows:
- Create a dictionary from the processed review data. A dictionary is an aggregation of all the words from a collection of text.
- Convert the dictionary to a bag-of-words corpus and save both the dictionary and the corpus for analysis. A bag-of-words model is a simplified representation that reduces a collection of text to the unique words it contains along with the frequencies (multiplicities) with which they appear. For instance, “I like running. You like running.” becomes {“I”: 1, “you”: 1, “like”: 2, “running”: 2}. The bag-of-words model is ideal for analysis because it can easily be represented in matrix form: {“I”: 1, “you”: 1, “like”: 2, “running”: 2} becomes [1, 1, 2, 2], with each entry corresponding to a unique word in the dictionary in a known order. If desired, one can easily combine different bag-of-words corpora by carrying out matrix operations.
- Run LDA and extract topics. Here, we extract the top 20 topics from all the negative user reviews. For instance, the first topic, topic 0, consists of the keywords “app”, “card”, “pay”, “can’t”, “get”, “make”, which suggests users are having difficulty making online payments with a debit/credit card in the mobile application. Similarly, topic 2, with “slow”, “app”, “very”, “make”, “app”, “crash”, indicates sub-par app performance and a high crash rate. Since these topics are auto-generated, some are less intuitive than others. This can be remedied by tuning hyper-parameters such as the number of topics and the number of training iterations. In any case, we get a rough idea of the main topics that unsatisfied users frequently bring up in app reviews.
It is possible to zero in on a specific user review, and LDA can classify that piece of review as one of the many topics generated. For instance, one user review reads:
“inputed[sic] first Macy’s card in and no issues to pay or check balance. once card was upgraded it would not show any balance, or able to mqke[sic] payment.”
LDA classifies this review as Dominant_Topic 0.0 (the aforementioned credit card issues) with 78% certainty. Manually spot-checking reviews like this gives a rough idea of how well the algorithm performs.
Results: The crash rate is key, among other things…
According to our results from topic modeling, 64.4% of all negative reviews fall into three broad categories:
- Crashes / Slow response
- Buggy checkout experience
- Missing items in shopping cart
Here is a closer look at each of these issues:
Crashes and slow response time accounted for 29.3% of all negative reviews. Obviously, crash rates should be as low as technically possible. Psychologically, a negative review carries more weight than a positive one, so the industry-accepted 1% crash rate may still be too high for large enterprises. We generally work with customers to keep the crash rate at 0.2% or lower, all but eliminating crashes and protecting app ratings.
The checkout process generated 18.4% of the negative reviews. This includes long and complicated checkout flows, a bug-ridden payment experience, and so on. Users become extremely frustrated when they have selected the items they want only to find they cannot smoothly and successfully complete the transaction. Features like a smooth, one-page checkout not only close the proverbial circle for potential customers but also reflect the brand image and the company’s attention to detail.
Shopping cart issues accounted for 16.7% of negative reviews. Unsatisfied users report disappearing shopping cart items, issues with non-inventory items, and so on. According to the Baymard Institute, around 70% of virtual carts are abandoned before purchase due to unexpected taxes and fees, forced account creation, or an overly complicated checkout process.
Conclusion
Negative feedback occurs on all sites. This investigation used data from several large, national retailers, but the results are likely typical of most mobile retailers. Retailers should conduct a detailed analysis of their own negative reviews, using the techniques described in this blog post, to find the root causes. This “free QA” is particularly valuable because it is unsolicited and the writers expect nothing in return; it tends to represent the “naked truth”. Fixing the issues discovered in such an analysis addresses users’ concerns head-on, ultimately leading to higher app ratings, a salvaged brand reputation, and increased revenue.
Limitations and future improvements
Topic modeling with Latent Dirichlet Allocation is a powerful tool for uncovering hidden commonalities in a vast swath of information. It identifies hidden topics and relates each sentence to the topic it is most closely associated with. In this study, we condensed over 4,000 negative reviews into eight categories. It is important to note, however, that LDA does not take the correlation between topics into account. For instance, we might insist that discounts and promotions be grouped with the payment process: after all, discounts are applied at the time of payment, so the two are highly correlated. However, there is no readily available way for LDA to recognize that correlation.
Additionally, the bag-of-words model on which LDA operates concerns only unique words and their occurrences; it cannot capture higher-level structure such as word order or semantics. Lastly, LDA is an unsupervised algorithm, which may not be the best option for training and evaluation on a tagged (labeled) dataset.
We hope to revisit mobile application reviews using other natural language processing algorithms in the future to better parse and understand users’ intent with each review. In any case, we hope our current study proves illuminating for you and your business so that you can reach more users with the help of mobile applications and boost your mobile revenue in the years to come.
For assistance conducting analysis on your site, please contact Grid Dynamics.