Introduction
Even in the "post-Covid" world, where brick-and-mortar shopping has once again become widely available, e-commerce continues to boom and consumers continue to set new records for online spending [1]. This research uses all Reddit activity data, including posts and comments on posts, from January 2021 to August 2022 to extract valuable business insights for Glossier, a cosmetics brand that operates primarily in the e-commerce space. Founded in 2014, Glossier is a direct-to-consumer (DTC) company headquartered in New York that made a strong entrance into the beauty industry. Within a month of launching, Glossier had 60,000 names on a waitlist for its nine products [2]. In March 2019, Glossier closed a $100 million Series D funding round that valued the company at $1.2 billion. Figure 1 displays important milestones in the company’s history and emphasises its rapid growth over a relatively short period of time.
Figure 1
Glossier is a DTC company, meaning it stays in complete control of its relationship with customers with no third-party intermediaries. “In beauty, it’s really important to look at the products that are used together,” says Andrew Stephen, L’Oréal professor of marketing at the University of Oxford’s Saïd Business School. “The bundles that are used to put a look together become really important for consumer insights, and a DTC model tells you that right off the bat.” The control Glossier has had over its own image and customer relationship has made it a staple amongst millennial women and an interesting business case for our team.
The goal of this project is to produce new insights on Glossier's customers and to identify actionable items that could increase revenue directly or indirectly. Ten granular business questions, grouped by consumer, demand, product, and competition, are explored and answered with NLP and ML techniques in Spark on Azure Databricks. These business questions are included in the appendix below. This project is important because it serves as an example of analysis for companies that want to gain market share without relying on customer tracking data purchased from third parties. Most importantly, it will give Glossier insights for choosing launch and marketing strategies during its period of high growth and physical presence expansion.
The signature strategies that made Glossier special (its DTC, digital-first business model, robust social media, diverse models, chic packaging, and concept stores) are now considered commonplace in the beauty industry, especially as the pandemic pushed DTC and e-commerce forward. Newer popular brands like Drunk Elephant, Glow Recipe, Milk, Lilah B, RMS Makeup, Make, Versed, and Pixi are also DTC, millennial-focused “no makeup makeup” brands. In addition, Glossier is far from being the only beauty unicorn. In 2018, a third of female-owned unicorns were in beauty, including Pat McGrath Labs, Kylie Jenner’s Kylie Cosmetics, and Huda Kattan’s Huda Beauty [3]. Because Glossier has taken a nontraditional route to sell its products and has been in the beauty industry for some time, consistently looking for ways to stay relevant is vital for growth in such a competitive industry. Figure 2 below was created with data from Social Blade, a public database that tracks user statistics for YouTube, Twitch, Instagram, and Twitter. As Glossier operates almost entirely online, its presence on social media platforms is very important. The figure shows that its following has recently declined and then stagnated, which is all the more reason to look for new growth opportunities.
Figure 2
Business Questions Answered
Business Goal 1: Should Glossier focus on store expansion? If so, where should Glossier open their next store?
We will gather Twitter user data, r/Glossier posts and comments, and Reddit posts that fall outside the date range of the parquet files in order to determine where users are located. Then, we will merge the data with a geospatial dataset and count the number of posts by sentiment for each geospatial area. We will represent these counts on a choropleth map to inform executives of potential areas for store expansion.
Business Goal 2: What products should be in the newest Glossier kit?
We will use parsing techniques to identify which posts and comments in the Glossier subreddit mention Glossier products. We will conduct sentiment analysis on each post to label it positive, negative, or neutral. For positive posts, we will sum the number of posts and comments by product and identify the 10 products with the highest activity. We will do the same for negative posts. We will display the two rankings on separate charts so that products can be compared within each sentiment group, which will show executive audiences which products should be promoted and which should potentially be discontinued.
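The parse, label, and rank steps described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark. The product list is a small illustrative subset, and the word-list sentiment labeler is a toy stand-in for whichever sentiment model the project uses. All function names and example posts are hypothetical.

```python
import re
from collections import Counter

# Illustrative subset of product names; a real run would load the full catalog.
PRODUCTS = ["boy brow", "cloud paint", "balm dotcom"]

# Toy sentiment lexicon (assumption): stands in for a real sentiment model.
POSITIVE = {"love", "great", "favorite"}
NEGATIVE = {"hate", "broke", "greasy"}


def label_sentiment(text):
    """Label text by counting lexicon hits; a tie (including zero hits) is neutral."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def find_products(text):
    """Return every known product name that appears as a whole phrase in the text."""
    lowered = text.lower()
    return [p for p in PRODUCTS if re.search(r"\b" + re.escape(p) + r"\b", lowered)]


def top_products(posts, sentiment, n=10):
    """Count product mentions among posts with the given sentiment label."""
    counts = Counter()
    for text in posts:
        if label_sentiment(text) == sentiment:
            counts.update(find_products(text))
    return counts.most_common(n)


# Made-up example posts.
posts = [
    "I love Boy Brow, it is my favorite",
    "Cloud Paint is great and so is Boy Brow",
    "I hate how greasy Balm Dotcom feels",
    "Has anyone tried Cloud Paint?",
]
top_positive = top_products(posts, "positive")
top_negative = top_products(posts, "negative")
```

A post that mentions two products counts once toward each, so a single enthusiastic post about a routine can lift several products at once. Whether that is desirable depends on how the rankings will be read.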
Business Goal 3: Which products are most common amongst competitors (Sephora, Ulta, and Fenty)? Does this open up an opportunity to capitalize on market share?
For two of Glossier’s competitors, we will use parsing techniques to identify which posts and comments in the competitor subreddits mention the products identified from Glossier’s website. As in the business goal above, we will conduct sentiment analysis on each post to label it positive, negative, or neutral. For positive posts, we will sum the number of posts and comments by product and identify the 10 products with the highest activity. We will do the same for negative posts. We will display this information on a grouped bar chart to show the worst-performing and best-performing products of our competitors.
Business Goal 4: At what point in the year will demand for Glossier be highest? How does this compare with competitors?
We will use a SARIMA model to predict the number of submissions and comments to the Glossier subreddit, which serves as a proxy for demand. We will difference the data to induce stationarity and select the model with the lowest AIC as the optimal model. The same process will be applied to the Ulta subreddit submissions and comments in order to obtain a comparison with a competitor.
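The two mechanical steps named above, differencing and AIC-based selection, can be illustrated without a modeling library. This is a sketch under stated assumptions: the series is synthetic, the weekly period of 7 is chosen only for the example, and the candidate log-likelihoods are made-up numbers standing in for the output of a real SARIMA fit.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag].

    lag=1 is ordinary first differencing; lag=s is seasonal differencing
    with period s.
    """
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Synthetic daily counts: linear trend plus a weekly spike (made-up data).
counts = [10 + t + (5 if t % 7 == 0 else 0) for t in range(28)]

# Seasonal differencing removes the weekly spike and turns the trend into a constant.
seasonal = difference(counts, lag=7)
# First differencing then removes the remaining constant level.
stationary = difference(seasonal, lag=1)

# Hypothetical fitted candidates: (label, log-likelihood, number of parameters).
candidates = [
    ("SARIMA(1,1,1)(0,1,1,7)", -120.4, 4),
    ("SARIMA(2,1,1)(1,1,1,7)", -119.9, 6),
    ("SARIMA(0,1,1)(0,1,1,7)", -123.0, 3),
]
best = min(candidates, key=lambda c: aic(c[1], c[2]))
```

In this example the second candidate has the highest log-likelihood but loses on AIC because of its two extra parameters, which is the trade-off the criterion is meant to capture.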
Business Goal 5: How has demand for Glossier been affected by COVID-19 rates? Can we get an understanding of what the relationship between the two may look like for the future?
To accomplish this, we will join external COVID-19 rate data to the Glossier subreddit activity data by day. As above, we will count the number of posts and comments that mention Glossier each day. We will also aggregate the total COVID-19 cases by day. To measure the effect of COVID-19 rates on demand, we will develop a multivariate time series ML model that forecasts disease rates in conjunction with demand. This information will also be depicted on line charts so that the relationships and patterns between the two variables, and their forecasts over time, are easy to see.
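The daily aggregation and join that precede the modeling step might look like the following. It is a plain-Python sketch, not the project's Spark code, and it stops before the forecasting model. It assumes each Reddit record carries a Unix timestamp (called `created_utc` here) and that the COVID-19 data is already keyed by ISO date; the case counts are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(timestamps):
    """Count records per UTC calendar day from Unix timestamps."""
    days = (
        datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        for ts in timestamps
    )
    return Counter(days)


def join_by_day(activity, cases):
    """Inner join on the date key; each row is (day, activity_count, case_count)."""
    return [
        (day, activity[day], cases[day])
        for day in sorted(activity)
        if day in cases
    ]


# Assumed timestamp field from the Reddit data (four example records).
created_utc = [1609459200, 1609462800, 1609545600, 1609718400]

# Made-up daily case counts keyed by ISO date.
covid_cases = {"2021-01-01": 100, "2021-01-02": 120, "2021-01-03": 90}

activity = daily_counts(created_utc)
joined = join_by_day(activity, covid_cases)
```

An inner join drops days that are missing from either source. For a time series model it may be preferable to keep every calendar day and fill gaps explicitly, since silent gaps change the spacing between observations.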
Business Goal 6: What is the average persona of a Glossier customer? In other words, what does our consumer base look like?
For users who post in the Glossier, Ulta, Fenty, and Sephora subreddits, we will measure their total activity by joining data from other subreddits and counting the number of other subreddits these users post to. We will also compute the average score of the posts for each user. We will then use NLP techniques to assign a sentiment score to each post. Finally, we will average the sentiment score, post score, and activity for each brand to glean what the average persona looks like across competitors. The output will be depicted in bar charts to contrast competitor information.
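The final averaging step can be sketched as a group-by over per-user rows. The example below is plain Python with made-up values; the row layout and the field names (`avg_score`, `avg_sentiment`, `avg_activity`) are hypothetical and only illustrate the shape of the aggregation.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-user rows: (brand, user, post_score, sentiment, other_subreddits).
rows = [
    ("Glossier", "u1", 10, 0.5, 4),
    ("Glossier", "u2", 20, 0.25, 6),
    ("Ulta", "u3", 5, -0.25, 2),
    ("Ulta", "u4", 15, 0.75, 8),
]


def persona_by_brand(rows):
    """Average post score, sentiment, and activity across users of each brand."""
    grouped = defaultdict(list)
    for brand, _user, score, sentiment, activity in rows:
        grouped[brand].append((score, sentiment, activity))
    return {
        brand: {
            "avg_score": mean(v[0] for v in values),
            "avg_sentiment": mean(v[1] for v in values),
            "avg_activity": mean(v[2] for v in values),
        }
        for brand, values in grouped.items()
    }


personas = persona_by_brand(rows)
```

Averages are sensitive to a few very active users, so a median or a trimmed mean may describe the typical customer better if the activity distribution turns out to be heavily skewed.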
Business Goal 7: Are user interactions on Reddit predictive of their overall sentiment towards the brand?
Using total activity, score, whether or not the post was stickied, whether or not the post was edited, and the sentiment variables, we will develop an ML classification model for sentiment prediction. To attain the best accuracy, several models (random forest, decision tree, and logistic regression) will be developed. Each model will be trained on 80% of the dataset, and predictions will be run on a held-out test set. Each model will be tuned to obtain optimal hyperparameters. The accuracy of each model will be visualized to show technical executives how well sentiment can be predicted.
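The evaluation protocol described above (an 80/20 split followed by an accuracy comparison) can be sketched independently of the classifiers themselves. In the plain-Python example below, a majority-class baseline stands in for the random forest, decision tree, and logistic regression models. The data, the seed, and all function names are assumptions for illustration.

```python
import random
from collections import Counter


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows with a fixed seed and cut at train_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_label(labels):
    """Most frequent label; predicting it everywhere gives a baseline to beat."""
    return Counter(labels).most_common(1)[0][0]


# Made-up rows: ((total_activity, edited), sentiment_label).
rows = [((i, i % 3 == 0), "positive" if i % 4 else "negative") for i in range(20)]

train, test = train_test_split(rows)
baseline = majority_label([label for _, label in train])
baseline_accuracy = accuracy([label for _, label in test], [baseline] * len(test))
```

Reporting each model's accuracy next to this baseline makes it clear whether the interaction features add anything beyond the class balance, which matters when one sentiment label dominates the data.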
Business Goal 8: Can user interactions predict the exact sentiment of a user towards Glossier and how can this inform rollout strategy?
We will use ML to develop a linear regression model and two separate random forest regression models, each with a different parameter set, to predict the average user sentiment score. For this goal, we will calculate a compound sentiment score for each post. A custom sentiment score will be derived from the sentiment categories in the previous goal. The average score will then be computed and attributed to each user. The models will predict a user's sentiment score from total activity, score, whether or not the post was stickied, and whether or not the post was edited. We will run predictions on a test set and tune the hyperparameters to achieve the best predictive performance. If we can accurately predict sentiment score, we can experiment with more targeted marketing strategies and gauge their incremental impact.
Business Goal 9: Are Google Trends searches predictive of daily user sentiment?
To accomplish this, we will join a customized Google Trends search dataset with the Glossier competitor dataset. To gather user sentiment, we will conduct NLP analysis to obtain a daily sentiment score broken down by brand. This score will then be joined with the daily Google Trends data to prepare it for modeling. To assess whether Google Trends searches are predictive of daily user sentiment, both linear regression and random forest models will be created.
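The linear regression half of this goal reduces, in its simplest form, to fitting one line through daily (search interest, sentiment) pairs and checking how much variance it explains. The sketch below uses ordinary least squares with a single predictor in plain Python. The values are made up, and the dictionary names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up daily values keyed by ISO date.
trend_interest = {"2022-01-01": 40, "2022-01-02": 50, "2022-01-03": 60,
                  "2022-01-04": 70, "2022-01-05": 80}
daily_sentiment = {"2022-01-01": 0.1, "2022-01-02": 0.3, "2022-01-03": 0.2,
                   "2022-01-04": 0.4}

# Keep only the days present in both sources.
days = sorted(set(trend_interest) & set(daily_sentiment))
x = [trend_interest[d] for d in days]
y = [daily_sentiment[d] for d in days]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
```

An in-sample R-squared like this one only describes fit. To support a claim that searches are predictive, the same metric would need to be computed on days held out from fitting.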
Business Goal 10: Is sustainability top of mind for Glossier customers? Should Glossier focus its efforts on making its products and practices more sustainable?
As eco-friendly options in the e-commerce space and the conversation around sustainability have grown over the past few years, we hope to gauge how important environmentally conscious practices and products are to the Glossier consumer. To gain a better understanding of this, we will perform topic modeling on posts and comments in the Glossier subreddit. We plan to use latent Dirichlet allocation (LDA) to group these posts by topic. We may discover that sustainability is not a top priority for our customers, but we might instead gain insights into which topics are top of mind for them (new products, store expansion, etc.).
Works Cited
1. Sophia, Deborah. “U.S. Black Friday Online Sales Hit Record $9 Bln despite High Inflation- Adobe Analytics.” Reuters, Thomson Reuters, 26 Nov. 2022, https://www.reuters.com/business/retail-consumer/adobe-says-black-friday-online-sales-hit-record-9-bln-2022-11-26/.
2. Goldfine, Jael. “How Glossier Went from Makeup Blog to Industry-Changing DTC Superstar.” The Business of Business, Thinknum, 3 May 2021, https://www.businessofbusiness.com/articles/history-of-glossier-dtc-beauty-makeup/.
3. Turk, Victoria. “How Glossier Turned Itself into a Billion-Dollar Beauty Brand.” WIRED UK, 6 Feb. 2020, https://www.wired.co.uk/article/how-to-build-a-brand-glossier.