In the following part, the words ‘cluster’ and ‘topic’ are used interchangeably as one cluster is represented by one topic here.
After selecting quotes using manual keywords we tried unsupervised clustering methods. These methods are promising since they would mean it’s an algorithm that finds the topics and groups up your quotes, therefore potentially removing human bias from the selection of the topics.
The tool that has been used is BERTopic. Stop-words are removed from the quotes because they represent a large number of the words and might take over more important information in the quotes. First, the documents (‘quotations’) in our case are embedded using sentence-BERT, UMAP is then used to reduce the dimensionality of the embeddings. Next, HDBSCAN clusters the quotations into topics using the reduced representation. Finally a modified version of TF-IDF is used on each cluster, combined with MMR (Maximal Marginal Relevance) to choose which words in the cluster represent the cluster, technically the topic, the best.
The main drawback of this method is the stochastic nature of UMAP dimensionality reduction. This means that slightly different clusters were offered at each run, but most of the clusters remained the same. We rand our analysis and saved the model, topics, clusters and keywords to disk to continue our analyses.
Initially, the model proposed approximately 300 topics. Since this would be very hard to work with, these were reduced to 35. A high enough number is taken such that clusters that were far enough from each other wouldn’t be forcibly regrouped. The resulting topics are observed in the bottom two figures :
Fig. Visualisation of the distance between the topics after dimension reduction using UMAP link to interactive plot : link
Fig. Topics containing the most quotations and the words representing them link to interactive plot : link The value is a probability indicationg how likely the word is to represent the topic cluster
An important thing to mention is that there is one very large cluster ( ~30000 quotes, knowing that we have ~55000 quotes for this analysis) containing outliers mainly represented by the keywords women
sexual
harassment
rights
. We accept the behavior of our tool and focus on the remaining quotes that are in clusters.
Taking a more focused look in the topics :
Multiple interesting topics appear, for example the 16th topics seems to be about the me too movement, the 18th topic seems to be about a sexual harassments scandal involving Bill Oreilly link le 20th topic seems to represent a topic containing matters about indigenous people and kidnappings.
Example quotes from the 19th category :
Our ancestors, the matriarchs, were the speakers, the keepers of ceremony, and our oral history. As a young person, an activist talking about women’s rights or about murdered and missing indigenous women, hip-hop has been the best venue to connect with not only my peers and young people, but also the greater public that may have barriers to listening to the stories of First Nations’ indigenous people
our indigenous population is not a haven of gender equality
We can see that the quotes still seem relevant to the initial keyword selection as the gender equality
keyword pops up in one the quotes.
We take some of the most discussed topics and describe what we observe in them :
Outliers
Around half of our quotes are considered outliers. It seems reasonable for certain quotes. However, this is still quite a large percentage and certain quotes should have made it into a topic. For instance, we can find numerous quotes talking about gender equality or sexual harassment which reprensent respectively topics 9 and 10. We are unsure why we have such a large amount of outliers but other methods of clustering such as Fast Clustering gave us similar results. Nevertheless, we are able to cluster some of the data and fins topics.
Child care
This our first topic, which contains the most quotes, around 7000. The main subject of this topic is child care, family and its quality of life. Having the largest topic be child care when talking about women’s rights may be surprising as it does not concern them directly. The main issue raised in these quotes is that child care is unaffordable. This is a real issue for parents in general, but mostly mothers who sometimes stay home to raise their children. At the same time, it is reported that child care workers barely make a living from their job.
Equal pay
This our second topic, composed of approximately 1300 quotes. The main subject of this topic is equal pay and work. A more obvious topic regarding feminonationalism, it is often reported that women are paid less for being… women, but also for being mothers and having to care for their children. Therefore, child care is often brought up in these quotes as well as paid leave for parents.
Is there a cluster representing femonationalist quotes?
A possible candidate cluster is cluster 14 regarding muslim, islam and the oppression of women. We will further talk about this cluster in the last blog post of our website