The semantic architecture of social unrest

Jun 18, 2019

Text mining the Yellow Jacket forum database.


  • The Yellow Jackets are a spectacular countrywide social protest movement that was active for months in 2018/2019;
  • As a reaction to the protests, the French government organised a nationwide public debate to discuss important political issues;
  • The protesters replied by launching their own online platform, where anyone could propose free-form demands or criticism and put it up to vote;
  • This is a semantic analysis of that database.

The main outputs of this analysis were navigation tools organised by broad social theme. One example, which covers “Democracy and Institutions”, is available here.

The full analysis pipeline, the remaining tools, and the documentation (in French) are available on my GitHub page.


For a couple of months in winter 2018-2019, the Yellow Jacket protests rocked the French political landscape. Composed mostly of working-class protesters from disinherited areas of peri-urban and rural France, the country-wide demonstrations took place every Saturday, forcing the government to back down on several reforms, and leading to an exercise in organised grassroots democracy dubbed the “Great National Debate”.

The protests organised organically on social media, had no clearly defined leader, and covered a very broad range of demands. Media struggled to cover the events, focusing either on clashes with police and property damage on the margins of the protests, or on a narrow set of “representative” demands extracted from protesters through street interviews. Since all attempts by the government to organise roundtables with organisers failed, no serious synthesis of the protesters’ demands was ever undertaken.

Mindful of the scale of the dissent, the government moved to clarify popular sentiment by organising what it called the “Great National Debate”.


The full analysis protocol is documented on GitHub in the form of Jupyter notebooks. Unfortunately, they are mostly in French, as I had to document the work for my French-speaking client. Here are the main steps in a nutshell:

Preliminary analysis

This was my first contact with text mining, so there was a lot of trial and error, exploratory analyses, and reading going on in the first few days of the project. But that allowed me to get a good grasp of the basic structure of the data:


In natural languages, words rarely keep a single form. Verbs, for example, are conjugated (he was, I am) without changing the fundamental meaning of the word. The base form, called a lemma, is crucial for grouping words by meaning in text analysis. The degree of this morphological variability differs greatly between languages. Compared to English words, French ones are highly variable: French is a gendered language, in which most adjectives change their ending depending on the gender and number of the noun they qualify. In linguistic terms, English is a largely analytic language, where relationships between words are mostly conveyed by small helper words (with, her, will, would, shall), whereas French is a synthetic fusional language, where words change their form to indicate these relationships.

In this context, lemmatisation (finding the correct lemma for each word) is a fascinating problem, and one that often cannot be solved exactly because of the structure of language itself. For example, the French word “sort” can be either a conjugated verb (he/she/it exits) or a noun (fate). Without studying the context, it is impossible to know whether it is better to lemmatise it to “sortir” (to exit) or to leave it as is. When in doubt, the lemmatiser will report all equally probable lemmas. Lemmatisation can be further informed by part-of-speech (POS) annotation, which determines the grammatical class of a word. For example, “sort”, if assigned the “verb” class by a POS tagger, has a single unambiguous lemma. There has been recent research into so-called context-aware lemmatisation, which draws on the wider sentence to achieve better performance (unsurprisingly, using neural networks).
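To make the “sort” example concrete, here is a minimal sketch of a dictionary-based lemmatiser in Python. It is not the lemmatiser used in the project: the tiny `LEXICON`, the tag names and the `lemmatise` function are all hypothetical and hand-written for illustration. It only shows how a context-free lookup has to return every candidate lemma, and how a POS tag narrows the choice down.

```python
# Toy lexicon: surface form -> {part-of-speech tag: lemma}.
# Hand-written for illustration; a real lexicon has hundreds of thousands of entries.
LEXICON = {
    "sort": {"VERB": "sortir", "NOUN": "sort"},
    "sortons": {"VERB": "sortir"},
    "sorts": {"NOUN": "sort"},
}


def lemmatise(word, pos=None):
    """Return the sorted list of candidate lemmas for `word`.

    Without a POS tag, every lemma known for the form is returned.
    With a POS tag found in the lexicon, only the matching lemma is returned.
    """
    form = word.lower()
    entries = LEXICON.get(form)
    if entries is None:
        # Unseen word: fall back to the surface form itself.
        return [form]
    if pos is not None and pos in entries:
        return [entries[pos]]
    return sorted(set(entries.values()))


print(lemmatise("sort"))          # ambiguous without context
print(lemmatise("sort", "VERB"))  # the POS tag resolves the ambiguity
print(lemmatise("soz"))           # unseen word is returned unchanged
```

The fallback for unseen words is the weak point of this kind of lookup: anything missing from the lexicon simply passes through untouched.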


In addition to better lemmatisation of known words, context-aware lemmatisers also produce better results for unseen words (particularly common in Internet corpora), such as “soz” instead of “sorry”, or spelling mistakes (e.g. “avaliable”). As in many disciplines, there are two approaches to preparing a text: the traditional one, where you try to account for every source of noise before feeding a relatively rigid algorithm, or the modern one, where you feed your whole dataset into a very flexible one, such as a neural network.

It may still be a good idea to reduce the noise in your dataset anyway, and one of the easiest ways to do so is to spell-check it. Context-unaware spell checkers suffer from the same limitations as lemmatisers. For example, “hel” is equally likely to be a misspelling of “hell” and “help”.
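The “hel” ambiguity can be reproduced with a minimal sketch of a context-unaware spell checker. This is an illustration under stated assumptions, not the spell checker used in the project: the four-word `VOCABULARY` is made up, and the functions `edits1` and `candidates` are hypothetical names. The approach is the classic one of generating every string one edit away from the input and keeping those found in the vocabulary.

```python
import string

# Made-up vocabulary for illustration; a real checker would load a full word list.
VOCABULARY = {"hell", "help", "hello", "available"}


def edits1(word):
    """Return all strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """Return the sorted known words that are at most one edit away from `word`."""
    form = word.lower()
    if form in VOCABULARY:
        return [form]
    return sorted(edits1(form) & VOCABULARY)


print(candidates("hel"))        # two equally plausible corrections
print(candidates("avaliable"))  # a single transposition away from a known word
print(candidates("soz"))        # slang with no close match in the vocabulary
```

Without word frequencies or sentence context, nothing in this sketch can rank “hell” above “help”, and that gap is exactly what context-aware methods try to fill.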
