Stylometrics is the idea that your writing style can be quantified, even fingerprinted. The way you turn a phrase, your average word and sentence length, and the vocabulary you use all contribute to a measurable profile of your writing style. In terms of OPSEC, stylometrics is an adversarial concern, and one that can be difficult to defend against.
One of the operational security measures Edward Snowden employed was to write only short bits of information: small paragraphs, a few sentences, but never long prose. This was, in part, because he was concerned that no matter what anonymity systems or security tactics he used, an adversary with enough power who was able to extract the plain text of one of his messages could correlate it back to him based on documents he had written previously.
Snowden was not yet ready to tell me his name, but he said he was certain to be exposed — by his own hand or somebody else’s. Until then, he asked that I not quote him at length. He said semantic analysis, another of the NSA’s capabilities, would identify him by his patterns of language.
Note also that this is not merely an NSA capability. Intelligence companies like Palantir possess this capability and build it into the products they offer.
In 2007, researchers at the University of Arizona scraped Deep Web blogs, forums, and anything else they could get their hands on in an attempt to attribute authorship. They crawled forums run by neo-nazis and devised a way of taking all that information and separating out who was who on the forums. They used stylometrics to collect enough information to group people based on a variety of features.
Types of stylometrics
There are two ways in which stylometrics can be used against you. (Both borrow their names from approaches to Natural Language Processing.) One way, called “Supervised,” takes a group of documents with known authors and uses them to attribute a document whose author is unknown. For instance, if we had a writing sample from Edward Snowden, we could attempt to attribute to him something he had posted anonymously by comparing the writing styles. This type is relatively easy for humans to do themselves.
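To make the supervised case concrete, here is a minimal sketch in Python. It assumes a tiny, hand-picked list of function words as the only feature and uses cosine similarity to pick the closest known author. The word list and the names `FUNCTION_WORDS`, `profile`, and `attribute` are mine, for illustration only; real attribution systems use far richer features and proper classifiers.

```python
import math
import re
from collections import Counter

# Illustrative only: a tiny list of common function words (an assumption,
# not a feature set taken from any published model).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it",
                  "is", "was", "i", "for", "on", "with", "but", "not"]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def attribute(unknown_text, known_samples):
    """Pick the known author whose profile is closest to the unknown text.

    known_samples maps an author name to a writing sample by that author.
    Returns (best_author, scores), where scores maps author -> similarity.
    """
    target = profile(unknown_text)
    scores = {author: cosine(target, profile(sample))
              for author, sample in known_samples.items()}
    return max(scores, key=scores.get), scores
```

Calling `attribute(anonymous_post, {"alice": alice_sample, "bob": bob_sample})` returns the closest name plus the per-author scores. Note that it always names somebody, even if the true author is not among the samples, which is one reason to treat the output of a toy like this as a hint and not as proof.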
The second type, “Unsupervised,” leverages computational power to take a pile of anonymous blocks of text and group them by writing style. For example, the neo-nazi forum research didn’t identify the individuals posting to the forum, but it did group which posts belonged to NaziA and which belonged to NaziB.
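The unsupervised case can be sketched just as roughly. The Python below is not the method the forum researchers used; it is a simple greedy grouping I am using as a stand-in. It assumes character trigram counts as the style signal and an arbitrary similarity threshold of 0.5. The names `trigram_profile` and `group_by_style` are hypothetical.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams after normalizing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two Counter-based sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_by_style(texts, threshold=0.5):
    """Greedily group texts whose trigram profiles look alike.

    Each text joins the most similar existing group if the similarity is
    at least `threshold` (an arbitrary, assumed cutoff); otherwise it
    starts a new group. Returns a list of groups of indices into `texts`.
    """
    clusters = []  # each entry: (combined profile, [member indices])
    for idx, text in enumerate(texts):
        prof = trigram_profile(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(prof, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append((prof, [idx]))
        else:
            best[0].update(prof)  # Counter.update adds the new counts
            best[1].append(idx)
    return [members for _, members in clusters]
```

Nothing here knows who the authors are. The output is only a set of groups such as "posts 0 and 2 look alike," which matches the NaziA and NaziB grouping described above. The result also depends on the order of the input and on the threshold, so a real system would use a proper clustering algorithm.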
What makes up your style
There are dozens of models that deliver a perspective on your writing style. They’ve been created by academics, and each aims to be more accurate than the last. One, called the “9 Feature Model,” takes just nine properties of the text, such as sentence length, unique words, a readability index, and other simple-to-measure features, and uses them to produce a classification. Other, more complex methods, like “Writeprints,” use hundreds of features to reach their conclusions. I’m simplifying these to feature classifiers, but they are actually extremely complicated models that use neural nets to generate their results.
If we see this as a threat, we need to mitigate it. Researchers who presented at 28C3 define a few ways to evade being fingerprinted:
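To give a feel for what “simple to measure” means, here is a minimal Python sketch that pulls a handful of such features out of a block of text. This is not the actual nine-feature set; the selection and the name `style_features` are my own assumptions. The readability score is the Automated Readability Index, chosen only because it is easy to compute.

```python
import re


def style_features(text):
    """Return a few easy-to-measure style features for a block of text."""
    # Naive sentence split on runs of ., ! or ? (good enough for a sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    chars = sum(len(w) for w in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_word_length = chars / len(words)
    return {
        "word_count": len(words),
        "avg_sentence_length": avg_sentence_length,
        "avg_word_length": avg_word_length,
        # Share of distinct words: a rough measure of vocabulary richness.
        "unique_word_ratio": len(set(words)) / len(words),
        # Automated Readability Index.
        "readability_ari": 4.71 * avg_word_length
                           + 0.5 * avg_sentence_length - 21.43,
    }
```

Run this over everything a person has written and you get a small vector of numbers per document. A classifier, or even a simple distance measure, can then compare those vectors.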
- Obfuscation: Generally make the text appear unlike your normal writing style.
- Imitation: Make your writing style look like someone else’s.
(They also recommend what they call “Translation,” which is merely taking a text, using an automatic translation service to convert it to a different language, and then converting it back. I don’t see any difference between this and “Obfuscation,” and it presents a new anonymity issue.)
These researchers even gave us some data showing how effective each type of mitigation is. They show a breakdown by stylometric model of how well each evasion tactic worked against it. They also give us a tool to help anonymize our own documents that works surprisingly well, although it’s very complicated.
There are a lot more details about how the tool works and when to use it, but for now, you can take a look at Anonymouth. This is a Java tool that takes the document you’d like to anonymize, compares it to other documents you’ve written, and then compares it against a variety of texts from other authors. The results give you a breakdown of the common words you use, the sentence structure that points to you, and even some suggestions about how to change the document to defend against attribution.
As this is an academic research tool, it’s written in Java and awkward to use, but in general it is pretty effective. The authors are now working to automate many of its features so that, instead of manually anonymizing a document, you can point the tool at a directory of documents you’d like to anonymize.