Comment on LLMs can unmask pseudonymous users at scale with surprising accuracy

AllNewTypeFace@leminal.space 4 months ago

This seems to mostly scale up stylometry (the method of identifying authorship by writing style), a long-established technique. It unmasked the Unabomber in the 90s, as well as the anonymous author of a scandalous book about the Clinton administration. Indeed, one way some writers dodge this is to deliberately write in character in a contrived style (there was an information-security poster on Twitter whose style was modelled on Taylor Swift, for example).
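To make the stylometry idea concrete, here is a minimal sketch in Python. It is only an illustration, not the method from the article: it assumes a tiny hand-picked list of function words (`FUNCTION_WORDS`) and compares texts by the cosine similarity of their word-frequency profiles. The names `profile`, `cosine_similarity` and `most_similar_author` are made up for this example; real stylometry uses far richer features and much more text.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative feature set. Real systems use hundreds of features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but", "which", "while"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_author(unknown_text, known_texts):
    """Given {author: sample text}, return the author whose profile is closest."""
    target = profile(unknown_text)
    return max(
        known_texts,
        key=lambda name: cosine_similarity(target, profile(known_texts[name])),
    )
```

With only a handful of words per sample this is little more than a toy; the point is that the features are habits of phrasing rather than anything the writer says about themselves.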

As all things are an arms race, a countermeasure to this would be a locally-hosted language model that can rephrase text into a more neutral style. Install it on your phone, select the text you’ve written and have it rewritten, yielding something without the regionalisms, turns of phrase or other peculiarities of your writing style that you wouldn’t notice but that would identify you given a large enough corpus of your writings. A voice changer for text, if you will.

lemmysmash@beehaw.org 4 months ago
From the article, it seems that it’s not even stylometry but the extraction of profile features from a large amount of text. So, for example, if I have my full true profile somewhere where I never mention something like BDSM, but in another place I have a blog specifically about BDSM where I intentionally (and, let’s assume, effectively) omit or change every single detail about myself, then, in theory, this particular technique should fail.

But yes, nothing prevents people from using LLMs in the same way for stylometry (and I’m 101% sure that those who are interested in that are already doing so). And yes, a local “rewriter” LLM would help to some extent, but I think there has been other research somewhere showing that LLM-produced text makes it possible to, if not completely recover the original prompt, then at least kind of fingerprint it, so… I wouldn’t fully trust that method either :)

- AllNewTypeFace@leminal.space 4 months ago
  It mentions style as being among the data points used, along with personal details. But if your hidden account is used for things like whistleblowing or niche erotica, you may not mention telltale biographical details very often, while you can’t help writing the way you write, with numerous unconscious choices between alternative ways of phrasing things. Those choices will be the bulk of what it has to work with.
  
  Of course, that doesn’t mean you couldn’t slip up, so if you don’t want your posts traced back to you, also look out for any details you’re leaking and file the serial numbers off them (and perhaps rig up a way of delaying your posts so that they go out outside of your waking hours).
ParlimentOfDoom@piefed.zip 4 months ago
The Unabomber was identified because his own brother recognized the way he misspoke a common idiom “can’t eat your cake and have it too”. Attributing that to an investigative technique is a bit of a stretch.