Stylometry: A Bird's Eye View

Posted 1 month, 3 weeks ago | Originally written on 6 Mar 2024

Now that large language models (LLMs) have become part and parcel of how we write, there is a danger that linguistic diversity in style will be lost as more and more people lean on LLMs to expand on their ideas. I'll readily admit that I have used them in non-essential contexts, but I'll be categorical in stating that the content of this blog is exclusively an expression of my own writing style.

It will therefore become essential to develop stylometric tools which can distinguish one writer from another as well as flag content that is likely to be the result of heavy LLM use. Writers may now be called upon to demonstrate the development of their ideas through an evolving log of their work, in which it will be evident how the work has naturally progressed with time.

In this post, however, I would like to outline what I believe would be a pragmatic approach towards such stylometric analysis.

Input

Obviously, the input would be a block of text: the more the merrier. Whether formatting is important is a good question; I can see how an author's attention to formatting detail may serve as an identifying marker.
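To make the formatting question concrete, here is a minimal Python sketch that counts a few formatting habits in a block of text. The function name formatting_features and the particular habits it counts are assumptions chosen purely for illustration; whether any of them actually discriminates between authors is an open question.

```python
import re


def formatting_features(text: str) -> dict:
    """Count a handful of formatting habits (an illustrative, assumed feature set)."""
    # Paragraphs are taken to be runs of text separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        # Two or more spaces between a full stop and the next word.
        "double_space_after_period": len(re.findall(r"\.  +\S", text)),
        # Runs of hyphens, or en/em dashes, used as a dash.
        "dash_runs": len(re.findall(r"-{2,}|[\u2013\u2014]+", text)),
        "semicolons": text.count(";"),
        "paragraphs": len(paragraphs),
    }


sample = "First point.  Second point; third -- fourth.\n\nNew paragraph."
features = formatting_features(sample)
```

Counts like these would only become useful once normalised by text length and combined with lexical features, but they show how formatting could enter the input alongside the words themselves.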

Output

What should such a tool output?

  • A measure of uncertainty in the result, expressed as a (linear or non-linear) function of the input size; e.g. a single sentence would carry a very high level of uncertainty, while a paragraph or two should give an adequate estimate of identity.
  • The measure may be assigned at a more granular level, e.g. word, sentence, paragraph, section, etc.
  • It should be possible to get an a priori estimate for a single text, independent of any other text, so that two such estimates can be compared to gauge the degree to which both texts were generated by the same source.
  • Text segmentation: it should be possible to identify if multiple sources/authors contributed to a single document.
  • Global writing score: perhaps it should provide some measure indicating to what extent an author can continue developing their skill.
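Several of the outputs above can be sketched in a few lines of Python. What follows is a rough illustration under stated assumptions, not a definitive design: the tiny FUNCTION_WORDS list, the scale parameter and the decay formula in uncertainty, and the use of cosine similarity are all choices made here for the sake of the example. It shows a standalone profile for one text, an uncertainty that shrinks non-linearly as the input grows, a comparison between two profiles, and a paragraph-level pass that hints at where the source may have changed.

```python
import math
import re
from collections import Counter

# Hypothetical and deliberately tiny; a real tool would need a far richer feature set.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "i")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text):
    """Relative frequency of each function word; computed from one text alone."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def uncertainty(text, scale=50.0):
    """Assumed non-linear decay: 1.0 for empty input, 0.5 at `scale` tokens."""
    return scale / (scale + len(tokenize(text)))


def similarity(p, q):
    """Cosine similarity between two profiles; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


def paragraph_scores(document):
    """Similarity of each paragraph to the previous one; low values hint at a new source."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    profiles = [profile(p) for p in paragraphs]
    return [similarity(a, b) for a, b in zip(profiles, profiles[1:])]


text_a = "It is the end of the day and I think that it is time to go."
text_b = "The cat sat in the sun."
score = similarity(profile(text_a), profile(text_b))
confidence_gap = uncertainty(text_b) - uncertainty(text_a)
```

Because text_b is shorter than text_a, its uncertainty is higher, so confidence_gap is positive. A similarity score on its own says little; it would need to be read alongside the uncertainty of both inputs.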