
The Machines are Taking Over
Aditya, Student, IISER Berhampur
The attached photograph[1], along with its caption, was the most popular post on the subreddit r/SubredditSimulator at the time this article was written. Seems innocuous enough: just a few cats and a generic caption. The catch? Every post on r/SubredditSimulator is made by a machine, and so is every one of the many comments beneath it. Moreover, a new post appears every three minutes. Is there someone, or perhaps a group of people, feeding these seemingly random posts into the bot, or have machines finally reached the level of sentience where they can post about cats on the internet?

To answer that question, we must address a different and seemingly unrelated one. When reading an author's works, we see what the words show us, and every author paints their world in a unique style. Can this style be quantified mathematically? A rhetorical question, of course, and one very much related to how machines can post "original" content: both questions are about mathematics intermingling with linguistics.

We take any text from any author, in a known (or perhaps unknown) language, and analyse it as follows: we look not at the probability of a character being used on its own, but at the probability of a character appearing given the character before it. This might seem a bit confusing at first, so I'll illustrate it with a reduced example[2]. (Warning: I'm oversimplifying.) Rather than considering a book, consider a sample with only four characters; a strand of DNA, perhaps? (This is not a test of your DNA, of course, so don't be alarmed if you try this mathematical assay on a small segment of DNA from your hair, compare it with the segment shown below, and get an unexpected result.)
GATCATTGATATGTTGCTAGAACTATGAGT
GTTAAAGGTGCTTGTGGTGAGTTATCAGA
CAGAAACGCAGAAGATGTTATTGGAAGCTT
GAGGAAAAGTGATCCTGGATTTACAGTGC
CAAGAATTGGCCTGTATTGTGTTCTCAATGTT
TTTGAGGAAGGTAGAAACTGTAAGTGATGA
In this case we have four possible characters: A, C, G and T. If we count the occurrences of A in this segment of DNA (read as one continuous strand), we find that A is followed by another character 53 times: by another A 17 times, or 32.1% of the time, by a C only 5 times (9.4%), by a G 17 times (32.1%) and by a T 14 times (26.4%). Tallying C, G and T in the same way, we can construct a full 4x4 transition matrix of counts.
This can be turned into a matrix of probabilities by dividing each count by the total for its row; e.g. for A followed by another A we have 17/(17+5+17+14) = 0.320755.
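To make the construction concrete, here is a minimal Python sketch (not part of the original example; the variable names are mine) that builds both the count matrix and the probability matrix from the strand above, with its line breaks stripped so it is treated as one continuous sequence:

```python
# The DNA segment from the example, written as one continuous strand.
dna = (
    "GATCATTGATATGTTGCTAGAACTATGAGT"
    "GTTAAAGGTGCTTGTGGTGAGTTATCAGA"
    "CAGAAACGCAGAAGATGTTATTGGAAGCTT"
    "GAGGAAAAGTGATCCTGGATTTACAGTGC"
    "CAAGAATTGGCCTGTATTGTGTTCTCAATGTT"
    "TTTGAGGAAGGTAGAAACTGTAAGTGATGA"
)

alphabet = "ACGT"

# Count how often each character is immediately followed by each other character.
counts = {a: {b: 0 for b in alphabet} for a in alphabet}
for prev, nxt in zip(dna, dna[1:]):
    counts[prev][nxt] += 1

# Normalise each row so it sums to 1, giving transition probabilities.
probs = {}
for a in alphabet:
    row_total = sum(counts[a].values())
    probs[a] = {b: counts[a][b] / row_total for b in alphabet}

print(counts["A"])       # {'A': 17, 'C': 5, 'G': 17, 'T': 14}
print(probs["A"]["A"])   # 0.32075... (= 17/53)
```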
We make a similar matrix, but rather than using the aforementioned string of DNA, we use a full-fledged book. Khmelev (2000) showed that removing punctuation and words with capitalised first letters yields better results, so we do just that to the text before constructing a transition matrix like the one above, except 27x27 (26 letters plus the space character) rather than 4x4. Thus, by compiling an author's works into a single 27x27 matrix, we obtain a mathematical quantification of their writing style. If we want to verify the authorship of a manuscript, we construct the transition matrix for each potential author, compare it with the transition matrix of the manuscript, and voilà: crunching a few numbers, we can assign authorship to the document. The exact mathematical workings, along with additional references, are given in [3]; a toy version of the comparison is sketched at the end of this article.

Back to Subreddit Simulator. What it does is, in some ways, the opposite of the authorship test, though it's a wee bit more complex. Rather than analysing letter groupings, it analyses word groupings: it compiles a library of all the human-generated text posted to a particular subreddit within a certain period of time (say, three months) and creates a transition matrix specific to that very subreddit. Then one word is chosen at random from the library, with a bias towards words that more frequently appear as the first word of a post; this is the start word. The second word of the sentence is chosen by assigning to every word that has ever followed the start word a probability proportional to how often it did so, and sampling from that biased distribution. The same process is repeated until either a period is reached or the post exceeds a certain pre-determined word limit. (A minimal sketch of such a generator is also given at the end of this article.)

Therefore, thankfully or not, we have not yet built machines advanced enough to post about cats on the internet of their own accord (the cornerstone achievement of human society). Either that, or sentient machines exist and are smart enough to play to our expectations, waiting for a ripe opportunity to swipe away from humanity the moniker of "most intelligent organism." Who knows.
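Going back to the authorship test: the comparison step in [3] involves more machinery than fits here, so the sketch below uses a simple stand-in of my own choosing (not necessarily the scoring used there): ask under which candidate author's transition matrix the manuscript's letter sequence is most probable. It assumes the texts have already been cleaned as described above, down to lower-case letters and spaces; all function names are illustrative.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def transition_probs(text, smoothing=1.0):
    """27x27 transition probabilities for a cleaned text (letters and spaces only)."""
    counts = {a: {b: smoothing for b in ALPHABET} for a in ALPHABET}
    for prev, nxt in zip(text, text[1:]):
        if prev in counts and nxt in ALPHABET:
            counts[prev][nxt] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs

def log_likelihood(manuscript, probs):
    """How well a given author's transition matrix explains the manuscript."""
    score = 0.0
    for prev, nxt in zip(manuscript, manuscript[1:]):
        if prev in probs and nxt in ALPHABET:
            score += math.log(probs[prev][nxt])
    return score

def most_likely_author(manuscript, author_texts):
    """author_texts: dict mapping an author's name to the concatenation of their known works."""
    scores = {name: log_likelihood(manuscript, transition_probs(text))
              for name, text in author_texts.items()}
    return max(scores, key=scores.get)
```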
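As for Subreddit Simulator itself, here is a minimal sketch of the kind of word-level chain described above. It is a toy under my own simplifying assumptions, not the bot's actual code: posts are split on whitespace, the start word is biased by how often each word opens past posts, and generation stops at a period or a word limit, just as outlined in the text.

```python
import random
from collections import defaultdict

def train(posts):
    """Build word-level transition counts from a list of past posts."""
    starts = defaultdict(int)                         # how often each word starts a post
    follows = defaultdict(lambda: defaultdict(int))   # word -> {next word: count}
    for post in posts:
        words = post.split()
        if not words:
            continue
        starts[words[0]] += 1
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return starts, follows

def weighted_choice(counter):
    """Pick a key at random, biased by its count."""
    words = list(counter)
    weights = [counter[w] for w in words]
    return random.choices(words, weights=weights)[0]

def generate(starts, follows, max_words=50):
    """Generate one post: biased start word, then walk the chain word by word."""
    word = weighted_choice(starts)
    sentence = [word]
    while len(sentence) < max_words:
        nexts = follows.get(word)
        if not nexts:                 # dead end: nothing has ever followed this word
            break
        word = weighted_choice(nexts)
        sentence.append(word)
        if word.endswith("."):        # stop at a period, as described above
            break
    return " ".join(sentence)

# Example usage with a tiny toy corpus:
posts = [
    "my cat sat on the keyboard.",
    "my dog ate my homework.",
    "the cat is plotting something.",
]
starts, follows = train(posts)
print(generate(starts, follows))
```

Feed it a few months of real posts instead of the three toy sentences above and the output starts to take on the flavour of a particular subreddit, which is exactly the trick the simulator relies on.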





