In an attempt to broaden my approach to this blogging lark, I thought I’d turn my hand to some linguistic analysis. This presents a problem in the form of my being a qualified mathematician and therefore acutely aware of how easy it is to skew any set of data based on the interpretation it’s given, and thus how pointless it becomes to really bother. Nevertheless, I shall sally forth into the Sherlock Holmes canon with a quick sweep over some of the main points, and I can always come back to it later if I feel it warrants further investigation.
All numbers come from a combination of internet-based textual analysis tool Voyant and the texts of Conan Doyle’s Holmes canon as provided at The Complete Sherlock Holmes website. Some numbers have been slightly simplified for reasons that are too dull to go into here, but feel free to pull me up on my maths in the comments if any of it seems ridiculously suspicious.
The entire canon is some 670,000 words composed of around 25,000 unique terms – or, to put it another way, you can write a list of every word ever used by Conan Doyle in writing the canon and it would only be 25,000 items long. All told this is a fairly prodigious vocabulary: famed Elizabethan wordsmith Wm. Shakespeare (putting all authorship contentions aside for the time being) is credited with around 884,000 words – practically a third as much again as Conan Doyle – yet only has around 28,000 distinct words, or around one eighth more. Now, of course, here we stumble upon our first statistical problem: there were fewer words in the English language back then (there’s quite a broad disagreement as to how many…), but then equally Shakespeare invented a great many words (‘eyeball’ for one) and so wouldn’t be restricted in quite the same way as his contemporaries. To put Conan Doyle’s written expression into another context, there are around 250,000 recognised (in use and defunct) words in the English language today; so around 100 years ago he was already using 10% of the modern version of the language. Consider the number of words that have been added in that time – robot, computer, internet, unleaded, twerking – and the rate at which English has expanded and that’s a pretty impressive sweep.
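For anyone wanting to check my sums, the comparisons above come down to three divisions. This is only a sketch using the rounded figures quoted in this post (the variable names are mine), so treat the results as approximate:

```python
# Rounded figures as quoted in the post; none of these are exact counts.
doyle_words, doyle_vocab = 670_000, 25_000
shakespeare_words, shakespeare_vocab = 884_000, 28_000
english_vocab = 250_000

# How much more Shakespeare wrote, and how much bigger his vocabulary was.
extra_words = shakespeare_words / doyle_words - 1
extra_vocab = shakespeare_vocab / doyle_vocab - 1

# Conan Doyle's vocabulary as a share of the modern English word count.
share_of_english = doyle_vocab / english_vocab

print(f"Shakespeare: {extra_words:.1%} more words, {extra_vocab:.1%} more distinct words")
print(f"Conan Doyle used {share_of_english:.0%} of the modern word count")
```

The first ratio comes out a shade under a third and the second just under an eighth, which is where the fractions above come from.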
Nevertheless, the stories were and have remained hugely popular and so must be accessible. This means that Conan Doyle’s prose must have been easy to read and must have contained ideas that are easy to understand even over a century later. Here are Cirrus displays (word clouds) produced by Voyant of the most popular words in each of the Holmes novels or collections, filtered to remove the most common function words (and, it, of, the, to, etc.), around fifty other common words (if, is, with, etc.) and all numbers. I have also included (without these exclusions) the approximate total words for each book and the percentage of unique words in each book; to put this in context, a 100-word story with 10% of the words being unique would effectively be composed of the same ten words repeated ten times each:
A Study in Scarlet (1887)
44,000 words; 14% unique
The Sign of Four (1890)
44,000 words; 14% unique
The Adventures of Sherlock Holmes (1892)
106,000 words; 9% unique
The Memoirs of Sherlock Holmes (1893)
88,000 words; 9% unique
The Hound of the Baskervilles (1901-1902)
60,000 words; 10% unique
The Return of Sherlock Holmes (1905)
114,000 words; 8% unique
The Valley of Fear (1914-1915)
58,000 words; 11% unique
His Last Bow (1917)
69,000 words; 11% unique
The Case-Book of Sherlock Holmes (1927)
84,000 words; 10% unique
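If you fancy reproducing the two figures quoted for each book, something along these lines would get you close. It is a rough sketch and not what Voyant does internally: the `tokenize` and `summarise` functions and the tiny `STOP_WORDS` set are my own stand-ins (Voyant's real stop list is far longer), and a different way of splitting words will shift the numbers a little.

```python
import re
from collections import Counter

# Tiny illustrative stop list; an assumption, not Voyant's actual list.
STOP_WORDS = {"and", "it", "of", "the", "to", "if", "is", "with", "a", "in"}

def tokenize(text):
    """Lower-case the text and keep runs of letters, so numbers drop out."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

def summarise(text, top_n=10):
    """Return total words, percentage unique, and the top filtered words."""
    words = tokenize(text)
    total = len(words)
    percent_unique = 100 * len(set(words)) / total if total else 0.0
    # The stop list only affects the ranking, not the totals.
    ranked = Counter(w for w in words if w not in STOP_WORDS)
    return total, percent_unique, ranked.most_common(top_n)

# The 100-word story from above: the same ten words, ten times each.
ten_words = ["one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "ten"]
total, percent_unique, top = summarise(" ".join(ten_words * 10))
print(total, percent_unique)
```

Feeding it the full text of a book in place of the toy story should give figures in the same region as those listed, give or take the tokenising differences.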
Mainly they’re nice images, but even a quick glance through them shows the same words coming up time and again in larger font and hence being more frequent in each text. ‘Holmes’ is perhaps unsurprising; it is never any lower than the fourth most popular word (given the aforementioned exclusions) in any text, and only as low as fourth once, in The Hound of the Baskervilles, which he’s barely in anyway. Perhaps also unsurprising is the preponderance of ‘man’ – always in the top three most frequent, with ‘woman’ and its synonyms always a good few places lower (I’m afraid I didn’t carry out a rigorous check on this point) though becoming more popular as the stories go on – you may draw your own conclusions. And ‘said’ is unquestionably Conan Doyle’s speech attribution verb of choice – yes, people have been known to ‘ejaculate’ in the Holmes stories, but most of the time he appears to have followed the advice of William Strunk, Jr. and kept to simple forms as it’s always one of the top two most frequently used words.
A few surprises crop up – ‘little’ is one of the ten most frequent words in five of the books, as is ‘time’ in seven of them – but mainly things follow a fairly clear pattern. If you exclude character names and story-specific proper nouns (Baskerville, Dartmoor, etc.), the ten most popular words from each of the nine books, when collated, actually form a list of only 23 words. Ordinarily this wouldn’t be too surprising – ‘the’ is the most common word in English and any meaningful subsection thereof, making up around 6% of everything written – but remember that I’ve excluded the most common standard words (I’ll save you the process of how) to focus on the words specific to Conan Doyle’s writing. And in a series that ran for 40 years – while not continuously – he maintained the same focus on the basics of language to get his points across.
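To give that 23 some scale: ten words from each of nine books could in principle collate to anything from 10 distinct words (identical lists) to 90 (no overlap at all). The collating itself is simple enough, and a sketch of it follows. The `collate` function is my own and the three short lists are invented purely for illustration; they are not the real Voyant output.

```python
from collections import Counter

def collate(top_lists, excluded=frozenset()):
    """Count how many books' top lists each word appears in."""
    counts = Counter()
    for words in top_lists.values():
        # A set, so a word counts at most once per book.
        counts.update(set(words) - excluded)
    return counts

# Invented lists for illustration only; not the real per-book rankings.
toy_lists = {
    "Book A": ["said", "holmes", "man", "little"],
    "Book B": ["said", "man", "time", "baskerville"],
    "Book C": ["holmes", "said", "time", "little"],
}

# Names and places are dropped before counting, as described above.
counts = collate(toy_lists, excluded={"holmes", "baskerville"})
print(len(counts), "distinct words:", counts.most_common())
```

With the real top-ten lists in place of the toy ones, `len(counts)` is the figure that came out at 23, and the counts themselves show which words turn up in five, seven or all nine books.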
As to the percentages: in a paper published in 2005 concerning the word density of texts for younger readers, E.H. Hiebert reported that the optimal density of new words for readers at Grade 2 in the U.S. system (so around age 8 for the rest of us) is between 8 and 11 percent, increasing gradually as the students age. With the exception of the first two Holmes novellas – which, being shorter, would have offered less chance for repetition anyway – Conan Doyle’s writing falls exactly within these bounds. So he’s using a density of repeated terms that is optimal for the average 8-year-old reader…a point in favour of the accessibility of these texts if ever there was one (though, yes, there are naturally several refutations on this point).
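The check against those bounds is easy to run for yourself. This sketch uses the rounded percentages listed earlier in the post and treats the 8 to 11 percent band as inclusive at both ends, which is my assumption and not necessarily how the paper frames it:

```python
# Percentage of unique words per book, as listed earlier in the post.
percent_unique = {
    "A Study in Scarlet": 14,
    "The Sign of Four": 14,
    "The Adventures of Sherlock Holmes": 9,
    "The Memoirs of Sherlock Holmes": 9,
    "The Hound of the Baskervilles": 10,
    "The Return of Sherlock Holmes": 8,
    "The Valley of Fear": 11,
    "His Last Bow": 11,
    "The Case-Book of Sherlock Holmes": 10,
}

LOW, HIGH = 8, 11  # band quoted above, assumed inclusive

inside = [title for title, p in percent_unique.items() if LOW <= p <= HIGH]
outside = [title for title, p in percent_unique.items() if not LOW <= p <= HIGH]

print(f"{len(inside)} of {len(percent_unique)} books fall within the band")
print("Outside:", ", ".join(outside))
```

Note that The Return of Sherlock Holmes and the two 11% books sit right on the edges, so unrounded figures could nudge any of them either way.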
Now, we’ll obviously shy away from the correlation/causation argument, but it would be hard to deny that at least a part of the enduring appeal of these stories is their clear-sighted ability to put their points across in accessible language. Yes, the content plays an arguably greater part, but if you can’t understand what’s going on when you read something then you’re unlikely to care how revolutionary it is (I have this exact problem with Wuthering Heights).
I could go on – I feel like I’m only just getting started – but my blog-writing box informs me that I’ve just stepped over 1000 words for the first time in a post and I’ve doubtless already tried the patience of anyone hardy enough to get this far (I at least know that Puzzle Doctor will have wanted to keep an eye on my numbers…). What have we learned? Well, if we’re honest, nothing really. You know that the Holmes stories are written in a direct and accessible style, you know this is part of their appeal; hopefully this reinforces that in some small way. I can already feel the itch to do some contextual analysis of Conan Doyle’s other works and the work of his contemporaries to put this in its appropriate place, but it’s doubtless a dissertation on the internet somewhere, and I lack the time.
On Tuesday: the final hurrah of Sherlock week with the final impossible crime ‘The Problem of Thor Bridge’. The main problem in this context being that it’s not an impossible crime, but more on that in a couple of days…