#35: The Language of Sherlock Holmes – A Study in Consistency


In an attempt to broaden my approach to this blogging lark, I thought I’d turn my hand to some linguistic analysis.  This presents a problem in the form of my being a qualified mathematician and therefore acutely aware of how easy it is to skew any set of data based on the interpretation it’s given, and thus how pointless it becomes to really bother.  Nevertheless, I shall sally forth into the Sherlock Holmes canon with a quick sweep over some of the main points, and I can always come back to it later if I feel it warrants further investigation.

All numbers come from a combination of internet-based textual analysis tool Voyant and the texts of Conan Doyle’s Holmes canon as provided at The Complete Sherlock Holmes website.  Some numbers have been slightly simplified for reasons that are too dull to go into here, but feel free to pull me up on my maths in the comments if any of it seems ridiculously suspicious.

The entire canon is some 670,000 words composed of around 25,000 unique terms – or, to put it another way, you can write a list of every word ever used by Conan Doyle in writing the canon and it would only be 25,000 items long.  All told this is a fairly prodigious vocabulary: famed Elizabethan wordsmith Wm. Shakespeare (putting all authorship contentions aside for the time being) is credited with around 884,000 words – practically a third as much again as Conan Doyle – yet only has around 28,000 distinct words, or around one eighth more.  Now, of course, here we stumble upon our first statistical problem: there were fewer words in the English language back then (there’s quite a broad disagreement as to how many…), but then equally Shakespeare invented a great many words (‘eyeball’ for one) and so wouldn’t be restricted in quite the same way as his contemporaries.  To put Conan Doyle’s written expression into another context, there are around 250,000 recognised (in use and defunct) words in the English language today; so around 100 years ago he was already using 10% of the modern version of the language.  Consider the number of words that have been added in that time – robot, computer, internet, unleaded, twerking – and the rate at which English has expanded and that’s a pretty impressive sweep.
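For anyone who wants to check those comparisons, here is a minimal sketch of the arithmetic in Python. It uses only the rounded figures quoted above; the variable names are my own.

```python
# Rounded figures quoted in the paragraph above.
doyle_words, doyle_unique = 670_000, 25_000
shakespeare_words, shakespeare_unique = 884_000, 28_000
english_words = 250_000

length_ratio = shakespeare_words / doyle_words    # how much longer Shakespeare's output is
vocab_ratio = shakespeare_unique / doyle_unique   # how much larger his vocabulary is
share_of_english = doyle_unique / english_words   # Conan Doyle's slice of modern English

print(f"Shakespeare wrote {length_ratio - 1:.0%} more words")
print(f"using {vocab_ratio - 1:.0%} more distinct words")
print(f"Conan Doyle's vocabulary is {share_of_english:.0%} of modern English")
```

So "a third as much again" and "one eighth more" are both fair roundings of roughly 32% and 12%.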

Nevertheless, the stories were and have remained hugely popular and so must be accessible.  This means that Conan Doyle’s prose must have been easy to read and must have contained ideas that are easy to understand even over a century later.  Here are cirrus displays produced by Voyant of the most popular words in each of the Holmes novels or collections, filtered to remove the most common function words (and, it, of, the, to, etc.), around fifty other common words (if, is, with, etc.) and all numbers.  I have also included (without these exclusions) the approximate total words for each book and the percentage of unique words in each book; to put this in context, a 100-word story with 10% of the words being unique would effectively be composed of the same ten words repeated ten times each:

A Study in Scarlet (1887)

44,000 words; 14% unique

1. Cirrus SIS

The Sign of Four (1890)

44,000 words; 14% unique

2. Cirrus SoF

The Adventures of Sherlock Holmes (1892)

106,000 words; 9% unique

3. Cirrus Adv

The Memoirs of Sherlock Holmes (1893)

88,000 words; 9% unique

4. Cirrus Mem

The Hound of the Baskervilles (1901-1902)

60,000 words; 10% unique

5. Cirrus HotB

The Return of Sherlock Holmes (1905)

114,000 words; 8% unique

6. Cirrus Return

The Valley of Fear (1914-1915)

58,000 words; 11% unique

7. Cirrus VoF

His Last Bow (1917)

69,000 words; 11% unique

8. Cirrus HLB

The Case-Book of Sherlock Holmes (1927)

84,000 words; 10% unique

9. Cirrus CB

Mainly they’re nice images, but even a quick glance through them shows the same words coming up time and again in larger font and hence being more frequent in each text.  ‘Holmes’ is perhaps unsurprising; it is never any lower than the fourth most popular word (given the aforementioned exclusions) in any text, and only as low as fourth once, in The Hound of the Baskervilles, which he’s barely in anyway.  Perhaps also unsurprising is the preponderance of ‘man’ – always in the top three most frequent, with ‘woman’ and its synonyms always a good few places lower (I’m afraid I didn’t carry out a rigorous check on this point) though becoming more popular as the stories go on – you may draw your own conclusions.  And ‘said’ is unquestionably Conan Doyle’s speech attribution verb of choice – yes, people have been known to ‘ejaculate’ in the Holmes stories, but most of the time he appears to have followed the advice of William Strunk, Jr. and kept to simple forms as it’s always one of the top two most frequently used words.
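The figures behind each display (total words, percentage of unique words, and the most frequent words once the common ones are filtered out) can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it is not how Voyant works internally, the stoplist is a tiny hypothetical stand-in for the real one, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical stoplist for illustration; the list used for the displays is much longer.
STOPWORDS = {"a", "and", "if", "is", "it", "of", "the", "to", "with"}

def tokenize(text):
    """Lower-case the text and keep runs of letters, dropping numbers and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def summarise(text, top_n=3):
    """Return total words, percentage of unique words, and the top filtered words."""
    words = tokenize(text)
    total = len(words)
    percent_unique = 100 * len(set(words)) / total
    filtered = Counter(w for w in words if w not in STOPWORDS)
    return total, percent_unique, filtered.most_common(top_n)

# Invented sample text, not a quotation from the canon.
sample = "Holmes said it is the man. The man said to Holmes, said Watson."
total, percent_unique, top = summarise(sample)
print(total, round(percent_unique, 1), top)
```

Note that the totals and the unique percentage are computed before filtering, matching the way the per-book figures above are quoted, while the top words are taken after it. The crude tokeniser splits contractions such as "don't" in two, so real counts would differ slightly.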

A few surprises crop up – ‘little’ is one of the ten most frequent words in five of the books, as is ‘time’ in seven of them – but mainly things follow a fairly clear pattern.  If you exclude character names and story-specific proper nouns (Baskerville, Dartmoor, etc.), the ten most popular words from each of the nine books collate into a list of only 23 words.  Ordinarily this wouldn’t be too surprising – ‘the’ is the most common word in English and any meaningful subsection thereof, making up around 6% of everything written – but remember that I’ve excluded the most common standard words (I’ll save you the process of how) to focus on the words specific to Conan Doyle’s writing.  And in a series that ran for 40 years – though not continuously – he maintained the same focus on the basics of language to get his points across.
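To see why 23 is a small number here: nine top-ten lists with no overlap at all would collate into 90 words, and nine identical lists into 10. The collation itself is just a set union, sketched below in Python with toy lists that are invented for illustration and are not the real results.

```python
# Toy top-word lists, invented for illustration; NOT the real per-book results.
top_words_by_book = {
    "Book A": ["said", "man", "little", "time"],
    "Book B": ["said", "man", "time", "come"],
    "Book C": ["man", "said", "little", "know"],
}

# Every word that appears in at least one list.
collated = set().union(*top_words_by_book.values())

# Words that appear in every list.
in_every_book = set.intersection(*(set(ws) for ws in top_words_by_book.values()))

print(len(collated), sorted(collated))
print(sorted(in_every_book))
```

The closer the size of the collated set sits to the length of a single list, the more consistent the vocabulary is from book to book.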

As to the percentages: in a paper published in 2005 concerning the word density of texts for younger readers, E.H. Hiebert reported that the optimal density of new words for readers at Grade 2 in the U.S. system (so around age 8 for the rest of us) is between 8 and 11 percent, increasing gradually as the students age.  With the exception of the first two Holmes novellas – which, being shorter, would have offered less chance for repetition anyway – Conan Doyle’s writing falls exactly within these bounds.  So he’s using a density of repeated terms that is optimal for the average 8-year-old reader… a point in favour of the accessibility of these texts if ever there was one (though, yes, there are naturally several refutations on this point).
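The claim is easy to check against the per-book figures quoted earlier. A minimal sketch in Python, treating each book's unique-word percentage as the number to compare with the 8 to 11 percent band (the data are the rounded values from this post; the names are my own):

```python
# (approximate total words, percent unique) as quoted for each book above.
books = {
    "A Study in Scarlet": (44_000, 14),
    "The Sign of Four": (44_000, 14),
    "The Adventures of Sherlock Holmes": (106_000, 9),
    "The Memoirs of Sherlock Holmes": (88_000, 9),
    "The Hound of the Baskervilles": (60_000, 10),
    "The Return of Sherlock Holmes": (114_000, 8),
    "The Valley of Fear": (58_000, 11),
    "His Last Bow": (69_000, 11),
    "The Case-Book of Sherlock Holmes": (84_000, 10),
}

LOW, HIGH = 8, 11  # the band reported for Grade 2 readers

outside = [title for title, (_, pct) in books.items() if not LOW <= pct <= HIGH]
total_words = sum(words for words, _ in books.values())

print(outside)
print(total_words)
```

Only the two early novellas fall outside the band, and the per-book totals add up to 667,000 words, which squares with the "some 670,000" quoted for the whole canon.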

Now, we’ll obviously shy away from the correlation/causation argument, but it would be hard to deny that at least a part of the enduring appeal of these stories is their clear-sighted ability to put their points across in accessible language.  Yes, the content plays an arguably greater part, but if you can’t understand what’s going on when you read something then you’re unlikely to care how revolutionary it is (I have this exact problem with Wuthering Heights).

I could go on – I feel like I’m only just getting started – but my blog-writing box informs me that I’ve just stepped over 1000 words for the first time in a post and I’ve doubtless already tried the patience of anyone hardy enough to get this far (I at least know that Puzzle Doctor will have wanted to keep an eye on my numbers…).  What have we learned?  Well, if we’re honest, nothing really.  You know that the Holmes stories are written in a direct and accessible style, you know this is part of their appeal; hopefully this reinforces that in some small way.  I can already feel the itch to do some contextual analysis of Conan Doyle’s other works and the work of his contemporaries to put this in its appropriate place, but it’s doubtless a dissertation on the internet somewhere, and I lack the time.

On Tuesday: the final hurrah of Sherlock week with the final impossible crime ‘The Problem of Thor Bridge’.  The main problem in this context being that it’s not an impossible crime, but more on that in a couple of days…

15 thoughts on “#35: The Language of Sherlock Holmes – A Study in Consistency”

  1. Really enjoyed reading this post and it must have taken a lot of time and effort to do the cirrus displays (being an English Literature degree person I was going to call them the nice wordy pictures but I figured a mathematically minded person might baulk at that). Something that stood out for me in your analysis was the fact that the language used is very concrete as opposed to abstract and there is not a great deal of emotion expressed apart from the odd ‘cried’. This might of course be a way of making it an easier read, but I also wondered whether this is another way detective literature at this time was trying to steer away from sensation fiction a bit and emphasise instead the idea of physical clues, reasoning, logic and data. I suppose also the precise nature of the language links into the more scientific approach Holmes tries to use in his investigation (which I believe Ian Ousby in his book Bloodhounds of Heaven picks up on). The vogue for scientific detectives at the time would suggest Doyle wasn’t the only one doing this. On a final note, originally A Study in Scarlet was going to be called A Tangled Skein – which could be classed as a weaker but definitely more abstract title, but the fact it wasn’t used and a much simpler, concrete title was used instead I think reflects the sensing and concrete language used in the stories. Sorry for blithering on so much, hadn’t quite realised I had written so much!


    • That’s not blithering at all, Kate, you make a series of excellent points. I have very little experience with late-Victorian literature full stop, but am particularly ignorant of the context of sensationalist writing from that time and you’re quite correct about the lack of fully emotive language in evidence (these not being the types of words I excluded). Reasoning and logic certainly seem to be the touchstones of the Holmes series, and ‘think’ is a popular word in the canon – whereas ‘feel’ or ‘imagine’ and their equivalent forms come decidedly further down. It would be interesting to see how someone like R. Austin Freeman compares, or to take the text of something sensationalist from the same era (any suggestions?) and see what it turns up under similar conditions.

      Thank-you, this has really got me thinking now!


      • Although sensation fiction can be a little annoying with its weak-kneed ninnies for heroines, as a subgenre I have become increasingly conscious of how important it was in contributing to detective fiction, especially with writers like Wilkie Collins and Charles Dickens blurring the boundaries considerably in novels such as The Woman in White and Bleak House. I think a key commonality between the genres is that they both suggest that respectability can be skin deep and also that murder and other crimes really do begin at home. Sensation fiction though did I think manage to vamp up the figure of the villain and make it more three-dimensional than melodrama tended to do at the time. The primary difference though is that crime fiction tends to focus on how the crime or mystery is uncovered, whereas in sensation fiction you tend to know the mystery early on but are reading to find out what the bad character does next and how they will get their comeuppance. Another difference of course is the use of detectives: in sensation fiction such as Lady Audley’s Secret it is a relative who does the detecting, whereas in crime fiction detectives can be relatives, policemen or amateurs etc. I think there would be a very stark contrast between an example of sensation fiction and a story from Freeman. Don’t know of any sensation fiction short stories off the top of my head but Lady Audley’s Secret would be a good example of sensation fiction tropes and language and is also considerably shorter than The Woman in White, which I swear has nearly 200 pages of padding at the end where the previous 400-odd pages are recounted (despite that though it is a good read in other respects).
If you’re interested in finding out a bit more about Victorian crime fiction (melodrama, real life crimes, sensation fiction and detective fiction and how it developed and was intertwined) Judith Flanders’s The Invention of Murder is a really good, not too long and informative read and a bit like Martin Edwards’ book it isn’t a list of facts but more of a narrative.


      • The sensation novelists are worth checking out. Sheridan le Fanu’s WYLDER’S HAND is particularly good and would be a good starting point. He was better known for his superb gothic fiction but he wrote sensation novels as well, and very good ones.

        Wilkie Collins is excellent. Mary Braddon’s LADY AUDLEY’S SECRET is also not bad.

        I’d start with WYLDER’S HAND – there’s less padding.


  2. Not a mathematician, but as someone who has spent quite some time with (socio-)linguistics, I found this a highly entertaining post. I remember that quite some years ago, researchers also took a similar look at Christie’s work, which (IIRC) showed she used a (relatively) small vocabulary, with sentences of a certain length and other properties that made her writing very accessible to the reader.
