
T-units and n-grams.


“To understand a sentence means to understand a language. To understand a language means to be master of a technique.”

— Wittgenstein, Philosophical Investigations


YOU DON’T HAVE to look very far to find bad writing advice — simply note appeals for clarity and readability in our academic writing. However well-meaning, these appeals do a disservice to writers who target academic expertise, as those features which reliably predict reading comprehension have either no connection to text quality or are more closely associated – by both expert readers and computational analysis – with low-rated or novice texts.

The linguistic features which do correlate with successful academic writing reliably predict quite the opposite of clarity and readability. That is, a successful sentence in the academic context is difficult by virtue of the fact that it embeds both complex syntax and a sophisticated lexicon.

The best way to gauge syntactic complexity is to start with something called the T-unit. A term coined by Kellogg Hunt in 1965, the T-unit is the shortest grammatical unit that can stand alone as a sentence: a main clause together with any subordinate clauses attached to it. Sometimes a T-unit coincides with the punctuated sentence, as in “The cat sat on the mat/ because it was hungry.” But more often than not, a punctuated sentence holds more than one T-unit, as in “The cat sat on the mat/ because it was hungry (1) but its owner didn’t come home (2).” Any coordinate clauses are counted as separate T-units.

Research in applied linguistics, syntax, and second language writing has consistently found that longer T-units with more clauses, such as “The cat sat on the mat/ that had been left on the doorstep/ although it was uncomfortable,” are more highly rated than shorter T-units with fewer clauses, such as “The cat sat on the mat/ although it was uncomfortable.” A recent study showed, for example, a mean difference of about three words between high- and low-rated T-units.

Likewise, the longer the mean length of the clauses within a particular T-unit, the more successful the punctuated sentence will be: “The wet cat gracefully sat on the dry mat/ after the torrential spring rain” is better than “The cat sat on the mat/ after the rain.” The number of words before the main verb in an independent clause is also indicative of quality: sentences with more words before the main verb, as in “From a theoretical perspective, the cat is,” are more highly rated than sentences with fewer words before the main verb, as in “The cat sat.”
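To make these measures concrete, here is a minimal Python sketch that computes mean T-unit length, clauses per T-unit, and mean clause length. It assumes the T-units have already been segmented by hand, with "/" marking clause boundaries as in the examples above. The function names and the simple word-matching rule are illustrative assumptions, not a standard tool, and the script makes no attempt to find clause boundaries on its own.

```python
import re


def words(text):
    """Return word tokens, keeping internal apostrophes and hyphens together."""
    return re.findall(r"[A-Za-z]+(?:['’-][A-Za-z]+)*", text)


def syntactic_measures(t_units):
    """Compute three length-based measures for hand-segmented T-units.

    Each T-unit is a string in which "/" separates clauses (an assumed
    annotation convention borrowed from the examples in the text).
    """
    clause_lists = [
        [clause.strip() for clause in t_unit.split("/") if clause.strip()]
        for t_unit in t_units
    ]
    clause_lists = [clauses for clauses in clause_lists if clauses]
    if not clause_lists:
        raise ValueError("at least one non-empty T-unit is required")

    n_t_units = len(clause_lists)
    n_clauses = sum(len(clauses) for clauses in clause_lists)
    n_words = sum(len(words(c)) for clauses in clause_lists for c in clauses)

    return {
        "mean_t_unit_length": n_words / n_t_units,   # words per T-unit
        "clauses_per_t_unit": n_clauses / n_t_units,
        "mean_clause_length": n_words / n_clauses,   # words per clause
    }


longer = ["The cat sat on the mat/ that had been left on the doorstep/ although it was uncomfortable"]
shorter = ["The cat sat on the mat/ although it was uncomfortable"]

print(syntactic_measures(longer))
print(syntactic_measures(shorter))
```

On these two examples the longer T-unit comes out at 17 words across three clauses and the shorter at 10 words across two, which is the direction of difference the research describes. The segmentation itself is the hard part and is left to the annotator.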

Complex noun groups, however, mark the most noteworthy difference between successful and unsuccessful sentences. Noun groups, or nominals, are nouns modified by other words. They include possessives (“the cat’s grey whiskers”), participles (“the purring cat”), adjectives (“the very fat cat”), prepositional modifiers (“the night habits of cats on mats”), and relative clauses (“the cats that drank and stank”). Studies show that low-rated sentences contain fewer possessives, participles, prepositional modifiers, and relative clauses than more successful sentences. The length of the nominal group is also significant, with longer phrases such as “the Union of Cats with Short Whiskers” scoring more highly than shorter phrases such as “the Union of Cats.”

Like syntactic complexity, lexical sophistication is multidimensional. It involves the number of words per sentence (good sentences have more), the use of a diverse range of words in a sentence (good sentences have more diverse words), the use of rare or low-frequency words, the use of n-grams, the use of longer words, and more frequent use of abstract words.

Low-frequency words are words that occur less often in any given context, as in “The felines convened on the floor-covering.” These words are judged by expert readers to be more sophisticated than commonly occurring words like “cat,” “sat,” and “mat.” N-grams are sequences of words that typically work together in the academic context. Note the n-grams “as can be seen” and “in order to” in the following sentence: “As can be seen, the cat sat on the mat in order to nap.” N-grams span all parts of speech and are ubiquitous in high-quality sentences. Longer words, as in “the cat simultaneously sat,” are also much more highly valued than shorter words, as are abstract words such as “the intuitive cat,” “the credible cat,” and “the impossible mat.” There is also some evidence that more discipline-specific words are deployed in highly rated academic sentences.
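Several of these lexical measures are easy to approximate. The Python sketch below counts words per sentence, lexical diversity (as a simple type-token ratio), mean word length, and occurrences of a few academic n-grams. The n-gram list holds only the two phrases from the example sentence and is a placeholder assumption, as are the function names. Word frequency and abstractness are left out because they need a reference corpus or rating list that this article does not supply.

```python
import re

# Illustrative list only: the two n-grams named in the example sentence.
ACADEMIC_NGRAMS = ["as can be seen", "in order to"]


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())


def count_ngram(tokens, phrase):
    """Count how many times the words of `phrase` occur as a run in `tokens`."""
    target = phrase.lower().split()
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def lexical_measures(sentence, ngrams=ACADEMIC_NGRAMS):
    """Return simple lexical sophistication measures for one sentence."""
    tokens = tokenize(sentence)
    if not tokens:
        raise ValueError("sentence contains no words")
    return {
        "word_count": len(tokens),
        # Distinct words divided by total words: a rough diversity measure.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "ngram_hits": {phrase: count_ngram(tokens, phrase) for phrase in ngrams},
    }


print(lexical_measures("As can be seen, the cat sat on the mat in order to nap."))
print(lexical_measures("The cat sat on the mat."))
```

The type-token ratio is sensitive to text length, so it is only meaningful here when comparing sentences of similar size; it stands in for the more robust diversity measures used in the research.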

So what does all of this mean for writers and teachers of writing? It means that “The establishment of the Union of Cats that sit on Mats was an important development in the historic struggle for feline rights” is always going to score more highly in the academic context than, say, “The cat sat on the mat.” To suggest that academic writing is anything other than syntactically complex and lexically rich is to deny writers access to a proper understanding of the academic sentence. And this, in turn, denies writers access to the mastery they seek.

Davina Allison has a PhD in Text Linguistics and teaches at the Queensland University of Technology. Her poems have been published in journals such as the Australian Book Review, Poetry Scotland, The Australian Poetry Journal, The Glasgow Review of Books, The Well Review, and The London Magazine.
