A vocabulary of one thousand words would allow you to recognize nearly 80% of the words in a given English text (Nation, 2006).
Imagine my thrill, a few months into learning Japanese, when I discovered that this applies to all languages. To be a bit more precise, languages (more or less) follow Zipf's Law, which is to say that a word's frequency is roughly inversely proportional to its rank: the most frequently occurring word appears about twice as often as the second most frequent word, about three times as often as the third, and so forth. When you do the math, you find that memorizing a couple thousand words would seem to yield what can only be described as a ludicrous amount of value.
The key word here may, unfortunately, be "seem."
Those few thousand words will definitely benefit you, and greatly so at that. You'll have gone from knowing nothing of a language to having your foot firmly in the door, but you won't be inside and chilling on the couch quite yet.
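To make "doing the math" concrete, here is a rough sketch of the calculation under an idealized Zipf's Law, where a word's frequency is exactly proportional to 1 divided by its rank. The function names and the vocabulary size of 100,000 are assumptions made for illustration, not figures from Nation (2006), and real text only approximately follows this curve.

```python
def harmonic(n: int) -> float:
    """Return 1/1 + 1/2 + ... + 1/n."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_coverage(known_words: int, vocabulary_size: int) -> float:
    """Share of running text covered by the top `known_words` words,
    if word frequency were exactly proportional to 1 / rank."""
    return harmonic(known_words) / harmonic(vocabulary_size)


# Assumed total vocabulary size, chosen only for illustration.
VOCABULARY_SIZE = 100_000

for known in (100, 1_000, 2_000, 10_000):
    share = zipf_coverage(known, VOCABULARY_SIZE)
    print(f"{known:>6} words -> {share:.0%} of running text")
```

The exact percentages shift with the assumed vocabulary size and with how words are counted, so an idealized curve like this will not reproduce the 80% figure exactly. What it does show is the shape of the payoff: the first thousand words buy far more coverage than any later thousand.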
If you find yourself fretting over vocabulary words, consider this:
What is a word, anyway?
According to Google's Oxford Languages dictionary, the definition of word is:
A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.
That's great, but at what point does something become a "distinct and meaningful" element of speech? You might think that run and runs are distinct, because one clearly has an S and the other doesn't, but that's not necessarily the case. Linguistically speaking, words are grouped into what are called word families, and there are seven commonly-recognized levels of word families.
- At level 1, run and runs indeed belong to different word families
- At level 2, eat, eats, ate, and eating would be seen as belonging to the same word family
- At level 6, nation, national, nationally, nationwide, nations, nationalism, nationalisms, internationalism, internationalisms, internationalisation, nationalist, nationalists, nationalistic, nationalistically, internationalist, internationalists, nationalise, nationalised, nationalising, nationalisation, nationalisations, nationalize, nationalized, nationalizing, nationalization, nationhood, and nationhoods all belong to the same word family (p67)
You might see what's happening here.
Eating and ate are merely forms of the verb to eat, so from level 2 onwards, they're no longer considered to be separate words. As you move up the levels, more and more inflections and affixes become fair game. The idea here is that there is really just one headword (also known as a lemma) which is being modified in predictable ways by a series of recycled inflections and affixes. Just as ate is merely a form of eat, nationalisation is merely a form of the word nation.
The "1,000 words = ~80% text coverage" statement from the beginning of the article assumes you're operating on level 6.
Now, nation is a particularly scary example. Most words aren't like that. Nevertheless, keep in mind that it might take more than 1,000 flashcards to really cover your 1,000 words.
Other things that complicate our word count
There are a number of things that may complicate our one-word-equals-one-flashcard dreams. Consider also:
- Multiword expressions: Many expressions are "fixed": we glue certain words together and then always use them in a certain way. The phrase raise your voice refers to a single specific action. Like an individual word, this phrase is indivisible. If you chop it up, you lose its meaning. Even if you've already learned the words raise, your, and voice, you likely wouldn't understand that raise your voice has to do with anger.
- Collocations and idioms: Certain words commonly appear together. The fact that you know the word train doesn't mean you know that trains run down the tracks, rather than sliding down them. Rain is heavy in English, but it's big in Mandarin. There are many situations in which you'll know a word, but won't be completely certain how it's used.
- Polysemy: Words often have multiple meanings. Consider the word out, for example. It means something different when talking about baseball, an LGBT person, and somebody's location. It's also in tons of phrasal verbs like cry out or bleed out. You're not necessarily done with a word just because you've learned how it works in one single context.
Takeaways for learners:
- The translation of a word is really the bare minimum of things you could know about it. As you become more fluent, you'll want to know not only the words themselves but also their hypernyms and hyponyms, their associative meanings, their collocations, and so much more. A lot of that you'll learn naturally simply by consuming the language.
Text coverage =/= Text comprehension
Let's say you toughed it out and learned enough vocabulary words to recognize 80% of the words in a given text, however many flashcards that ended up being. That's great, right? An 80% is a B, which is a respectable grade. If you understand 4 out of 5 things, you understand most things.
Unfortunately, it doesn't quite work that way. The less frequently occurring 20% of words tend to be more information-dense than their frequently occurring counterparts (see the discussion on page 24). You miss a lot if you don't know them. Consider the following example, in which you know 5 of 6 words:
I'm going to the ___ tonight.
You know everything except where I'm going, which is arguably the most important part.
This is pretty indicative of what initial attempts to read might feel like. You'll often have an idea of what's going on, but your understanding will be spotty: you're missing stuff. Important stuff. Indeed, it's spotty enough that in a 2000 study, zero of 66 students were able to pass a reading comprehension test when 20% of the words of an assigned text were swapped out with nonsense ones.
Hard to believe? Try it yourself.
The paragraph below has been edited by Sinosplice to consist of 20% nonsense words. This is what it feels like to recognize 80% of the words in a text.
“Bingle for help!” you shout. “This loopity is dying!” You put your fingers on her neck. Nothing. Her flid is not weafling. You take out your joople and bingle 119, the emergency number in Japan. There’s no answer! Then you muchy that you have a new befourn assengle. It’s from your gutring, Evie. She hunwres at Tokyo University. You play the assengle. “…if you get this…” Evie says. “…I can’t vickarn now… the important passit is…” Suddenly, she looks around, dingle. “Oh no, they’re here! Cripett… the frib! Wasple them ON THE FRIB!…” BEEP! the assengle parantles. Then you gratoon something behind you…
Now, don't get me wrong. While spotty, the above is an absolutely massive improvement over not understanding a foreign language at all. It's a decisive foot in the door and something to be proud of... but it's also far from ideal. If you'd set out expecting your 2,000 words to be basically good enough, you'd likely be quite disappointed to end up with this result.
Takeaways for learners:
- Don't overestimate the value of 80%. The average length of a sentence in Harry Potter is 12 words. If you recognize 80% of words that are occurring, you're missing 2-3 words per sentence. That's a lot.
- Don't underestimate the value of 80%. It takes 1,000 words to achieve it... but achieving 98% text coverage may take 8,000 more words. Consider following a course early on while you're still getting a lot of bang for your buck, but once things slow down, start focusing more on your personal needs.
- Here's a text-dump of the unique characters appearing in Harry Potter and the Sorcerer's Stone. Of ~1,800 total unique characters, ~80 appear more than 100 times and ~250 appear only once. Use your best judgment about which words you should learn now and which you should simply look up and move on from.
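The arithmetic behind the first two takeaways can be sketched in a few lines. The 12-word sentence length and the 80% and 98% coverage levels come from the text above. The function names are made up for this example, and the second function assumes unknown words are scattered independently through a text, which is a simplification.

```python
def unknown_words_per_sentence(sentence_length: float, coverage: float) -> float:
    """Expected number of unrecognized words in a sentence."""
    return sentence_length * (1.0 - coverage)


def chance_sentence_fully_known(sentence_length: int, coverage: float) -> float:
    """Probability that every word in a sentence is recognized.

    Assumes each word is known independently with probability `coverage`,
    a simplification: real unknown words tend to cluster by topic.
    """
    return coverage ** sentence_length


SENTENCE_LENGTH = 12  # average sentence length cited above for Harry Potter

for coverage in (0.80, 0.98):
    missing = unknown_words_per_sentence(SENTENCE_LENGTH, coverage)
    clean = chance_sentence_fully_known(SENTENCE_LENGTH, coverage)
    print(f"{coverage:.0%} coverage: ~{missing:.1f} unknown words per sentence, "
          f"{clean:.0%} of sentences with no unknown words")
```

Under these assumptions, the jump from 80% to 98% coverage is the difference between stumbling in nearly every sentence and reading most sentences without interruption.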
You should be practicing by doing as much as possible
Vocabulary is domain-specific, which is to say that a given vocabulary word is more likely to appear in some places than others. It'll be easier to see the implications of that statement if we approach it in a hands-on fashion.
Here is a list of the 10,000 Japanese words used most frequently across 10 years of Asahi Shinbun newspapers, dubbed the Core 10k. Here's a list of the Japanese words used most frequently in Japanese Wikipedia articles.
Open both in different tabs and just compare. (Both are quite large, so give them a minute to load.) Here are a few things you might notice:
- Words like ページ (page)、編集 (edit) and 削除 (delete) are within Japanese's top 50 words, according to Wikipedia. That should make sense! These words appear multiple times on every single Wikipedia page.
- Numbers are the most common words according to the Core 10k, which should also make sense — every article in a newspaper includes a date.
- Wikipedia considers grammatical particles like の and に to be words, but the Core 10k does not. These would be your first flashcards according to Wikipedia, but would never appear independently in the Core 10k.
- The word どうぞ is #56 in the Core 10k... but #2,699 in the Wikipedia list.
These differences occur because of the scope of topics discussed in both mediums and the conventions of how language gets used in each.
To give an example of what that means for learners: none of the variations of お疲れ (what you say to colleagues after a day's work) ever appear in either list. That should make sense. お疲れ (otsukare) is a word used in speech/dialogue, but the language used in the Asahi newspaper and on Wikipedia is a more formal written register. You'll bid farewell to your colleagues every single day after work, but there's probably never a situation in which you'll do so in the middle of an article on fiscal policy.
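If you'd rather compare two frequency lists programmatically than by eye, a minimal sketch might look like the following. The two short lists are made-up stand-ins whose order does not reflect the real Core 10k or Wikipedia rankings, the function names are hypothetical, and the code assumes each list contains no duplicate entries.

```python
def rank_table(words):
    """Map each word to its 1-based rank in a frequency-ordered list."""
    return {word: rank for rank, word in enumerate(words, start=1)}


def compare_ranks(list_a, list_b):
    """Return the ranks of shared words plus the words unique to each list."""
    ranks_a, ranks_b = rank_table(list_a), rank_table(list_b)
    shared = {w: (ranks_a[w], ranks_b[w]) for w in list_a if w in ranks_b}
    only_a = [w for w in list_a if w not in ranks_b]
    only_b = [w for w in list_b if w not in ranks_a]
    return shared, only_a, only_b


# Toy data for illustration only: NOT the real rankings.
newspaper_list = ["一", "年", "どうぞ", "日"]
wikipedia_list = ["の", "に", "ページ", "年", "編集"]

shared, only_newspaper, only_wikipedia = compare_ranks(newspaper_list, wikipedia_list)
print("Shared words (newspaper rank, Wikipedia rank):", shared)
print("Only in the newspaper list:", only_newspaper)
print("Only in the Wikipedia list:", only_wikipedia)
```

Run against the full lists, the "only in" outputs would surface domain-specific words such as ページ, and a word like お疲れ would show up in neither list at all.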
Takeaways for learners:
- To ensure that you're developing the skills you need to do what you want, practice by doing. You'll be exposed to exactly the language needed to do that particular task in your target language.
- Expect growing pains when you hop mediums. The fact that you can comfortably handle daily conversations doesn't mean that you can give a business presentation. The fact that you can comfortably read self-development books doesn't mean that you can comfortably read science-fiction novels. Each new task and content medium comes with its own learning curve.
Why you shouldn't worry about all this that much
Consider native English speakers who make (or made) a living with their language: authors like Stephen King or Amy Hempel; comedians like George Carlin or Kevin Hart. Perhaps you've read poetry by Margaret Atwood or Maya Angelou, or watched movies directed by Martin Scorsese or Wes Anderson.
These people did not slave away at their desks until they were perfect, only then bursting into stardom. Stephen King, in fact, does not like the first story he published. That should make sense. His first story was his first story, and now he's had a lifetime to continue writing and improve upon his craft.
Learning a foreign language is no different. You've got to use it.
So, do learn those first thousand or two thousand words... but know that your goal with these initial efforts isn't fluency. You're simply striving to reach a point where you can begin stumbling through a conversation or your first book, and from there, by doing those things, you'll begin building the unique skills you need to do the things that are important to you.
And from there, after months or (more likely) years of doing the things that are enjoyable and meaningful to you, you'll eventually come to the fluency you're looking for.