Lazyman's Word Count

March 2026

This blog post is a response to this tweet by Marc Brooker. If, for some reason, you don’t have access to the tweet, here is the problem statement it puts forward (paraphrased):

Suppose you have 5 MB worth of text data and you want to count the words. How would you do it? What about 5 GB, TB, PB, EB? When would your approach change, and why? What if you had to do it once, or once a week, day, hour, or second?

As you might guess, there are a tonne of solutions for each of the scenarios above, ranging from the very simple to the cleverly complex. In this post, we shall go over some of them.


1. 5MB (file)


Using your word processor (there is no reward for suffering)
On my machine, using LibreOffice Writer, I can very easily get the number of words in the file above: 723,984. The only issue is that I can’t tell how long the program took to count them, but who cares, we have our number.


The OG - wc
Finally, a chance to open my terminal; I didn’t attend 4 years of CS school for nothing!
The command is wc -w file.txt. On my old test machine [1], this is instant.
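
For anyone wondering what wc -w actually counts, here is a minimal sketch. It treats every run of non-whitespace characters as one word, so repeated spaces, tabs, and newlines all act as separators. The sample text and the variable names are made up for illustration; they are not from my test file.

```shell
# Build a tiny sample file so the example is self-contained.
sample="$(mktemp)"
printf 'the quick  brown fox\njumps over\tthe lazy dog\n' > "$sample"

# Reading from stdin makes wc print only the number, without the file name.
# tr strips the padding that some wc implementations add.
word_count="$(wc -w < "$sample" | tr -d ' ')"
echo "$word_count"

rm -f "$sample"
```

Note that punctuation stuck to a word ("dog." or "fox,") still counts as part of that word, which is usually fine for a rough count.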


Surely, there are other ways to attack this, but good ol’ wc does the job very well. Next!!!


2. 5GB


tip

I could not find a 5 GB file on the internet to test with, but thankfully, generating one on Linux is easy. Given a list of words in /tmp/cleanwords.txt, I can generate a 5 GB file, testfile.txt, using the following command:

shuf -r /tmp/cleanwords.txt | tr '\n' ' ' | head -c 5368709120 > testfile.txt
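
If you want to try the pipeline before committing 5 GB of disk to it, a scaled-down sketch along these lines should do. It assumes GNU coreutils (shuf is not part of POSIX), and the four-word list is a hypothetical stand-in for cleanwords.txt.

```shell
# Hypothetical small word list standing in for cleanwords.txt.
words="$(mktemp)"
out="$(mktemp)"
printf 'alpha\nbeta\ngamma\ndelta\n' > "$words"

# Same pipeline as the 5 GB command, capped at 1 MiB (1048576 bytes):
# shuf -r emits random lines forever, tr joins them with spaces,
# and head -c stops the stream at the requested byte count.
shuf -r "$words" | tr '\n' ' ' | head -c 1048576 > "$out"

size_bytes="$(wc -c < "$out" | tr -d ' ')"
echo "$size_bytes bytes, $(wc -w < "$out" | tr -d ' ') words"

rm -f "$words" "$out"
```

The 5368709120 in the full-size command is 5 × 1024³ bytes. Since head -c cuts at an exact byte offset, the last word in the file may be truncated, which does not matter for a benchmark.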

wc....again
32.5 s of real time on average, as measured with the mighty time Linux utility. Definitely not WR-shattering numbers, but decent nevertheless. Also, no one said the 5 GB had to be a single file; it could have been multiple files and wc would have worked just fine!
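
To illustrate the multiple-files point, here is a small sketch with two made-up files (part1.txt and part2.txt are hypothetical names). Given several arguments, wc prints one line per file followed by a total line; piping everything through cat gives just the grand total.

```shell
dir="$(mktemp -d)"
printf 'one two three\n' > "$dir/part1.txt"
printf 'four five\n' > "$dir/part2.txt"

# With several file arguments, wc prints a count per file plus a "total" line.
wc -w "$dir"/part*.txt

# Concatenating first yields a single grand total.
total="$(cat "$dir"/part*.txt | wc -w | tr -d ' ')"
echo "$total"

rm -rf "$dir"
```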


3. 5 (TB | PB | EB)


I won’t even waste time trying to benchmark these. I mean, where am I even going to find 5 EB worth of text? 5 TB is achievable though; some machines have NVMe storage in the terabytes.
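
For a rough sense of scale, here is a back-of-the-envelope extrapolation from the 32.5 s measured for the 5 GB file. It rests on a big assumption: that a single wc process keeps the same throughput no matter how much data you feed it, so time grows linearly with size. Each step up (GB to TB to PB to EB) is treated as a factor of 1024, and the results are rounded down.

```shell
# Measured above: about 32.5 s for 5 GB, stored as tenths of a second
# so that plain integer arithmetic works.
base_tenths=325

# Assumption: linear scaling, 1024x per unit step.
tb_secs=$(( base_tenths * 1024 / 10 ))
pb_secs=$(( base_tenths * 1024 * 1024 / 10 ))
eb_secs=$(( base_tenths * 1024 * 1024 * 1024 / 10 ))

tb_hours=$(( tb_secs / 3600 ))
pb_days=$(( pb_secs / 86400 ))
eb_years=$(( eb_secs / 31536000 ))   # 365-day years

echo "5 TB: ~${tb_hours} hours"
echo "5 PB: ~${pb_days} days"
echo "5 EB: ~${eb_years} years"
```

Under those assumptions, 5 TB is an overnight job, 5 PB takes over a year, and 5 EB takes over a thousand years, which is roughly where the lazy approach stops being funny.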


Notable mentions

🤣🤣🤣🤣🤣🤣🤣


Very expensive python approach


Next steps


Now, my answers surely don’t get anyone hired, do they???? If anything, they get you blacklisted on all recruiter lists! In a follow-up post, I will go over some “real” solutions, ranging from single-node to possibly distributed variants. Watch this space!



  1. 16 GB RAM, Intel i5-8265U (8) @ 3.900GHz, Ubuntu 24.04.4 LTS