What does this tool do?

It reads any block of text, counts every distinct word, and ranks them from most to least frequent. You see the ranked table, an animated bar chart sized to the top word, a word cloud with sqrt-scaled font sizes, and lexical-diversity stats. You can download the results as CSV.

Why filter stop words?

In English the top 5 words ('the', 'of', 'and', 'a', 'to') drown out everything that actually carries meaning. Stop-word lists strip these high-frequency function words so the report surfaces the words your text is really about. We ship curated lists for English, Spanish, French, German, Italian, and Portuguese, and you can add custom stops such as character names or product brands.

What is the Type-Token Ratio?

Type-Token Ratio (TTR) = unique words / total words. It is a classic measure of lexical diversity. A short news article runs around 0.5–0.7; a long novel often falls to 0.15–0.25 because common words recur. A higher TTR means a more varied vocabulary, but only compare texts of similar length.

Why is the word cloud sized by square root, not by count?

If the top word appears 200 times and the next word 20 times, linear sizing makes the top word 10× the height, which visually crushes the cloud and leaves everything else unreadable. Square-root scaling compresses the difference to roughly 3× (√200 / √20 ≈ 3.2), so the cloud stays readable while still emphasizing dominant words. Many word-cloud tools apply a similar compression.

Word Frequency Analyzer

Paste any text and instantly see which words appear most often. Get a ranked frequency table, an animated bar chart, an interactive word cloud, lexical diversity score, and optional stop-word filtering across 6 languages. Export results as CSV.

About Word Frequency Analyzer

The Word Frequency Analyzer answers a simple question with surprising depth: which words does this text really use the most? Paste any block of prose — a blog post, a transcript, a chapter, a job description, a speech — and it ranks every distinct word by how often it appears, charts the distribution, and renders an interactive word cloud sized by frequency. The tool is built for writers checking for accidental word repetition, SEO specialists looking for natural keyword density, students studying an author's vocabulary, researchers running a quick lexical-diversity sanity check, and translators or linguists exploring an unfamiliar text. Everything runs in your browser or on our server and is never stored.

What makes this analyzer different

Live preview as you type. The side panel updates unique-word count, total words, TTR (lexical diversity), and the live top 5 instantly — without clicking Analyze. You can iterate filters in seconds.
Six-language stop-word lists. English, Spanish, French, German, Italian, and Portuguese — curated lists, not bloated dumps. Plus a free-form custom stop-word field for character names, brand names, or boilerplate.
Square-root scaled word cloud. Most cloud generators size words by raw count, which means the top word can be 50× the height of mid-rank words and visually crushes the cloud. Sqrt scaling keeps the cloud readable and is the industry-standard approach since Wordle (2009).
The top-3 "podium" view. A glance at the gold/silver/bronze cards tells you the words your text leans on hardest — the first thing to check when you suspect accidental repetition.
Lexical-diversity metrics. Type-Token Ratio and hapax-legomena count give you a richness score, not just a frequency dump. Short prose with TTR > 0.6 is rich; a TTR under 0.2 in a long document is repetitive.
One-click CSV export. Download or copy the full ranked table for spreadsheet analysis.
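
The CSV export mentioned in the list above could be produced along these lines. This is only a sketch using Python's standard `csv` module: the function name `to_csv` and the column names `rank`, `word`, and `count` are assumptions for illustration, not the tool's documented export format.

```python
import csv
import io

def to_csv(rows):
    """Serialize (word, count) pairs, already sorted by count, to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "word", "count"])  # hypothetical column names
    for rank, (word, count) in enumerate(rows, start=1):
        writer.writerow([rank, word, count])
    return buffer.getvalue()

print(to_csv([("cat", 3), ("mat", 2)]))
```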

How to use this tool

Paste your text. Up to 200,000 characters, roughly 30,000 words: several novel chapters or a batch of blog posts combined.
Pick a stop-word language. If you do not filter stop words, the top of the table will be "the", "of", "and" — informative once, never again. Choose the language of your text, or pick None for a true raw frequency count.
Set a minimum word length. Set to 3 or 4 if you want to skip "a", "I", "it", "no". Set to 1 to keep everything.
Choose how many results to display. Top 50 is the sweet spot for most prose; Top 500 gives you the full long tail.
Optional toggles. Turn on case-sensitive if you care about "Paris" vs. "paris". Turn on basic lemmatization to collapse regular inflections such as "runs" into "run" (irregular forms such as "ran" are not handled). Turn on counting numbers if version numbers, years, and statistics are meaningful in your text.
Click Analyze. Read the podium, scan the bar-chart table, glance at the cloud, and export the CSV if you want to dig further.
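
The steps above can be sketched in Python. This is a minimal illustration rather than the tool's actual code: the function name `analyze`, its parameters, and the tiny `STOP_WORDS` set are assumptions made for the example, and tokenization is simplified to runs of letters.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; the tool's curated lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it"}

def analyze(text, stop_words=STOP_WORDS, min_length=3, top_n=50,
            case_sensitive=False):
    """Return the top_n (word, count) pairs after filtering."""
    if not case_sensitive:
        text = text.lower()
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    kept = [t for t in tokens
            if len(t) >= min_length and t.lower() not in stop_words]
    return Counter(kept).most_common(top_n)

sample = "The cat sat on the mat. The cat saw a cat and the mat."
print(analyze(sample, top_n=2))  # [('cat', 3), ('mat', 2)]
```

Passing `stop_words=set()` corresponds to picking None for a raw frequency count.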

The math behind the metrics

Frequency and percentage

For each distinct word \( w \), the count is the number of times it appears in the kept token list, and the percentage is \( \text{count}(w) / N \) where \( N \) is the kept-token total. The bar width is relative to the most common word so you can see the shape of the distribution at a glance.
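
As a rough sketch of the calculation just described, assuming the kept tokens are already available as a list (the function name `frequency_rows` and the rounding to one decimal are choices made for this example):

```python
from collections import Counter

def frequency_rows(kept_tokens):
    """Return (word, count, percent, bar_width_percent) rows, most frequent first."""
    counts = Counter(kept_tokens)
    total = sum(counts.values())            # N, the kept-token total
    if total == 0:
        return []
    top_count = counts.most_common(1)[0][1]  # count of the most common word
    rows = []
    for word, count in counts.most_common():
        percent = 100 * count / total        # share of all kept tokens
        bar_width = 100 * count / top_count  # relative to the top word
        rows.append((word, count, round(percent, 1), round(bar_width, 1)))
    return rows

rows = frequency_rows(["cat", "cat", "cat", "cat", "mat", "mat", "sat", "saw"])
for row in rows:
    print(row)
```

The percentages sum to 100 across all rows, while the bar widths do not: the top word is always 100 and every other bar is a fraction of it.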

Type-Token Ratio (TTR)

\( \text{TTR} = U / N \) where \( U \) is the number of unique words (types) and \( N \) is the total counted tokens. TTR is the simplest measure of lexical diversity. A short news brief typically sits at 0.5–0.7; a long novel sinks to 0.15–0.25 because common words recur. TTR is length-sensitive: as a text grows, common words recur and TTR tends to fall, so do not compare TTR across documents of very different lengths.
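
A small Python sketch makes the length sensitivity concrete. The function name `type_token_ratio` is an assumption for this example, and repeating a sentence ten times is a deliberately artificial way to grow a text without adding vocabulary.

```python
def type_token_ratio(tokens):
    """TTR = U / N: unique tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

short = "the quick brown fox jumps over the lazy dog".split()
longer = short * 10   # same vocabulary, ten times the length

print(type_token_ratio(short))   # 8 unique / 9 total
print(type_token_ratio(longer))  # 8 unique / 90 total
```

Real texts add new words as they grow, so the drop is less extreme than in this toy case, but the direction is the same.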

Hapax legomena

A hapax legomenon (Greek for "said once") is a word that appears exactly once in the text. The hapax count and hapax percentage are classic signals of vocabulary richness. In Shakespeare's complete works, roughly 14,000 of his 31,000 distinct words are hapax — about 45%. A modern blog post often hits 60% or more hapax because there is not enough text for words to recur.
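
Counting hapaxes is a one-liner on top of a frequency table. In this sketch the percentage is taken over distinct words, which matches the Shakespeare figure above (hapaxes out of distinct words); whether the tool uses the same denominator is an assumption, as is the function name `hapax_stats`.

```python
from collections import Counter

def hapax_stats(tokens):
    """Return (hapax_count, hapax_percent_of_distinct_words)."""
    counts = Counter(tokens)
    hapax_count = sum(1 for c in counts.values() if c == 1)
    distinct = len(counts)
    percent = 100 * hapax_count / distinct if distinct else 0.0
    return hapax_count, percent

tokens = "to be or not to be that is the question".split()
print(hapax_stats(tokens))  # (6, 75.0): only "to" and "be" repeat
```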

Word cloud font sizing

The font size for word \( w \) in the cloud uses square-root scaling between the minimum and maximum counts on display:

\( \text{size}(w) = 60\% + 180\% \cdot \dfrac{\sqrt{\text{count}(w)} - \sqrt{\text{min}}}{\sqrt{\text{max}} - \sqrt{\text{min}}} \)

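This compresses the dynamic range: a word that appears 200 times has a square root only about 3.2 times that of a word that appears 20 times, instead of a 10-fold difference in raw counts. Sizes run from 60% for the least frequent word on display to 240% for the most frequent. Without this compression, the cloud is dominated by one or two giant words.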
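
The sizing formula translates directly into Python. This is a sketch under stated assumptions: the function and parameter names are invented for the example, and the behavior when every displayed word has the same count (where the formula would divide by zero) is a guess, not documented tool behavior.

```python
import math

def cloud_font_size(count, min_count, max_count, floor=60.0, span=180.0):
    """Font size in percent, square-root scaled between min_count and max_count."""
    if max_count == min_count:
        # Assumption: the formula is undefined here, so fall back to the floor.
        return floor
    scale = (math.sqrt(count) - math.sqrt(min_count)) / (
        math.sqrt(max_count) - math.sqrt(min_count))
    return floor + span * scale

# Displayed counts ranging from 1 to 200
for c in (1, 20, 200):
    print(c, round(cloud_font_size(c, 1, 200), 1))
```

With counts from 1 to 200, the word at 20 lands near 108% and the word at 200 at 240%, so the rendered height ratio is a little over 2× because the 60% floor compresses the range further than the square root alone.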

Color-coded frequency tiers

The bars and cloud words are color-coded by rank tier so you can spot the shape of your distribution at a glance:

Tier 1 (ranks 1–5): The 5 words your text leans on hardest. If a content word lands here, that is your theme.

Tier 2 (ranks 6–15): The supporting cast. Recurring nouns and verbs you use to develop the main idea.

Tier 3 (ranks 16–40): The wider vocabulary surrounding your top themes.

Tier 4 (ranks 41–100): Specialist or specific terms: proper nouns, jargon, named entities.

Tier 5 (ranks 101+): The long tail. Words used once or twice. Often where the most interesting vocabulary lives.
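
The tier boundaries above amount to a simple lookup on rank. A minimal sketch, with the function name `rank_tier` chosen for this example:

```python
def rank_tier(rank):
    """Map a 1-based frequency rank to its tier number (1 to 5)."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # Upper rank bound for tiers 1 through 4; anything beyond is tier 5.
    for tier, upper in enumerate((5, 15, 40, 100), start=1):
        if rank <= upper:
            return tier
    return 5

print([rank_tier(r) for r in (1, 5, 6, 15, 16, 40, 41, 100, 101)])
# [1, 1, 2, 2, 3, 3, 4, 4, 5]
```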

Use cases

Writers — catching unintended repetition

You will be surprised how often a single word ("quickly", "really", "essentially", a character's name) sneaks to the top of your draft. Paste a chapter and look at the gold-silver-bronze podium. If a content word appears there that you did not consciously emphasize, you have a tic to edit out.

SEO and content marketing

Set the stop-word filter and minimum length, then read the top 25. These are the words your page emphasizes most, and they should match your target keyword cluster; if they do not, the page is probably not focused on the topic you intend to rank for. Avoid keyword stuffing, because search engines treat unnatural density as a spam signal. A commonly cited rule of thumb is roughly 1–2% for your main keyword.
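
Keyword density for a single-word keyword can be checked with a few lines of Python. In this sketch the density is measured against all tokens, before any stop-word filtering; that denominator, like the function name `keyword_density`, is an assumption for the example.

```python
import re

def keyword_density(text, keyword):
    """Percentage of tokens in text that equal keyword (case-insensitive)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return 100 * tokens.count(keyword.lower()) / len(tokens)

# 100 words, 2 of them "coffee"
text = " ".join((["coffee"] + ["filler"] * 49) * 2)
print(keyword_density(text, "coffee"))  # 2.0
```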

Literary study and stylistics

Paste a chapter of Dickens vs. Hemingway and compare TTR, hapax percentage, and average word length. The numerical fingerprints of authorial styles are remarkably consistent across their bodies of work — this is the foundation of computational stylometry.

Speech and transcript analysis

Politicians and CEOs have favorite words. Run a speech through the analyzer with stop words removed and the top 15 reveal the messaging strategy. Compare two speeches by the same speaker to see what shifted.

Translation and language learning

When working on a translation, run the source text first to see which content words dominate. Make sure your translation preserves the same emphasis. For learners, picking a 200-word article and running it with no stop-word filtering shows which function words you need to recognize fluently.

Research and academic writing

Many journals expect a controlled vocabulary in abstracts. A frequency check before submission catches accidental jargon overuse. Researchers running corpus-linguistics studies use frequency lists as the starting input for collocation, n-gram, and topic-modeling work — this tool generates that input.

Recommended settings by document type

Document	Stop words	Min length	Top N	Lemmatize
Blog post / article	English (or your language)	3	50	Off
Novel chapter	English	3	100	On (collapses regular inflections such as "runs" → "run")
Academic paper	English	4	100	On
Tweet thread / short post	None	1	25	Off
SEO research	English	3	50	On
Speech transcript	English	3	25	Off (you want exact phrasing)
Foreign-language text	Match the language	1	50	Off (English-only lemmatizer)

Frequently asked questions

What counts as a "word"?

The tokenizer matches one or more Unicode letters, optionally joined by apostrophes or hyphens. So don't, state-of-the-art, and l'ovvio are each one word. Numbers are excluded by default — toggle "Count numbers" on if you want to include them. The tokenizer works across Latin, Cyrillic, Greek, and CJK scripts.
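
A tokenizer along these lines can be approximated with Python's built-in `re` module. This is a sketch, not the tool's actual pattern: `re` has no `\p{L}` class, so `[^\W\d_]` stands in for "any Unicode letter", and the names `WORD_RE`, `WORD_OR_NUMBER_RE`, and `tokenize` as well as the number pattern are assumptions.

```python
import re

LETTERS = r"[^\W\d_]+"  # one or more Unicode letters (no digits, no underscore)
# Letter runs optionally joined by an apostrophe or hyphen
WORD_RE = re.compile(rf"{LETTERS}(?:['’\-]{LETTERS})*")
# Same, plus digit runs with optional . or , separators
WORD_OR_NUMBER_RE = re.compile(
    rf"{LETTERS}(?:['’\-]{LETTERS})*|\d+(?:[.,]\d+)*")

def tokenize(text, count_numbers=False):
    """Return the list of word tokens, optionally including numbers."""
    pattern = WORD_OR_NUMBER_RE if count_numbers else WORD_RE
    return pattern.findall(text)

text = "Don't miss the state-of-the-art l'ovvio in 2024."
print(tokenize(text))
print(tokenize(text, count_numbers=True))
print(tokenize("我喜欢编程 every day"))  # the CJK run comes back as one token
```

Because the pattern only looks for contiguous letters, an unspaced run of Chinese or Japanese characters is returned as a single token.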

What does the basic lemmatizer do, and what does it not do?

It performs three lightweight transformations: it drops the possessive 's, strips common verb endings (-ing, -ed), and reduces simple plurals (-s, -es, -ies → -y). It does not do full morphological lemmatization (better → good, went → go). Full lemmatization would require shipping the WordNet lexicon and is overkill for frequency analysis where exact word forms are often what you want to see. The conservative approach also avoids the worst stemmer failure mode: collapsing semantically distinct words ("university" and "universe" share a stem under Porter).
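
The three transformations can be illustrated with a few suffix rules. This is a rough sketch, not the tool's implementation: the function name `basic_lemmatize`, the rule order, and the length guards are all assumptions chosen to keep the example short.

```python
def basic_lemmatize(word):
    """Apply conservative suffix rules; irregular forms pass through unchanged."""
    w = word.lower()
    for suffix in ("'s", "’s"):           # possessive
        if w.endswith(suffix):
            w = w[:-2]
            break
    if len(w) > 4 and w.endswith("ies"):  # cities -> city
        return w[:-3] + "y"
    if len(w) > 5 and w.endswith("ing"):  # walking -> walk
        return w[:-3]
    if len(w) > 4 and w.endswith("ed"):   # walked -> walk
        return w[:-2]
    if len(w) > 4 and w.endswith(("ches", "shes", "sses", "xes", "zes")):
        return w[:-2]                     # boxes -> box
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]                     # runs -> run
    return w

for word in ("runs", "cities", "boxes", "walked", "walking", "writer's", "went"):
    print(word, "->", basic_lemmatize(word))
```

Rules this simple leave artifacts: in this sketch "running" becomes "runn" and "speed" becomes "spe", which is one reason a production lemmatizer needs extra guards or an exception list.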

Why do the live preview and the server result differ slightly?

The live preview only filters English stop words client-side to keep the script tiny — other languages get fully filtered on the server. The server also applies basic lemmatization when toggled. The total token count is always the same between the two.

Does the tool handle non-Latin scripts?

Yes — the tokenizer uses Unicode character classes, so Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, and Korean text all tokenize correctly. Chinese and Japanese do not use spaces between words, so each contiguous run of CJK characters is treated as a single "token" — for true word segmentation in those languages you would need a dedicated tokenizer like jieba (Chinese) or MeCab (Japanese).

What is the upper limit on text size?

200,000 characters per run, about 30,000 English words, which is several typical novel chapters. Beyond that, browser memory and request size become a concern; split your text into smaller passes.

Is my text private?

Yes. The text is processed in memory to render the result page and is never written to disk. The live mini-stats while you type run entirely in your browser. We do not log, store, or analyze the content you paste.

A short history of word frequency analysis

Word frequency lists are among the oldest tools in linguistics. One of the earliest machine-assisted word-indexing projects was Father Roberto Busa's Index Thomisticus (1949–1980), which indexed every word in the Latin works of Thomas Aquinas using IBM punched-card machines and is widely considered the founding project of digital humanities. The Brown Corpus, built from texts published in 1961, provided the first systematically sampled million-word frequency list of modern American English. Today, search engines, machine-translation systems, large language models, and SEO tools all rely on word and token frequency statistics at scale. The same simple Counter-based ranking you see in this tool is the kernel of the field.

Reference this content, page, or tool as:

"Word Frequency Analyzer" at https://MiniWebtool.com/word-frequency-analyzer/ from MiniWebtool, https://MiniWebtool.com/

by miniwebtool team. Updated: May 27, 2026

Developer API available: Run this tool from your app, automation, or agent with one JSON HTTP request.
