Word Frequency Analyzer
Paste any text and instantly see which words appear most often. Get a ranked frequency table, an animated bar chart, an interactive word cloud, lexical diversity score, and optional stop-word filtering across 6 languages. Export results as CSV.
About Word Frequency Analyzer
The Word Frequency Analyzer answers a simple question with surprising depth: which words does this text really use the most? Paste any block of prose — a blog post, a transcript, a chapter, a job description, a speech — and it ranks every distinct word by how often it appears, charts the distribution, and renders an interactive word cloud sized by frequency. The tool is built for writers checking for accidental word repetition, SEO specialists looking for natural keyword density, students studying an author's vocabulary, researchers running a quick lexical-diversity sanity check, and translators or linguists exploring an unfamiliar text. Everything runs in your browser or on our server and is never stored.
What makes this analyzer different
- Live preview as you type. The side panel updates unique-word count, total words, TTR (lexical diversity), and the live top 5 instantly — without clicking Analyze. You can iterate filters in seconds.
- Six-language stop-word lists. English, Spanish, French, German, Italian, and Portuguese — curated lists, not bloated dumps. Plus a free-form custom stop-word field for character names, brand names, or boilerplate.
- Square-root scaled word cloud. Many cloud generators size words by raw count, which means the top word can be 50× the height of mid-rank words and visually crushes the cloud. Square-root scaling compresses that range and keeps the cloud readable.
- The top-3 "podium" view. A glance at the gold/silver/bronze cards tells you the words your text leans on hardest — the first thing to check when you suspect accidental repetition.
- Lexical-diversity metrics. Type-Token Ratio and hapax-legomena count give you a richness score, not just a frequency dump. Short prose with TTR > 0.6 is rich; a TTR under 0.2 in a long document is repetitive.
- One-click CSV export. Download or copy the full ranked table for spreadsheet analysis.
How to use this tool
- Paste your text. Up to 200,000 characters, roughly 30,000 English words: about the length of a novella or several long blog posts combined.
- Pick a stop-word language. If you do not filter stop words, the top of the table will be "the", "of", "and" — informative once, never again. Choose the language of your text, or pick None for a true raw frequency count.
- Set a minimum word length. Set to 3 or 4 if you want to skip "a", "I", "it", "no". Set to 1 to keep everything.
- Choose how many results to display. Top 50 is the sweet spot for most prose; Top 500 gives you the full long tail.
- Optional toggles. Turn on case-sensitive if you care about "Paris" vs. "paris". Turn on basic lemmatization to collapse regular inflections such as "runs" and "running" into "run"; irregular forms such as "ran" are not merged. Turn on counting numbers if version numbers, years, and statistics are meaningful in your text.
- Click Analyze. Read the podium, scan the bar-chart table, glance at the cloud, and export the CSV if you want to dig further.
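The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `top_words`, its parameters, and the tiny `ENGLISH_STOP_WORDS` set are all assumptions made for the example.

```python
from collections import Counter

# Assumption: a tiny illustrative list, not the tool's curated one.
ENGLISH_STOP_WORDS = frozenset({"the", "of", "and", "a", "to", "in", "is", "it"})


def top_words(tokens, stop_words=frozenset(), custom_stop_words=(),
              min_length=1, top_n=50):
    """Filter tokens, then rank the survivors by frequency."""
    blocked = set(stop_words) | {w.lower() for w in custom_stop_words}
    kept = [
        t for t in tokens
        if len(t) >= min_length and t.lower() not in blocked
    ]
    return Counter(kept).most_common(top_n)


tokens = "the fox and the hound ran to the edge of the wood and the fox hid".split()

# No filtering: the function word dominates.
raw = top_words(tokens, top_n=3)
# Stop words, one custom stop word, and a minimum length of 3.
filtered = top_words(tokens, ENGLISH_STOP_WORDS, custom_stop_words=["hound"],
                     min_length=3, top_n=3)

print(raw[0])       # ('the', 5)
print(filtered[0])  # ('fox', 2)
```

With no filter the top entry is a function word; with the filter on, the first content word rises to the top, which is the effect the stop-word and minimum-length settings are meant to produce.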
The math behind the metrics
Frequency and percentage
For each distinct word \( w \), the count is the number of times it appears in the kept token list, and the percentage is \( \text{count}(w) / N \) where \( N \) is the kept-token total. The bar width is relative to the most common word so you can see the shape of the distribution at a glance.
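As a rough sketch of the calculation just described, the following Python builds a ranked table with count, percentage of kept tokens, and bar width relative to the most common word. The function name `frequency_table` and the tuple layout are assumptions for illustration only.

```python
from collections import Counter


def frequency_table(tokens):
    """Return (word, count, percent, bar_width) rows ranked by count.

    percent is count / N * 100, where N is the number of kept tokens.
    bar_width is the count relative to the most common word, in percent.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = counts.most_common()
    top_count = ranked[0][1]
    return [
        (word, count, 100 * count / total, 100 * count / top_count)
        for word, count in ranked
    ]


rows = frequency_table(["cat", "dog", "cat", "bird", "cat", "dog"])
for word, count, percent, bar in rows:
    print(f"{word:<5} {count:>2} {percent:5.1f}%  bar={bar:5.1f}%")
```

Here "cat" accounts for 3 of 6 tokens (50%) and gets a full-width bar, while "dog" (2 of 6) gets a bar two-thirds as wide.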
Type-Token Ratio (TTR)
\( \text{TTR} = U / N \) where \( U \) is the number of unique words (types) and \( N \) is the total counted tokens. TTR is the simplest measure of lexical diversity. A short news brief typically sits at 0.5–0.7; a long novel sinks to 0.15–0.25 because common words recur. TTR is length-sensitive: longer texts tend to have lower TTR than shorter ones, so do not compare TTR across documents of wildly different sizes.
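A minimal Python sketch of the ratio, with a small demonstration of the length effect. The function name `type_token_ratio` is an assumption, and returning 0.0 for empty input is a choice made for this example.

```python
def type_token_ratio(tokens):
    """TTR = U / N: unique words divided by total counted tokens."""
    if not tokens:
        return 0.0  # assumption: treat empty input as zero diversity
    return len(set(tokens)) / len(tokens)


sample = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(sample)                # 8 types / 11 tokens
ttr_repeated = type_token_ratio(sample * 10)  # 8 types / 110 tokens

print(round(ttr, 3))           # 0.727
print(round(ttr_repeated, 3))  # 0.073
```

Repeating the same sentence ten times adds tokens but no new types, so the ratio drops by a factor of ten. Real long texts behave less drastically, but the direction is the same.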
Hapax legomena
A hapax legomenon (Greek for "said once") is a word that appears exactly once in the text. The hapax count and hapax percentage are classic signals of vocabulary richness. In Shakespeare's complete works, roughly 14,000 of his 31,000 distinct words are hapax — about 45%. A modern blog post often hits 60% or more hapax because there is not enough text for words to recur.
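The hapax count can be sketched as follows. This example assumes the hapax percentage is taken relative to the number of distinct words, matching the Shakespeare figure above (hapaxes out of distinct words); the function name `hapax_stats` is hypothetical.

```python
from collections import Counter


def hapax_stats(tokens):
    """Return (sorted hapax list, hapax percentage of distinct words)."""
    counts = Counter(tokens)
    hapaxes = sorted(word for word, count in counts.items() if count == 1)
    # Assumption: percentage is relative to distinct words (types).
    percent = 100 * len(hapaxes) / len(counts) if counts else 0.0
    return hapaxes, percent


sample = "the cat sat on the mat and the dog sat too".split()
hapaxes, percent = hapax_stats(sample)

print(hapaxes)  # ['and', 'cat', 'dog', 'mat', 'on', 'too']
print(percent)  # 75.0
```

In this short sample only "the" and "sat" recur, so 6 of the 8 distinct words are hapaxes, which illustrates why short texts show high hapax percentages.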
Word cloud font sizing
The font size for word \( w \) in the cloud uses square-root scaling between the minimum and maximum counts on display:
\( \text{size}(w) = 60\% + 180\% \cdot \dfrac{\sqrt{\text{count}(w)} - \sqrt{\text{min}}}{\sqrt{\text{max}} - \sqrt{\text{min}}} \)
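Taking square roots compresses the dynamic range: a 10× difference in count (200 versus 20 occurrences) becomes a difference of only about 3.2× in \( \sqrt{\text{count}} \), and the 60% floor narrows the rendered gap further. Without this compression, the cloud is dominated by one or two giant words.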
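The sizing formula translates directly into code. The sketch below is an illustration under assumptions: the function name `cloud_font_size` is hypothetical, and the fallback used when every displayed word has the same count (where the formula would divide by zero) is a choice made for this example, not documented behavior.

```python
import math


def cloud_font_size(count, min_count, max_count, base=60.0, span=180.0):
    """Font size in percent: base + span * normalized sqrt(count)."""
    if max_count == min_count:
        # Assumption: use the midpoint when all counts are equal,
        # since the formula's denominator would be zero.
        return base + span / 2
    lo, hi = math.sqrt(min_count), math.sqrt(max_count)
    return base + span * (math.sqrt(count) - lo) / (hi - lo)


for count in (1, 20, 200):
    size = cloud_font_size(count, min_count=1, max_count=200)
    print(f"count={count:>3}  size={size:6.1f}%")
```

With counts ranging from 1 to 200, the rarest word renders at 60%, the most common at 240%, and a word with 20 occurrences at roughly 108%, so the most frequent word stays well under three times the height of a mid-rank word.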
Color-coded frequency tiers
The bars and cloud words are color-coded by rank tier so you can spot the shape of your distribution at a glance.
Use cases
Writers — catching unintended repetition
You will be surprised how often a single word ("quickly", "really", "essentially", a character's name) sneaks to the top of your draft. Paste a chapter and look at the gold-silver-bronze podium. If a content word appears there that you did not consciously emphasize, you have a tic to edit out.
SEO and content marketing
Set the stop-word filter and minimum length, then read the top 25. These are the words a search engine is most likely to associate with your page. If they do not match your target keyword cluster, your on-page SEO may underperform. Avoid keyword stuffing, since unnaturally high density can be penalized. A commonly cited rule of thumb is roughly 1–2% for your main keyword.
Literary study and stylistics
Paste a chapter of Dickens vs. Hemingway and compare TTR, hapax percentage, and average word length. The numerical fingerprints of authorial styles are remarkably consistent across their bodies of work — this is the foundation of computational stylometry.
Speech and transcript analysis
Politicians and CEOs have favorite words. Run a speech through the analyzer with stop words removed and the top 15 reveal the messaging strategy. Compare two speeches by the same speaker to see what shifted.
Translation and language learning
When working on a translation, run the source text first to see which content words dominate. Make sure your translation preserves the same emphasis. For learners, picking a 200-word article and running it with no stop-word filtering shows which function words you need to recognize fluently.
Research and academic writing
Many journals expect a controlled vocabulary in abstracts. A frequency check before submission catches accidental jargon overuse. Researchers running corpus-linguistics studies use frequency lists as the starting input for collocation, n-gram, and topic-modeling work — this tool generates that input.
Recommended settings by document type
| Document | Stop words | Min length | Top N | Lemmatize |
|---|---|---|---|---|
| Blog post / article | English (or your language) | 3 | 50 | Off |
| Novel chapter | English | 3 | 100 | On (collapse "runs"/"running") |
| Academic paper | English | 4 | 100 | On |
| Tweet thread / short post | None | 1 | 25 | Off |
| SEO research | English | 3 | 50 | On |
| Speech transcript | English | 3 | 25 | Off (you want exact phrasing) |
| Foreign-language text | Match the language | 1 | 50 | Off (English-only lemmatizer) |
Frequently asked questions
What counts as a "word"?
The tokenizer matches one or more Unicode letters, optionally joined by apostrophes or hyphens. So "don't", "state-of-the-art", and "l'ovvio" are each one word. Numbers are excluded by default; toggle "Count numbers" on if you want to include them. The tokenizer works across Latin, Cyrillic, Greek, and CJK scripts.
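A tokenizer along these lines can be approximated with Python's `re` module. This is a sketch under assumptions, not the tool's actual pattern: `[^\W\d_]` is used as a stand-in for "Unicode letter", the number pattern and the `tokenize` signature are hypothetical, and lower-casing is applied unless case sensitivity is requested.

```python
import re

# Assumption: [^\W\d_] (word characters minus digits and underscore)
# approximates "Unicode letter" in Python's re module.
_LETTERS = r"[^\W\d_]+"
# Letter runs optionally joined by apostrophes or hyphens.
_WORD = rf"{_LETTERS}(?:['’-]{_LETTERS})*"
# Hypothetical number pattern: digits with optional . or , groups.
_NUMBER = r"\d+(?:[.,]\d+)*"

WORD_RE = re.compile(_WORD)
WORD_OR_NUMBER_RE = re.compile(rf"{_NUMBER}|{_WORD}")


def tokenize(text, count_numbers=False, case_sensitive=False):
    """Split text into word tokens, optionally keeping numbers."""
    pattern = WORD_OR_NUMBER_RE if count_numbers else WORD_RE
    tokens = pattern.findall(text)
    return tokens if case_sensitive else [t.lower() for t in tokens]


text = "Don't miss the state-of-the-art l'ovvio in 2024!"
print(tokenize(text))
# ["don't", 'miss', 'the', 'state-of-the-art', "l'ovvio", 'in']
print(tokenize(text, count_numbers=True)[-1])  # 2024
```

Because the joiner must be followed by more letters, a trailing apostrophe or hyphen is not swallowed into the token.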
What does the basic lemmatizer do, and what does it not do?
It performs three lightweight transformations: it drops the possessive 's, strips common verb endings (-ing, -ed), and reduces simple plurals (-s, -es, -ies → -y). It does not do full morphological lemmatization (better → good, went → go), which would require a dictionary-backed lexicon such as WordNet and is overkill for frequency analysis, where exact word forms are often what you want to see. The conservative approach also avoids the worst stemmer failure mode: collapsing semantically distinct words ("university" and "universe" share a stem under Porter).
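To make the three transformations concrete, here is a deliberately naive suffix stripper in Python. It is a sketch, not the tool's implementation: the function name `basic_lemma`, the minimum-stem-length guards, and the doubled-consonant handling are all assumptions, and like any rule-based stripper it will mishandle some words (for example, "making" becomes "mak").

```python
import re

_VOWEL = re.compile(r"[aeiouy]")


def _undouble(stem):
    """'runn' -> 'run', but leave 'fall', 'pass', 'agree' intact."""
    if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
        return stem[:-1]
    return stem


def basic_lemma(word):
    w = word.lower()
    # 1. Possessive: "author's" -> "author"
    for suffix in ("'s", "’s"):
        if w.endswith(suffix):
            w = w[: -len(suffix)]
            break
    # 2. Verb endings, only if a plausible stem remains
    #    (assumption: 3+ letters containing a vowel, and not "-eed").
    for suffix in ("ing", "ed"):
        if w.endswith(suffix) and not w.endswith("eed"):
            stem = w[: -len(suffix)]
            if len(stem) >= 3 and _VOWEL.search(stem):
                return _undouble(stem)
    # 3. Simple plurals: -ies -> -y, -es after sibilants, plain -s
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith(("sses", "xes", "ches", "shes")):
        return w[:-2]
    if w.endswith("s") and not w.endswith(("ss", "us", "is")) and len(w) > 3:
        return w[:-1]
    return w


for word in ["runs", "running", "walked", "stories", "author's", "ran", "string"]:
    print(word, "->", basic_lemma(word))
```

The regular forms "runs" and "running" both map to "run", while the irregular "ran" passes through unchanged, and the vowel guard keeps "string" from being cut to "str".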
Why do the live preview and the server result differ slightly?
The live preview only filters English stop words client-side to keep the script tiny — other languages get fully filtered on the server. The server also applies basic lemmatization when toggled. The total token count is always the same between the two.
Does the tool handle non-Latin scripts?
Yes. The tokenizer uses Unicode character classes, so letters in Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, and Korean text are all recognized, and space-delimited scripts split into words as expected. Chinese and Japanese do not use spaces between words, so each contiguous run of CJK characters is treated as a single "token"; for true word segmentation in those languages you would need a dedicated tokenizer like jieba (Chinese) or MeCab (Japanese).
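The difference between space-delimited and unsegmented scripts is easy to see with a letter-run pattern. The pattern below is an assumption standing in for the tool's tokenizer; it only illustrates the behavior described above.

```python
import re

# Assumption: [^\W\d_]+ approximates "one or more Unicode letters".
LETTER_RUN = re.compile(r"[^\W\d_]+")

cyrillic = LETTER_RUN.findall("Привет, мир!")
chinese = LETTER_RUN.findall("我喜欢自然语言处理。")

print(cyrillic)  # ['Привет', 'мир']
print(chinese)   # ['我喜欢自然语言处理']
```

The Russian phrase splits into two words at the space, while the Chinese sentence comes back as one nine-character token because nothing separates its words.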
What is the upper limit on text size?
200,000 characters per run, which is about 30,000 English words, far more than a typical novel chapter. Beyond that, browser memory and request size become a concern; split your text into smaller passes.
Is my text private?
Yes. The text is processed in memory to render the result page and is never written to disk. The live mini-stats while you type run entirely in your browser. We do not log, store, or analyze the content you paste.
A short history of word frequency analysis
Word frequency lists are among the oldest tools in linguistics. One of the earliest machine-generated word indexes was Father Roberto Busa's Index Thomisticus (1949–1980), which indexed every word in the Latin works of Thomas Aquinas using IBM punched-card machines and is widely considered the founding project of digital humanities. The Brown Corpus, sampled from American English texts published in 1961, provided the first systematically sampled million-word frequency list of modern American English. Today, every search engine, machine-translation system, large language model, and SEO tool runs on word and token frequency statistics at scale. The same simple Counter-based ranking you see in this tool is the kernel of the field.
Reference this content, page, or tool as:
"Word Frequency Analyzer" at https://MiniWebtool.com// from MiniWebtool, https://MiniWebtool.com/
by miniwebtool team. Updated: May 27, 2026