Google Ngram Viewer

[Figure: Example of an Ngram query]

The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings, using a yearly count of n-grams found in printed sources published between 1500 and 2019. [1] [2] [3] [4] It draws on Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, and Spanish, [2] [5] as well as some specialized English corpora, such as American English, British English, and English Fiction. [6]


The program can search for a word or a phrase, including misspellings or gibberish. [5] The n-grams are matched against the text of the selected corpus, optionally with case-sensitive matching (comparing the exact use of uppercase letters), [7] and, if found in 40 or more books, are displayed as a graph. [8] The Google Ngram Viewer supports searches for parts of speech and wildcards. [6] It is routinely used in research. [9] [10]
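
The query syntax, including part-of-speech tags such as _VERB and the * wildcard, can also be exercised programmatically. The sketch below is a minimal example against the undocumented JSON endpoint that the web interface calls internally; the URL, parameter names, corpus identifier, and response shape are assumptions based on informal observation, not a supported Google API.

```python
import json
import urllib.parse
import urllib.request

# Unofficial endpoint observed behind the public Ngram Viewer UI;
# Google does not document or guarantee it (assumption).
NGRAM_URL = "https://books.google.com/ngrams/json"

def fetch_ngrams(query, year_start=1800, year_end=2019,
                 corpus="en-2019", smoothing=3):
    """Fetch yearly relative frequencies for comma-separated n-gram queries."""
    params = urllib.parse.urlencode({
        "content": query,        # e.g. "tackle_VERB, tackle_NOUN" or "* car"
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,        # corpus identifier as seen in viewer URLs
        "smoothing": smoothing,
    })
    with urllib.request.urlopen(f"{NGRAM_URL}?{params}") as resp:
        # Response is observed to be a list of series, each with an
        # "ngram" label and a "timeseries" of yearly relative frequencies.
        return json.load(resp)

# Part-of-speech tags (_VERB, _NOUN) are part of the viewer's documented
# query syntax; here they separate verb and noun uses of "tackle".
for series in fetch_ngrams("tackle_VERB, tackle_NOUN"):
    print(series["ngram"], series["timeseries"][:3])
```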

History

The program was developed by Jon Orwant and Will Brockman and released in mid-December 2010. [2] [3] It was inspired by a prototype called Bookworm created by Jean-Baptiste Michel and Erez Aiden from Harvard's Cultural Observatory, Yuan Shen from MIT, and Steven Pinker. [11]

The Ngram Viewer was initially based on the 2009 edition of the Google Books Ngram Corpus. As of July 2020, the program supports the 2009, 2012, and 2019 versions of the corpus.

Operation and restrictions

Search terms are separated by commas, each indicating a separate word or phrase to find. [8] The Ngram Viewer returns the results as a plotted line chart.

To adjust for the fact that more books are published in some years than in others, the data are normalized: the raw count for each year is divided by the number of books published in that year, yielding a relative level. [8]
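
As a toy illustration of that normalization (all counts below are invented for the example), dividing each year's raw match count by a per-year total yields the relative level that is plotted:

```python
# Toy illustration of per-year normalization: raw yearly matches for an
# n-gram are divided by a per-year total, so years with more published
# material do not dominate the chart. All numbers are invented.
raw_matches = {1900: 120, 1950: 480, 2000: 2400}            # hits for one n-gram
yearly_totals = {1900: 1_000_000, 1950: 5_000_000, 2000: 40_000_000}

relative = {year: raw_matches[year] / yearly_totals[year] for year in raw_matches}
print(relative)  # {1900: 0.00012, 1950: 9.6e-05, 2000: 6e-05}
# Raw counts rise 20-fold, yet the relative level actually falls.
```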

Due to limitations on the size of the Ngram database, only matches found in at least 40 books are indexed in the database. [8]

Limitations

The data set has been criticized for its reliance on inaccurate OCR, its overabundance of scientific literature, and its inclusion of large numbers of incorrectly dated and categorized texts. [12] [13] Because of these errors, and because the corpus is uncontrolled for bias [14] (for example, the increasing amount of scientific literature causes other terms to appear to decline in popularity), it is risky to use this corpus to study language or test theories. [15] Since the data set does not include metadata, it may not reflect general linguistic or cultural change [16] and can only hint at such effects.

Guidelines for doing research with data from Google Ngram have been proposed that address many of the issues discussed above. [17]

OCR issues

Optical character recognition, or OCR, is not always reliable, and some characters may not be scanned correctly. In particular, systematic errors such as the confusion of s and f in pre-19th-century texts (due to the use of ſ, the long s, which was similar in appearance to f) can cause systematic bias. Although Google Ngram Viewer claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that frequencies given for languages such as Chinese may only be accurate from 1970 onward, with earlier parts of the corpus showing no results at all for common terms, and data for some years containing more than 50% noise. [18] [19]
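
As a toy illustration of the long-s problem (the ſ→f misreading is the attested OCR error; the token list is invented):

```python
# Pre-1800 printings used the long s (ſ) in words like "beſt"; OCR that
# misreads ſ as f yields "beft", so the count for "best" is deflated and
# a spurious token "beft" appears in the corpus instead.
original_tokens = ["beſt", "ſon", "uſe", "best"]             # as printed
ocr_tokens = [t.replace("ſ", "f") for t in original_tokens]  # misread by OCR

print(ocr_tokens)               # ['beft', 'fon', 'ufe', 'best']
print(ocr_tokens.count("best")) # 1, although "best" occurred twice in print
```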

See also

Computational social science
Corpus linguistics
Corpus of Contemporary American English
Culturomics
Google Books
International Corpus of English
Internet linguistics
Linguistic Linked Open Data
Mark Davies (linguist)
n-gram
Optical character recognition
Outline of natural-language processing
Sketch Engine
TenTen Corpus Family
Text corpus
Word embedding
Word list
Word2vec

References

  1. "Quantitative analysis of culture using millions of digitized books" JB Michel et al, Science 2011, DOI: 10.1126/science.1199644
  2. 1 2 3 "Google Ngram Database Tracks Popularity Of 500 Billion Words" Huffington Post, 17 December 2010, webpage: HP8150.
  3. 1 2 "Google's Ngram Viewer: A time machine for wordplay", Cnet.com, 17 December 2010, webpage: CN93 Archived 2014-01-23 at the Wayback Machine .
  4. @searchliaison (July 13, 2020). "The Google Books Ngram Viewer has now been updated with fresh data through 2019" (Tweet). Retrieved 2020-08-11 via Twitter.
  5. 1 2 "Google Books Ngram Viewer - University at Buffalo Libraries", Lib.Buffalo.edu, 22 August 2011, webpage: Buf497 Archived 2013-07-02 at the Wayback Machine
  6. 1 2 "Google Books Ngram Viewer info page".
  7. "Google Ngram Viewer - Google Books", Books.Google.com, May 2012, webpage: G-Ngrams.
  8. 1 2 3 4 "Google Ngram Viewer - Google Books" (Information), Books.Google.com, December 16, 2010, webpage: G-Ngrams-info: notes bigrams and use of quotes for words with apostrophes.
  9. Greenfield, Patricia M. (September 2013). "The Changing Psychology of Culture From 1800 Through 2000". Psychological Science. 24 (9): 1722–1731. doi:10.1177/0956797613479387. ISSN 0956-7976. PMID 23925305. S2CID 6123553.
  10. Younes, Nadja; Reips, Ulf-Dietrich (October 2018). "The changing psychology of culture in German-speaking countries: A Google Ngram study". International Journal of Psychology. 53: 53–62. doi:10.1002/ijop.12428. PMID 28474338. S2CID 7440938.
  11. The RSA (4 February 2010). "Steven Pinker – The Stuff of Thought: Language as a window into human nature" via YouTube.
  12. "Google Ngrams: OCR and Metadata". ResourceShelf. 19 December 2010. Archived 2016-04-27 at the Wayback Machine.
  13. Nunberg, Geoff (16 December 2010). "Humanities research with the Google Books corpus". Archived from the original on 10 March 2016.
  14. Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan; Barrat, Alain (7 October 2015). "Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution". PLOS ONE. 10 (10): e0137041. arXiv:1501.00960. Bibcode:2015PLoSO..1037041P. doi:10.1371/journal.pone.0137041. PMC 4596490. PMID 26445406.
  15. Zhang, Sarah. "The Pitfalls of Using Google Ngram to Study Language". WIRED. Retrieved 2017-05-24.
  16. Koplenig, Alexander (2015-09-02). "The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII". Digital Scholarship in the Humanities. 32 (1) (published 2017-04-01): 169–188. doi:10.1093/llc/fqv037. ISSN 2055-7671.
  17. Younes, Nadja; Reips, Ulf-Dietrich (2019-03-22). "Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms". PLOS ONE. 14 (3): e0213554. Bibcode:2019PLoSO..1413554Y. doi:10.1371/journal.pone.0213554. ISSN 1932-6203. PMC 6430395. PMID 30901329.
  18. "Google n-grams and pre-modern Chinese". digitalsinology.org.
  19. "When n-grams go bad". digitalsinology.org.
