Text segmentation

Last updated May 10, 2024

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.

Segmentation problems

Word segmentation

Word segmentation is the problem of dividing a string of written language into its component words.

In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages emically regard collocations and compounds. Many English compound nouns are variably written (for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty) with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.
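As a small illustration of why the space is only an approximation, the sketch below counts whitespace-delimited tokens for the three spellings of the same compound. The function name space_tokenize is a hypothetical helper, not part of any library.

```python
def space_tokenize(text):
    """Split on runs of whitespace; a rough word divider for space-delimited scripts."""
    return text.split()

# The same compound noun, written three ways (illustrative input).
variants = ["ice box", "ice-box", "icebox"]
counts = {v: len(space_tokenize(v)) for v in variants}
print(counts)  # {'ice box': 2, 'ice-box': 1, 'icebox': 1}
```

The open spelling yields two tokens while the hyphenated and solid spellings yield one, so a downstream system sees different "words" for what speakers may treat as one noun.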

However, the equivalent of the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.

In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

The Unicode Consortium has published a Standard Annex on Text Segmentation,^[1] exploring the issues of segmentation in multiscript texts.

Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
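A minimal sketch of dictionary-based word splitting follows. It assumes a small hypothetical vocabulary and picks the split with the fewest words using dynamic programming; real systems typically weigh candidates with word statistics instead. The names split_words and VOCAB are illustrative only.

```python
def split_words(text, vocabulary):
    """Return a split of `text` into the fewest vocabulary words, or None if impossible."""
    n = len(text)
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list covering text[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Hypothetical toy vocabulary; "segmentation" can also be read as "segment" + "at" + "ion".
VOCAB = {"text", "segment", "segmentation", "at", "ion", "is", "hard"}
print(split_words("textsegmentationishard", VOCAB))
# ['text', 'segmentation', 'is', 'hard']
```

Preferring fewer words is only one possible tie-breaking rule; it happens to avoid the over-split reading here but is not guaranteed to match the intended reading in general.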

Word splitting may also refer to the process of hyphenation.

Some scholars have suggested that modern Chinese should be written with word segmentation, that is, with spaces between words as in written English,^[2] because some texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree.). For more details, see Chinese word-segmented writing.
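The ambiguity in the example above can be reproduced by enumerating every way to cover the string with dictionary words. The sketch below assumes a tiny hand-picked lexicon (hypothetical, chosen only to cover this sentence) and omits the final punctuation mark.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

# Assumed toy lexicon: 美 (US, abbreviated), 美国 (US), 国会 (Congress), 会 (will), 不同意 (not agree).
LEXICON = {"美", "美国", "国会", "会", "不同意"}
readings = all_segmentations("美国会不同意", LEXICON)
for words in readings:
    print(" ".join(words))
# 美 国会 不同意
# 美国 会 不同意
```

Both outputs are valid dictionary segmentations, so a segmenter needs context or statistics beyond the lexicon to choose between them.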

Intent segmentation

Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).

In English and all other languages, the core intent or desire is identified and becomes the cornerstone of the keyphrase in intent segmentation. A core product or service, idea, action, or thought anchors the keyphrase. For example:

"[All things are made of atoms]. [Little particles that move] [around in perpetual motion], [attracting each other] [when they are a little distance apart], [but repelling] [upon being squeezed] [into one another]."

Sentence segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop/period character, is a reasonable approximation. However, even in English this problem is not trivial, because the full stop character is also used for abbreviations, which may or may not also terminate a sentence. For example, Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
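The abbreviation-table idea can be sketched as follows. The table ABBREVIATIONS and the function split_sentences are hypothetical and deliberately small; a practical table would be much larger and domain-specific.

```python
# Illustrative abbreviation table (an assumption, not an exhaustive list).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on tokens ending in '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Mr. Smith went to the shops in Jones Street. He bought bread."
print(split_sentences(text))
# ['Mr. Smith went to the shops in Jones Street.', 'He bought bread.']
```

The table cannot resolve an abbreviation that also ends a sentence: with this sketch, "He lives on Jones St. She visits often." comes back as a single sentence, which is the residual ambiguity described above.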

As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.

Topic segmentation

Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple classification of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in document classification.

Segmenting the text into topics or discourse turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in topic detection and tracking systems and text summarization problems.

Many different approaches have been tried,^[3]^[4] for example hidden Markov models (HMM), lexical chains, passage similarity using word co-occurrence, clustering, and topic modeling.
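Of the approaches listed, passage similarity based on shared words is the simplest to sketch. The example below is a simplified illustration, not any published algorithm: it compares bag-of-words vectors of adjacent sentences and places a boundary where cosine similarity drops below an assumed threshold of 0.1. The names cosine and topic_boundaries are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_boundaries(sentences, threshold=0.1):
    """Return indices i such that a topic boundary is placed before sentences[i]."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [i for i in range(1, len(bags)) if cosine(bags[i - 1], bags[i]) < threshold]

sentences = [
    "the cat chased the mouse",
    "the mouse escaped the cat",
    "stock prices fell sharply today",
    "investors sold stock as prices fell",
]
print(topic_boundaries(sentences))  # [2]
```

Sentences 0 and 1 share most of their words and sentences 2 and 3 share three, while sentences 1 and 2 share none, so the only boundary falls before sentence 2. Real systems usually compare larger windows and smooth the similarity curve rather than thresholding single sentences.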

The task is quite ambiguous: people evaluating text segmentation systems often disagree on where topic boundaries fall. Hence, evaluating text segmentation is also a challenging problem.
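One way to quantify disagreement between two segmentations is a window-based error rate in the style of the Pk metric used in the segmentation literature. The sketch below is an assumption-laden illustration: boundaries are given as sets of unit indices (a boundary at i falls before unit i, with 0 < i < n_units), the window size k is passed explicitly, and the function names are hypothetical.

```python
def segment_ids(n_units, boundaries):
    """Map each unit (e.g. sentence) to the index of the segment containing it."""
    ids, current = [], 0
    for i in range(n_units):
        if i in boundaries:
            current += 1
        ids.append(current)
    return ids

def window_error(reference, hypothesis, n_units, k):
    """Fraction of windows (i, i+k) where the two segmentations disagree on
    whether both ends lie in the same segment. Requires 0 < k < n_units."""
    ref = segment_ids(n_units, reference)
    hyp = segment_ids(n_units, hypothesis)
    comparisons = n_units - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(comparisons)
    )
    return errors / comparisons

# Ten units; the reference splits before unit 5, the hypothesis one unit early.
print(window_error({5}, {5}, 10, k=2))  # 0.0
print(window_error({5}, {4}, 10, k=2))  # 0.25
```

A near miss is penalized less than it would be under exact boundary matching, but in this toy case a hypothesis with no boundaries at all also scores 0.25, which hints at why the choice of metric is itself debated.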

Automatic segmentation approaches

Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.

When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. For example, processing text used in medical records is a very different problem from processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

Manual analysis of the text and writing of custom software
Annotation of the sample corpus with boundary information and use of machine learning
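For the second approach, boundary annotations are commonly turned into per-character labels that a sequence classifier can learn from. The sketch below assumes a two-label scheme ("B" for a character that begins a word, "I" for any other character); the scheme and function names are illustrative choices, not something prescribed above.

```python
def boundary_labels(words):
    """Convert a hand-segmented word list into (character, label) training pairs."""
    pairs = []
    for word in words:
        for position, char in enumerate(word):
            pairs.append((char, "B" if position == 0 else "I"))
    return pairs

def words_from_labels(pairs):
    """Rebuild the word list from (character, label) pairs, e.g. from model output."""
    words = []
    for char, label in pairs:
        if label == "B" or not words:
            words.append(char)
        else:
            words[-1] += char
    return words

annotated = ["美国", "会", "不同意"]  # one annotated sentence from a hypothetical corpus
pairs = boundary_labels(annotated)
print(pairs)
# [('美', 'B'), ('国', 'I'), ('会', 'B'), ('不', 'B'), ('同', 'I'), ('意', 'I')]
print(words_from_labels(pairs) == annotated)  # True
```

Any classifier that predicts one label per character can then be trained on such pairs, and its predictions decoded back into words with the second function.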

Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
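A minimal sketch of using markup as boundary evidence is shown below, built on Python's standard html.parser module. The set BLOCK_TAGS is an assumed, incomplete list of tags treated as paragraph-level boundaries, and BlockSplitter and split_blocks are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed set of tags whose start or end signals a paragraph-level boundary.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br"}

class BlockSplitter(HTMLParser):
    """Collect text blocks, cutting whenever a block-level tag opens or closes."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside block tags
    return parser.blocks

print(split_blocks("<h1>Title</h1><p>First <b>bold</b> text.</p><p>Second</p>"))
# ['Title', 'First bold text.', 'Second']
```

Inline tags such as b do not cut the text, so each returned block can be passed to a sentence segmenter without a heading being merged into the following paragraph.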

References

[1] UAX #29
[2] Zhang, Xiao-heng (1998). "也谈汉语书面语的分词问题——分词连写十大好处 (Written Chinese Word Segmentation Revisited: Ten advantages of word-segmented writing)". Journal of Chinese Information Processing. 12 (3): 58–64.
[3] Freddy Y. Y. Choi (2000). "Advances in domain independent linear text segmentation" (PDF). Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00). pp. 26–33.
[4] Jeffrey C. Reynar (1998). "Topic Segmentation: Algorithms and Applications" (PDF). IRCS-98-21. University of Pennsylvania. Retrieved 8 November 2007.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
