Data preprocessing


Data preprocessing can refer to manipulation, filtration or augmentation of data before it is analyzed, [1] and is often an important step in the data mining process. Data collection methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, amongst other issues.


The preprocessing pipeline used can have large effects on the conclusions drawn from the downstream analysis. Thus, the representation and quality of the data must be assured before any analysis is run. [2] Data preprocessing is often the most important phase of a machine learning project, especially in computational biology. [3] If there is a high proportion of irrelevant and redundant information, or of noisy and unreliable data, knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Examples of methods used in data preprocessing include cleaning, instance selection, normalization, one-hot encoding, data transformation, feature extraction and feature selection.
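As a minimal illustration, the following Python sketch (assuming the pandas and scikit-learn libraries are available; the table and its column names are hypothetical) applies two of these methods, normalization and one-hot encoding, to a small table:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Hypothetical raw data with one numeric and one categorical column.
    df = pd.DataFrame({
        "age": [23, 45, 31, 60],
        "blood_type": ["A", "B", "A", "O"],
    })

    # Normalization: rescale "age" to the [0, 1] range.
    df["age_scaled"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

    # One-hot encoding: expand "blood_type" into binary indicator columns.
    df = pd.get_dummies(df, columns=["blood_type"], prefix="bt")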

Applications

Data mining

Data preprocessing allows for the removal of unwanted data through data cleaning, leaving the user with a dataset that contains more valuable information for later stages of the data mining process. Editing such a dataset to correct data corruption or human error is a crucial step toward obtaining accurate quantifiers, such as the true positives, true negatives, false positives and false negatives of a confusion matrix, which are commonly used in medical diagnosis. Users can join data files together and use preprocessing to filter unnecessary noise from the data, which can allow for higher accuracy. A common workflow uses Python scripts with the pandas library, which can import data from a comma-separated values (CSV) file as a DataFrame. The DataFrame is then used to perform manipulations that would otherwise be challenging in a spreadsheet program such as Excel; pandas supports data analysis, manipulation, statistical operations and data visualization. Many practitioners use the R programming language for the same tasks.
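A minimal sketch of this workflow, assuming two hypothetical CSV files that share a patient_id column, might look as follows:

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    patients = pd.read_csv("patients.csv")
    labs = pd.read_csv("lab_results.csv")

    # Join the two data files on a common identifier.
    merged = patients.merge(labs, on="patient_id", how="inner")

    # Cleaning: drop exact duplicates and rows missing a critical field.
    merged = merged.drop_duplicates().dropna(subset=["diagnosis"])

    # Filter impossible (out-of-range) values, e.g. negative ages.
    merged = merged[(merged["age"] >= 0) & (merged["age"] <= 120)]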

Users transform existing files into new ones for many reasons. Aspects of data preprocessing may include imputing missing values, aggregating numerical quantities and transforming continuous data into categories (data binning). [4] More advanced techniques, such as principal component analysis and feature selection, rely on statistical methods and are applied to complex datasets such as those recorded by GPS trackers and motion capture devices.
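The three basic transformations can be sketched in a few lines of pandas; the readings and bin edges below are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "subject": ["a", "a", "b", "b", "b"],
        "speed": [12.0, None, 9.5, 30.2, 18.7],  # hypothetical GPS readings
    })

    # Imputation: replace the missing value with the column median.
    df["speed"] = df["speed"].fillna(df["speed"].median())

    # Binning: transform the continuous variable into categories.
    df["speed_band"] = pd.cut(df["speed"], bins=[0, 10, 20, 40],
                              labels=["slow", "moderate", "fast"])

    # Aggregation: summarize numerical quantities per subject.
    summary = df.groupby("subject")["speed"].agg(["mean", "max"])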

Semantic data preprocessing

Semantic data mining is a subset of data mining that specifically seeks to incorporate domain knowledge, such as formal semantics, into the data mining process. Domain knowledge is knowledge of the environment in which the data was produced. It can have a positive influence on many aspects of data mining, such as filtering out redundant or inconsistent data during the preprocessing phase. [5] Domain knowledge can also act as a constraint: it serves as a body of prior knowledge that reduces the search space and guides the mining process. Simply put, semantic preprocessing seeks to filter data more correctly and efficiently by drawing on the original environment of that data.

Increasingly complex problems call for more elaborate techniques to analyze existing information. Instead of writing a simple script for aggregating different numerical values into a single value, it can make sense to focus on semantics-based data preprocessing. [6] The idea is to build a dedicated ontology, which describes at a higher level what the problem is about. [7] In semantic data mining and semantic preprocessing, ontologies are a way to conceptualize and formally define semantic knowledge and data; Protégé is a widely used tool for constructing them. In general, ontologies bridge the gaps between data, applications, algorithms, and results that arise from semantic mismatches. As a result, semantic data mining combined with ontologies has many applications where semantic ambiguity can impact the usefulness and efficiency of data systems. Applications include the medical field, language processing, banking, [8] and even tutoring, [9] among many more.
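As a minimal sketch of the idea (using the Python rdflib library; the miniature ontology and all of its terms are invented for illustration), an ontology can serve as a constraint that filters records during preprocessing:

    from rdflib import RDF, RDFS, Graph, Namespace

    EX = Namespace("http://example.org/clinic#")  # hypothetical namespace
    g = Graph()

    # A miniature ontology: domain knowledge as a small class hierarchy.
    g.add((EX.Aspirin, RDF.type, EX.Drug))
    g.add((EX.Drug, RDFS.subClassOf, EX.Treatment))

    # Data records annotated with ontology terms.
    g.add((EX.record42, EX.prescribes, EX.Aspirin))
    g.add((EX.record43, EX.prescribes, EX.Unknown))

    # Constraint: keep only records whose prescribed item is a known Drug;
    # records pointing at terms outside the ontology are filtered out.
    valid = [s for s, _, o in g.triples((None, EX.prescribes, None))
             if (o, RDF.type, EX.Drug) in g]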

There are various strengths to a semantic data mining and ontology-based approach. As previously mentioned, these tools can help during the preprocessing phase by filtering undesirable data from the dataset. Additionally, well-structured formal semantics integrated into well-designed ontologies can yield data that machines can easily read and process. [10] A useful example exists in the medical use of semantic data processing. Suppose a patient is having a medical emergency and is being rushed to hospital, and the emergency responders are trying to determine the best medicine to administer. Under normal data processing, scouring all of the patient's medical data to ensure the best treatment could take too long and risk the patient's health or even life. With semantically processed ontologies, however, tools such as a semantic reasoner can infer the best medicine to administer based on the patient's medical history, such as whether they have a certain cancer or other conditions, by examining the natural language used in the patient's medical records. [11] First responders could then search for a medicine quickly and efficiently without having to analyze the medical history themselves, as the semantic reasoner would already have analyzed that data and found solutions.

In general, this illustrates the strength of semantic data mining and ontologies: they allow quicker and more efficient data extraction on the user side, because the semantically preprocessed data and the ontology built around it have already accounted for many of the relevant variables. There are, however, drawbacks. The approach requires substantial computational power and complexity, even with relatively small datasets, [12] which can result in higher costs and greater difficulty in building and maintaining semantic data processing systems. This can be mitigated somewhat if the dataset is already well organized and formatted, but even then the complexity remains higher than in standard data processing.
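A highly simplified sketch of this idea (not the system described in the cited work) can be written with rdflib and a SPARQL query; the vocabulary and the contraindication facts are invented for illustration:

    from rdflib import RDF, Graph, Namespace

    EX = Namespace("http://example.org/med#")  # hypothetical vocabulary
    g = Graph()

    # Ontology-style domain knowledge plus one patient's record.
    g.add((EX.Warfarin, RDF.type, EX.Medicine))
    g.add((EX.Ibuprofen, RDF.type, EX.Medicine))
    g.add((EX.Ibuprofen, EX.contraindicatedFor, EX.KidneyDisease))
    g.add((EX.patient1, EX.hasCondition, EX.KidneyDisease))

    # Ask for medicines that are not contraindicated for this patient.
    query = """
    SELECT ?m WHERE {
        ?m a ex:Medicine .
        FILTER NOT EXISTS {
            ?m ex:contraindicatedFor ?c .
            ex:patient1 ex:hasCondition ?c .
        }
    }
    """
    for row in g.query(query, initNs={"ex": EX}):
        print(row.m)  # prints only http://example.org/med#Warfarin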

Below is a simple diagram combining some of these processes, in particular semantic data mining and its use of ontology.

[Diagram: semantic data mining and its use of ontology]

The diagram depicts a dataset being broken into two parts: the characteristics of its domain, or domain knowledge, and the actual acquired data. The domain characteristics are processed into user-understood domain knowledge that can be applied to the data. Meanwhile, the dataset is processed and stored so that the domain knowledge can be applied to it, allowing the process to continue. This application forms the ontology. From there, the ontology can be used to analyze data and process results.

Fuzzy preprocessing is another, more advanced technique for solving complex problems. Fuzzy preprocessing and fuzzy data mining make use of fuzzy sets. A fuzzy set is defined by two elements: a set and a membership function that assigns each element a degree of membership between 0 and 1. Fuzzy preprocessing uses such fuzzy sets to ground numerical values in linguistic information, transforming raw data into natural language. Ultimately, the goal of fuzzy data mining is to help deal with inexact information, such as an incomplete database. Fuzzy preprocessing, like other fuzzy data mining techniques, currently sees frequent use with neural networks and artificial intelligence. [13]
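A minimal sketch of a fuzzy membership function in Python (the triangular shape and the thresholds are hypothetical choices) shows how a raw numeric reading is grounded in linguistic labels:

    def triangular(x, a, b, c):
        # Degree of membership in a triangular fuzzy set that rises
        # from a, peaks at b, and falls back to zero at c.
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)

    # Hypothetical fuzzy sets grounding a sensor reading in language.
    sets = {
        "low": lambda x: triangular(x, -1, 0, 50),
        "medium": lambda x: triangular(x, 25, 50, 75),
        "high": lambda x: triangular(x, 50, 100, 101),
    }

    reading = 60
    memberships = {label: f(reading) for label, f in sets.items()}
    # {'low': 0.0, 'medium': 0.6, 'high': 0.2} -> best label "medium"
    best = max(memberships, key=memberships.get)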


References

  1. "Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data". Tableau. Retrieved 2021-10-17.
  2. Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Los Altos, California.
  3. Chicco D (December 2017). "Ten quick tips for machine learning in computational biology". BioData Mining. 10 (35): 35. doi: 10.1186/s13040-017-0155-3 . PMC   5721660 . PMID   29234465.
  4. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN   978-0-387-84884-6.
  5. Dou, Deijing and Wang, Hao and Liu, Haishan. "Semantic Data Mining: A Survey of Ontology-based Approaches" (PDF). University of Oregon.{{cite web}}: CS1 maint: multiple names: authors list (link)
  6. Culmone, Rosario and Falcioni, Marco and Quadrini, Michela (2014). An ontology-based framework for semantic data preprocessing aimed at human activity recognition. SEMAPRO 2014: The Eighth International Conference on Advances in Semantic Processing. Alexey Cheptsov, High Performance Computing Center Stuttgart (HLRS). S2CID   196091422.{{cite conference}}: CS1 maint: multiple names: authors list (link)
  7. David Perez-Rey and Alberto Anguita and Jose Crespo (2006). OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data. Biological and Medical Data Analysis. Springer Berlin Heidelberg. pp. 262–272. doi:10.1007/11946465_24.
  8. Yerashenia, Natalia and Bolotov, Alexander and Chan, David and Pierantoni, Gabriele (2020). "Semantic Data Pre-Processing for Machine Learning Based Bankruptcy Prediction Computational Model". 2020 IEEE 22nd Conference on Business Informatics (CBI) (PDF). IEEE. pp. 66–75. doi:10.1109/CBI49978.2020.00015. ISBN   978-1-7281-9926-9. S2CID   219499599.{{cite book}}: CS1 maint: multiple names: authors list (link)
  9. Chang, Maiga; D'Aniello, Giuseppe; Gaeta, Matteo; Orciuoli, Francesco; Sampson, Demetrois; Simonelli, Carmine (2020). "Building Ontology-Driven Tutoring Models for Intelligent Tutoring Systems Using Data Mining". IEEE Access. 8. IEEE: 48151–48162. Bibcode:2020IEEEA...848151C. doi: 10.1109/ACCESS.2020.2979281 . S2CID   214594754.
  10. Dou, Deijing and Wang, Hao and Liu, Haishan. "Semantic Data Mining: A Survey of Ontology-based Approaches" (PDF). University of Oregon.{{cite web}}: CS1 maint: multiple names: authors list (link)
  11. Kahn, Atif and Doucette, John A. and Jin, Changjiu and Fu Lijie and Cohen, Robin. "AN ONTOLOGICAL APPROACH TO DATA MINING FOR EMERGENCY MEDICINE" (PDF). University of Waterloo.{{cite web}}: CS1 maint: multiple names: authors list (link)
  12. Sirichanya, Chanmee and Kraisak Kesorn (2021). "Semantic data mining in the information age: A systematic review". International Journal of Intelligent Systems. 36 (8): 3880–3916. doi: 10.1002/int.22443 . S2CID   235506360.
  13. Wong, Kok Wai and Fung, Chun Che and Law, Kok Way (2000). "Fuzzy preprocessing rules for the improvement of an artificial neural network well log interpretation model". 2000 TENCON Proceedings. Intelligent Systems and Technologies for the New Millennium (Cat. No.00CH37119). Vol. 1. IEEE. pp. 400–405. doi:10.1109/TENCON.2000.893697. ISBN   0-7803-6355-8. S2CID   10384426.{{cite book}}: CS1 maint: multiple names: authors list (link)