Conversation on data mining strategies in LC-MS untargeted metabolomics: pre-processing and pre-treatment steps

Tugizimana, F; Steenkamp, Paul A; Piater, LA; Dubery, IA

dc.contributor.author	Tugizimana, F
dc.contributor.author	Steenkamp, Paul A
dc.contributor.author	Piater, LA
dc.contributor.author	Dubery, IA
dc.date.accessioned	2017-05-16T09:51:47Z
dc.date.available	2017-05-16T09:51:47Z
dc.date.issued	2016-11
dc.identifier.citation	Tugizimana, F., Steenkamp, P.A., Piater, L.A. et al. 2016. A conversation on data mining strategies in LC-MS untargeted metabolomics: pre-processing and pre-treatment steps. Metabolites, vol. 6(4): 18 pp. doi: 10.3390/metabo6040040	en_US
dc.identifier.issn	2218-1989
dc.identifier.uri	http://www.mdpi.com/2218-1989/6/4/40
dc.identifier.uri	10.3390/metabo6040040
dc.identifier.uri	http://hdl.handle.net/10204/9036
dc.description	© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).	en_US
dc.description.abstract	Untargeted metabolomic studies generate information-rich, high-dimensional, and complex datasets that remain challenging to handle and fully exploit. Despite the remarkable progress in the development of tools and algorithms, the “exhaustive” extraction of information from these metabolomic datasets is still a non-trivial undertaking. A conversation on data mining strategies for a maximal information extraction from metabolomic data is needed. Using a liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomic dataset, this study explored the influence of collection parameters in the data pre-processing step, scaling and data transformation on the statistical models generated, and feature selection, thereafter. Data obtained in positive mode generated from an LC-MS-based untargeted metabolomic study (sorghum plants responding dynamically to infection by a fungal pathogen) were used. Raw data were pre-processed with MarkerLynx™ software (Waters Corporation, Manchester, UK). Here, two parameters were varied: the intensity threshold (50–100 counts) and the mass tolerance (0.005–0.01 Da). After the pre-processing, the datasets were imported into SIMCA (Umetrics, Umeå, Sweden) for more data cleaning and statistical modeling. In addition, different scaling (unit variance, Pareto, etc.) and data transformation (log and power) methods were explored. The results showed that the pre-processing parameters (or algorithms) influence the output dataset with regard to the number of defined features. Furthermore, the study demonstrates that the pre-treatment of data prior to statistical modeling affects the subspace approximation outcome: e.g., the amount of variation in X-data that the model can explain and predict. The pre-processing and pre-treatment steps subsequently influence the number of statistically significant extracted/selected features (variables). Thus, as informed by the results, to maximize the value of untargeted metabolomic data, understanding of the data structures and exploration of different algorithms and methods (at different steps of the data analysis pipeline) might be the best trade-off, currently, and possibly an epistemological imperative.	en_US
dc.language.iso	en	en_US
dc.publisher	MDPI AG, Basel, Switzerland	en_US
dc.rights	CC0 1.0 Universal	*
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	*
dc.subject	Chemometrics	en_US
dc.subject	Data mining	en_US
dc.subject	Metabolomics	en_US
dc.subject	Pre-processing	en_US
dc.subject	Pre-treatment	en_US
dc.title	Conversation on data mining strategies in LC-MS untargeted metabolomics: pre-processing and pre-treatment steps	en_US
dc.type	Article	en_US
dc.identifier.apacitation	Tugizimana, F., Steenkamp, P. A., Piater, L., & Dubery, I. (2016). Conversation on data mining strategies in LC-MS untargeted metabolomics: pre-processing and pre-treatment steps. http://hdl.handle.net/10204/9036	en_ZA
dc.identifier.chicagocitation	Tugizimana, F, Paul A Steenkamp, LA Piater, and IA Dubery "Conversation on data mining strategies in LC-MS untargeted metabolomics: pre-processing and pre-treatment steps." (2016) http://hdl.handle.net/10204/9036	en_ZA
dc.identifier.vancouvercitation	Tugizimana F, Steenkamp PA, Piater L, Dubery I. Conversation on data mining strategies in LC-MS untargeted metabolomics: pre-processing and pre-treatment steps. 2016; http://hdl.handle.net/10204/9036.	en_ZA

Files in this item

Name: Tugizimana_2016.pdf

Size: 2.042Mb

Format: PDF

Description: Article

This item appears in the following Collection(s)

Journal Articles
