TY - GEN
T1 - Purifying data by machine learning with certainty levels
AU - Dolev, Shlomi
AU - Leshem, Guy
AU - Yagel, Reuven
PY - 2010/12/1
Y1 - 2010/12/1
N2 - A fundamental paradigm used for autonomic computing, self-managing systems, and decision-making under uncertainty and faults is machine learning. Machine learning uses a data set, that is, a set of data items. A data item is a vector of feature values and a classification. Occasionally these data sets include misleading data items that were either introduced by input device malfunctions or maliciously inserted to lead the machine learning to wrong conclusions. A reliable learning algorithm must be able to handle a corrupted data set. Otherwise, an adversary (or simply a malfunctioning input device that corrupts a portion of the data set) may cause inaccurate classifications. Therefore, the challenge is to find effective methods to evaluate and increase the certainty level of the learning process as much as possible. This paper introduces the use of a certainty level measure to obtain better classification capability in the presence of corrupted data items. Assuming a known data distribution (e.g., a normal distribution) and/or a known upper bound on the number of corrupted data items, our techniques define a certainty level for classifications. A second approach enhances the random forest technique to cope with corrupted data items by augmenting the certainty level of the classification obtained in each leaf of the forest. This method is also of independent interest, since it significantly improves the classification of the random forest machine learning technique in less severe settings.
KW - Certainty level
KW - Data corruption
KW - Machine learning
KW - PAC learning
UR - https://www.scopus.com/pages/publications/79953146986
U2 - 10.1145/1953563.1953567
DO - 10.1145/1953563.1953567
M3 - Conference contribution
AN - SCOPUS:79953146986
SN - 9781450306423
T3 - Proceedings of the 3rd International ACM Workshop on Reliability, Availability, and Security, WRAS 2010
BT - Proceedings of the 3rd International ACM Workshop on Reliability, Availability, and Security, WRAS 2010
T2 - 3rd International ACM Workshop on Reliability, Availability, and Security, WRAS 2010
Y2 - 29 July 2010 through 29 July 2010
ER -