Prototype generation algorithms for data streams and large training datasets

The next meeting of the SDE lab will take place on Thursday 09/01/2020 at 19:00 in lab 434. Mr. Stefanos Ougiaroglou, postdoctoral researcher of the department, whose postdoctoral research is supervised by Mr. G. Evangelidis, will give a presentation titled “Prototype generation algorithms for data streams and large training datasets”.

Abstract:

The efficiency and effectiveness of classification algorithms depend on the quality and size of the available training data. Most algorithms cannot handle large training datasets that do not fit into main memory. In the case of the Nearest Neighbor Classifier, which calculates the distances between an unclassified instance and all training instances, the available training data is often pre-processed by a data reduction technique. The goal is to reduce the number of distances calculated and, consequently, the computational cost. Prototype selection algorithms and prototype generation algorithms are such techniques. They create a condensing set that is small, represents the original set, and lowers the computational cost of classification. However, these algorithms have not been applied to eager classifiers, which are not based on searching for nearest neighbors. Also, most prototype selection and generation algorithms cannot handle data streams and usually have high complexity. As a result, they introduce a time-consuming pre-processing step that is often prohibitive. Finally, noise in training data is misleading and leads to large condensing sets. These observations motivate the postdoctoral research. Its contribution is the development of noise-resistant prototype generation algorithms that are appropriate for streaming environments or large datasets, where applying the “conventional” data reduction techniques is inappropriate or even prohibitive because of the high computational cost. The performance of the new algorithms was evaluated empirically on several well-known classification datasets and, where necessary, the experimental results were statistically analyzed using the Wilcoxon signed-rank test.
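
To make the idea of a condensing set concrete, here is a minimal Python sketch. It is not one of the algorithms presented in the talk: it assumes the simplest possible prototype generation scheme, one running-mean prototype per class, updated one instance at a time so that it could in principle follow a stream. The class name `StreamingCentroids` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


class StreamingCentroids:
    """Illustrative only: one mean prototype per class, updated per instance."""

    def __init__(self) -> None:
        # label -> (number of instances seen, running mean vector)
        self.prototypes: Dict[str, Tuple[int, List[float]]] = {}

    def update(self, x: Sequence[float], label: str) -> None:
        count, mean = self.prototypes.get(label, (0, [0.0] * len(x)))
        count += 1
        # Incremental mean: m_new = m_old + (x - m_old) / n
        mean = [m + (xi - m) / count for m, xi in zip(mean, x)]
        self.prototypes[label] = (count, mean)

    def predict(self, x: Sequence[float]) -> str:
        # 1-NN over the prototypes only, not over every training instance.
        return min(
            self.prototypes,
            key=lambda label: math.dist(x, self.prototypes[label][1]),
        )


# Hypothetical toy stream of (instance, class label) pairs.
stream = [
    ((1.0, 1.0), "a"), ((1.0, 3.0), "a"),
    ((8.0, 8.0), "b"), ((10.0, 8.0), "b"),
]
model = StreamingCentroids()
for x, label in stream:
    model.update(x, label)
```

After the stream is consumed, classifying a new instance costs one distance computation per class rather than one per training instance, which is the saving the abstract refers to. A single mean per class is a deliberately crude stand-in: it cannot represent classes with several clusters and is not noise-resistant.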

Keywords: Classification, Data streams, Data reduction, Prototype generation, Convex hull, Clustering, Parallelism, MapReduce
