RubyCAOS
Ahmed Ragab Nabhan, MS1,3; Indra Neil Sarkar, PhD, MLIS1,2,3
1. Center for Clinical and Translational Science, University of Vermont, Burlington, VT USA
2. Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT USA
3. Department of Computer Science, University of Vermont, Burlington, VT USA
Contact: Prof. Indra Neil Sarkar
In an ideal world, a full phylogenetic analysis would be done on every new sequence found.
However, this is often a time-consuming and computer-intensive task. The Characteristic
Attribute Organization System (CAOS) is an algorithm to discover descriptive information
about a priori organized discretized data (e.g., from a phylogenetic analysis).
Derived from some of the fundamental tenets of evolutionary biology, rule sets can be
generated from data sets and then used for efficient and effective classification of
novel data elements. It is important to emphasize that CAOS is NOT a tree analysis; instead, it
is a classification scheme based on a phylogenetic analysis. Based on information (rules)
discovered from the phylogenetic analysis that unambiguously distinguish between each node
on a tree, CAOS is able to classify novel sequences. Studies have indicated that CAOS-based
classification is over 95% accurate, as determined by where sequences
would be classified if a full phylogenetic analysis were to be done using the novel and known
sequences. Publications about CAOS as well as those citing or using CAOS can be found
here.
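To make the idea of rule discovery concrete, the following is a minimal sketch in Ruby, not the RubyCAOS implementation. It assumes aligned, equal-length sequences that have already been assigned to groups, and it only looks for the simplest kind of rule: a single position where every member of a group shares one state and no sequence outside the group has that state. The method name simple_pure_diagnostics and the input layout are hypothetical and chosen only for illustration.

```ruby
# Hypothetical helper, not part of the RubyCAOS API.
# groups: Hash mapping a group label to an Array of equal-length aligned sequences.
# Returns a Hash mapping each label to { position => state } for positions where
# the group is fixed for one state and that state is absent from all other groups.
def simple_pure_diagnostics(groups)
  length = groups.values.flatten.first.length
  groups.each_with_object({}) do |(label, seqs), rules|
    others = groups.reject { |other, _| other == label }.values.flatten
    rules[label] = {}
    (0...length).each do |pos|
      states = seqs.map { |s| s[pos] }.uniq
      next unless states.size == 1                        # group must be fixed here
      next if others.any? { |s| s[pos] == states.first }  # state must be absent elsewhere
      rules[label][pos] = states.first
    end
  end
end

# Toy data (made up for this example): two groups of aligned sequences.
groups = {
  "A" => ["ACGT", "ACGA"],
  "B" => ["ATGT", "ATGC"]
}
diagnostics = simple_pure_diagnostics(groups)
p diagnostics
```

In this toy data set only position 1 (zero-based) separates the two groups: position 0 and position 2 are shared by both, and position 3 varies within each group.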
RubyCAOS is an implementation of the CAOS approach designed to approximate a
parsimony-based tree analysis. The software tool computes diagnostic character states from
phylogenetic trees and uses them to classify new molecular sequences. RubyCAOS is a
platform-independent version of the same algorithm, which was originally written in Perl and C++.
This tool is freely available and can be used both as a standalone application and as a class
library.
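The classification step can be sketched in Ruby along the following lines. This is an illustrative assumption about how diagnostic character states might be applied, not the RubyCAOS class library: the Node struct, the matches? and classify methods, and the toy rules are all hypothetical. Each child node carries the { position => state } rules that distinguish it from its siblings, and a new sequence is walked down the tree for as long as exactly one child matches.

```ruby
# Hypothetical data layout, not the RubyCAOS file format or API.
# rules: { zero-based position => expected state } distinguishing a node from its siblings.
Node = Struct.new(:name, :rules, :children)

# True when the sequence carries every diagnostic state listed in the rules.
def matches?(rules, sequence)
  rules.all? { |pos, state| sequence[pos] == state }
end

# Descend from the root, following the single child whose rules all match.
# Stops at a leaf, or at the deepest node where zero or several children match.
def classify(node, sequence)
  loop do
    hits = node.children.select { |child| matches?(child.rules, sequence) }
    return node.name unless hits.size == 1
    node = hits.first
  end
end

# Toy tree (made up for this example).
tree = Node.new("root", {}, [
  Node.new("A", { 1 => "C" }, [
    Node.new("A1", { 3 => "T" }, []),
    Node.new("A2", { 3 => "A" }, [])
  ]),
  Node.new("B", { 1 => "T" }, [])
])

puts classify(tree, "GCGA")
```

Returning the deepest unambiguous node, rather than forcing a leaf assignment, is a design choice made for this sketch; a sequence that matches no child at the root is simply reported as "root".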
A Web version supports simple CAOS procedures (those based on single character positions),
either for extracting diagnostics only or for classifying new sequences.