Machine Learning Made Easy
LightSide began as an open source text mining and machine learning tool. Our core technology is freely available. We offer quick-start tutorials on machine learning for beginners and an introduction to error analysis. The workbench, as well as the core technology for machine learning and feature extraction, was developed under grants from the National Science Foundation and the Office of Naval Research, through Carnegie Mellon University’s Language Technologies Institute. This core is available as GPLv3 open source on Bitbucket. You can also download the research workbench directly.
Here’s an overview of what to expect.
Fast, Automated Feature Extraction
As a researcher working on text data, you’re well aware of how much time is spent writing the same code over and over. Every new project starts with code for parsing a dataset and extracting term vectors and other basic features.
LightSide streamlines this. Once you’ve moved text into a consistent CSV data format, all basic feature extraction can be done through our simple point-and-click interface, meaning that surface features and some simple natural language processing tools are available to you with zero programming effort.
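To make the repeated boilerplate concrete, here is a minimal sketch of the kind of hand-written code that point-and-click feature extraction replaces. It is not LightSide's implementation. The column names `text` and `label`, the tokenizer, and the function names are all assumptions chosen for illustration.

```python
import csv
import io
import re
from collections import Counter


def extract_unigrams(text):
    """Lowercase the text and count its word tokens (a simple bag of words)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def build_term_vectors(csv_file, text_column="text", label_column="label"):
    """Read a CSV file object and return (vocabulary, vectors, labels).

    The column names are hypothetical; adjust them to match your data.
    """
    rows = list(csv.DictReader(csv_file))
    counts = [extract_unigrams(row[text_column]) for row in rows]
    # The vocabulary is every term seen in any document, in sorted order.
    vocabulary = sorted(set().union(*counts))
    # Each document becomes a vector of term counts over that vocabulary.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    labels = [row[label_column] for row in rows]
    return vocabulary, vectors, labels


# A tiny in-memory CSV stands in for a real dataset file.
sample = io.StringIO(
    "text,label\n"
    "the movie was great,pos\n"
    "the plot was dull,neg\n"
)
vocabulary, vectors, labels = build_term_vectors(sample)
print(vocabulary)
print(vectors)
print(labels)
```

Even this small version has to make decisions about tokenization, casing, and vocabulary order, which is why the same code tends to be rewritten for every project.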
For researchers interested in feature engineering, LightSide’s FeaturePlugin interface lets you write a single Java method and enter it seamlessly into an end-to-end machine learning pipeline.
State-of-the-Art Machine Learning
We provide immediate, intuitive interfaces for Naïve Bayes classification, support vector machines, logistic regression, and decision trees, as well as linear regression (through both least-squares and support vector regression) for numeric prediction. If those aren’t sufficient for you, we also provide direct, within-program access to Weka’s extensive suite of classification algorithms.
Our interface makes evaluation easy, too. Held-out test sets, as well as a variety of cross-validation methods, are available as point-and-click options. Multiple files can be loaded at once. Performance is evaluated on common metrics, including accuracy and more meaningful values such as kappa.
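A short worked example shows why kappa can be more meaningful than accuracy. The sketch below computes Cohen's kappa from its standard definition, (observed agreement - chance agreement) / (1 - chance agreement). The data and function names are made up for illustration, and returning 0.0 when chance agreement is 1 is a convention chosen here, not something the workbench is known to do.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of instances where the prediction matches the gold label."""
    matches = sum(1 for g, p in zip(gold, predicted) if g == p)
    return matches / len(gold)


def cohens_kappa(gold, predicted):
    """Agreement between gold and predicted labels, corrected for chance."""
    n = len(gold)
    observed = accuracy(gold, predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(gold_counts[label] * pred_counts[label]
                   for label in gold_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; 0.0 is an assumed convention for this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Illustrative data: a skewed dataset and a model that always predicts "neg".
gold = ["neg"] * 8 + ["pos"] * 2
predicted = ["neg"] * 10

print(accuracy(gold, predicted))      # 0.8
print(cohens_kappa(gold, predicted))  # 0.0
```

The always-"neg" model reaches 80% accuracy only because 80% of the data is "neg". Its kappa of 0 shows that it does no better than chance agreement.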
Fast, Informative Error Analysis
Too often, machine learning is a black box. Researchers choose a set of features and a model to train, and they get an accuracy reported back to them by their scripts. If they’re lucky, they have a pipeline set up that allows them to tweak that behavior and evaluate performance changes. Actually looking at the text that’s being misclassified and thinking deeply about why an algorithm thinks it should be labeled a certain way almost never happens.
We’re changing that. Over years of experience, we’ve developed a set of interfaces and tools that direct you towards the instances your algorithm misclassifies, single out the features associated with those errors, and provide direct comparisons between models, so that it becomes obvious what’s going right and what’s going wrong with a particular set of feature extraction and machine learning choices.
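As a rough illustration of this kind of error analysis, the sketch below lists the misclassified instances and ranks terms by how often the instances containing them are misclassified. It is a simplified stand-in, not the method the workbench uses. The data, the dictionary keys, and the `min_count` threshold are all assumptions.

```python
from collections import Counter


def misclassified(instances):
    """Return the instances whose predicted label differs from the gold label."""
    return [inst for inst in instances if inst["gold"] != inst["predicted"]]


def error_rate_by_term(instances, min_count=2):
    """Rank terms by the share of instances containing them that are errors.

    Terms that appear in fewer than min_count instances are skipped.
    """
    seen = Counter()
    wrong = Counter()
    for inst in instances:
        # Count each term once per instance, using whitespace tokenization.
        terms = set(inst["text"].lower().split())
        seen.update(terms)
        if inst["gold"] != inst["predicted"]:
            wrong.update(terms)
    rates = {t: wrong[t] / seen[t] for t in seen if seen[t] >= min_count}
    # Highest error rate first; ties are broken alphabetically.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data: gold labels next to a hypothetical model's predictions.
instances = [
    {"text": "not good at all", "gold": "neg", "predicted": "pos"},
    {"text": "good fun", "gold": "pos", "predicted": "pos"},
    {"text": "not bad", "gold": "pos", "predicted": "neg"},
    {"text": "bad acting", "gold": "neg", "predicted": "neg"},
]

errors = misclassified(instances)
ranked = error_rate_by_term(instances)
for inst in errors:
    print(inst["text"], inst["gold"], inst["predicted"])
print(ranked)
```

In this toy data, every instance containing "not" is misclassified, which suggests that plain unigram features are missing negation. Reading the misclassified text is what leads to an observation like that.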
LightSide has a long and storied history, which includes contributions from Moonyoung Kang, Sourish Chaudhuri, Yi-Chia Wang, Mahesh Joshi, Eric Rosé, Martin Van Velsen, and Carolyn Penstein Rosé. Development of the open-source LightSide platform, including the core of the machine learning and feature extraction tools, as well as the GUI research workbench, has been and continues to be funded in part by grants to Carnegie Mellon’s Language Technologies Institute from the National Science Foundation and the Office of Naval Research:
- ONR N000141110221 (PI Rosé) Towards Optimization of Macrocognitive Processes: Automating Analysis of the Emergence of Leadership in Ad Hoc Teams
- NSF IIS-0968485 (PI Kraut) Conversational Dynamics in Online Support Groups
- NSF DRL-0835426 (PI Rosé) Dynamic Support for Virtual Math Teams
- NSF SBE 0836012 (PI Koedinger) Pittsburgh Sciences of Learning Center
- NSF HCC-0803482 (PI Fussell) HCC Medium: Dynamic Support for Computer-Mediated Intercultural Communication
- ONR N000141010277 (PI Stahl) Theories and Models of Group Cognition
- NSF REESE/REC 0723580 (PI Rosé) Exploring Adaptive Support for Virtual Math Teams
- ONR N000140811033 (PI Rosé) TFLex project extension: Expanding the Accessibility and Impact of Language Technologies for Supporting Education
LightSide’s Research Workbench is licensed under the GPLv3. We make use of several open-source packages, including Weka, LibLinear, Apache Commons Math, RiverLayout, the Stanford POS Tagger, Trove, and icons courtesy of FamFamFam.