This demo performs K-Means clustering (unsupervised classification) on text. The algorithm is implemented
in Python with scikit-learn and runs on Flask and Apache in a Docker container. To run the demo, choose the
number of clusters (default is 2), select a dataset, and click the Classify button. The output is a
zip file that contains an Excel spreadsheet with three sheets: one that lists each row of text in the
dataset along with its cluster number, one with the top keywords for each cluster, and one that summarizes
the results, including the total count for each cluster and a simple bar chart of those counts.
Two options are available for selecting a dataset. You can upload your own dataset, or use the default
dataset of 250 Amazon reviews provided below. If you upload a dataset, it must contain two columns of
comma-separated data with the following headers: id and text. The id column consists of integers that
increment beginning at 1, and the text column consists of sentences (text strings) enclosed in double quotation marks.
An example of the data file format is as follows:
id,text
1,"This movie was fantastic."
2,"Hated it!"
3,"The best movie I ever saw."
4,"Good acting and directing but the plot was confusing."
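
Before uploading, a file can be checked against this format with a few lines of Python. The sketch below uses only the standard csv module; the function name validate_dataset is hypothetical, and the check that ids increase by exactly 1 per row is an assumption about what "increment beginning at 1" means, not a documented rule of the demo.

```python
import csv
import io


def validate_dataset(csv_text):
    """Return the text column if csv_text matches the expected layout.

    Raises ValueError when the header or any row does not match.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["id", "text"]:
        raise ValueError("header row must be exactly: id,text")

    texts = []
    for row in reader:
        if not row:
            continue  # ignore blank lines, such as a trailing newline
        expected_id = len(texts) + 1
        if len(row) != 2:
            raise ValueError(f"row {expected_id}: expected 2 columns, found {len(row)}")
        # Assumption: ids start at 1 and increase by 1 on each row.
        if row[0].strip() != str(expected_id):
            raise ValueError(f"row {expected_id}: id should be {expected_id}, found {row[0]!r}")
        texts.append(row[1])
    return texts


sample = 'id,text\n1,"This movie was fantastic."\n2,"Good acting, but the plot was confusing."\n'
texts = validate_dataset(sample)
print(texts)
```

The second sample row shows why the text column is quoted: a comma inside the quotation marks stays part of the sentence instead of starting a third column.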