Automatic Classification of Text Messages


casasoft
October 7th, 2010, 07:18 PM
We need to implement a learning system which will learn to classify messages into various categories.

We will have various categories in the system and then initially we will manually teach the system how to classify the input text messages.

Initially, during training we would:

1. Input a text message to the system
2. Manually rank the most appropriate categories for that message, for example mostly Category B, then Category A, and then Category C.

Training will continue until the system has learned enough patterns from the messages to classify them into the different categories.

The goal is that, later on, we can give the system an input message and it will automatically tell us which categories fit that message best, in weighted order. So we would input a text message and the system would output the best-matching categories, for example mostly Category D, then Category F, then Category G, and so on.

We would also like to be able to correct the system manually. If, for a particular message, the system ranks Category D best, followed by Category F and then Category G, we might want to correct it to say that the best is Category F, then Category G, and then Category D.

If you need any other clarifications of our requirements, let me know! :)

What is the best algorithm to implement such a learning system? Open for ideas!

Thanks in advance!

casasoft
October 8th, 2010, 02:06 PM
What do you think about Bayesian network algorithms or artificial neural networks for this kind of automatic classification? Any ideas?

MikeAThon
October 8th, 2010, 08:04 PM
I know nothing about this topic, but here are some quick Google results.

According to the following article, there are two main techniques for classification of text: discriminative methods and generative methods. The article is "How Much Noise is too Much: A Study in Automatic Text Classification" by Sumeet Agarwal and others, found at http://www.godbole.net/shantanu/pubs/howmuchnoise-icdm07.pdf . See Section 2, related work.

One example of a discriminative method, and the one used in the article, is the support vector machine (SVM), for which the authors use a free library called SVM-Light. SVM-Light is found at http://www.cs.cornell.edu/People/tj/svm_light/ . Look toward the middle of the page and you will see a worked example of a document classifier that uses SVM-Light, also downloadable. The task there is to learn which Reuters articles are about "corporate acquisitions" and then classify them automatically. The sample data includes 1000 positive and 1000 negative training examples, as well as 600 test examples.
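
To try SVM-Light on your own messages, you would first have to write them out in its input format: one example per line, a target of +1 or -1, then feature:value pairs in increasing feature order. Below is a rough Python sketch of that conversion, assuming a plain word-count representation. The helper names (tokenize, build_vocabulary, to_svmlight_line) and the sample messages are made up for illustration and are not part of SVM-Light.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(messages):
    """Give each distinct word a 1-based feature id, in first-seen order."""
    vocabulary = {}
    for text in messages:
        for token in tokenize(text):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


def to_svmlight_line(text, is_positive, vocabulary):
    """Format one message as '<target> <feature>:<value> ...'.

    Words missing from the vocabulary are skipped, and the pairs are
    sorted by feature id because the format expects increasing order.
    """
    counts = Counter(vocabulary[t] for t in tokenize(text) if t in vocabulary)
    features = " ".join(f"{fid}:{n}" for fid, n in sorted(counts.items()))
    target = "+1" if is_positive else "-1"
    return f"{target} {features}".rstrip()


if __name__ == "__main__":
    # Hypothetical training messages for a single category.
    training = [
        ("Acme to acquire Widget Corp", True),
        ("Wheat prices rise", False),
    ]
    vocab = build_vocabulary(text for text, _ in training)
    for text, label in training:
        print(to_svmlight_line(text, label, vocab))
```

Since each file of this kind describes a yes/no decision, one possible way to handle several categories (an assumption on my part, not something the article prescribes) would be to train one classifier per category and rank the categories by classifier output.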

One example of a generative method, which the article compares against SVMs, is naive Bayes (NB), for which the authors use a free library called the "BOW Toolkit". The BOW toolkit is available at http://www.cs.cmu.edu/~mccallum/bow/ . Look toward the bottom of the page and you will see a program called "Rainbow", which is said to perform document classification.
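
To give a feel for how naive Bayes could cover the ranked-category requirement, here is a minimal Python sketch, written from scratch and not taken from the BOW toolkit. It is a multinomial naive Bayes with add-one smoothing. The class name RankedNaiveBayes, the rank_weights values, and the idea of handling a manual correction as one more training example are all my own assumptions, so treat it as a starting point only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class RankedNaiveBayes:
    """Multinomial naive Bayes that returns categories in ranked order."""

    def __init__(self, rank_weights=(1.0, 0.5, 0.25)):
        # Assumed weighting: how strongly a message counts toward its
        # 1st, 2nd and 3rd ranked category. Later ranks are ignored.
        self.rank_weights = rank_weights
        self.word_counts = defaultdict(Counter)  # category -> word -> weight
        self.category_weight = Counter()         # category -> message weight
        self.vocabulary = set()

    def train(self, text, ranked_categories):
        """Learn from one message and its manually ranked categories."""
        tokens = tokenize(text)
        self.vocabulary.update(tokens)
        for category, weight in zip(ranked_categories, self.rank_weights):
            self.category_weight[category] += weight
            for token in tokens:
                self.word_counts[category][token] += weight

    def correct(self, text, right_ranking):
        """Treat a manual correction as one more training example."""
        self.train(text, right_ranking)

    def scores(self, text):
        """Return {category: log score}, using add-one smoothing."""
        tokens = tokenize(text)
        total = sum(self.category_weight.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, weight in self.category_weight.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + vocab_size
            log_score = math.log(weight / total)
            for token in tokens:
                log_score += math.log((counts[token] + 1.0) / denominator)
            result[category] = log_score
        return result

    def rank(self, text):
        """Return the known categories, best match first."""
        scored = self.scores(text)
        return sorted(scored, key=scored.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical categories and messages, for illustration only.
    clf = RankedNaiveBayes()
    clf.train("invoice payment overdue please pay", ["Billing", "Support"])
    clf.train("cannot log in password reset", ["Support", "Billing"])
    clf.train("payment failed on invoice", ["Billing"])
    print(clf.rank("password reset"))
    clf.correct("refund request", ["Support", "Billing"])
    print(clf.rank("refund request"))
```

The log scores from scores() could serve as the weights for the ordering. With real data you would probably want a stronger correction step than a single extra training example.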

Good luck
Mike

casasoft
October 12th, 2010, 08:14 AM
Thanks a lot for your help, will look into them! :)