-
March 26th, 2018, 03:46 AM
#1
Word stemming - help please
Hi guys, I am creating a text analyser that has to do the following:
1. Tokenize - parse the long text string into individual words
2. Remove the stop words
3. Perform stemming
My code so far:
Code:
package searc_engine;

import javax.swing.JOptionPane;

public class TextAnalyser {

    // Input could later come from: JOptionPane.showInputDialog(null, "Type your input");
    public static void main(String[] args) {
        String myString = "I was so happpy but innocent they said ok when i asked"; // sample input

        // Stop words, written as a regex alternation
        String stopWords = "I|its|with|but|a|and|be|if|in|it|of|on|or|so|the|they|there|this|which|why";

        // (?i) ignores case; \\b matches whole words only, so "in" is not cut out of "innocent"
        String afterStopWords = myString.replaceAll("(?i)\\b(" + stopWords + ")\\b\\s*", "");

        // Lowercase once, before splitting, so every token is lowercase
        afterStopWords = afterStopWords.toLowerCase().trim();

        String delimiter = "\\s+"; // where to split the string up: at every run of whitespace
        String[] words = afterStopWords.split(delimiter); // array holding each individual word

        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
    }
}
But my problem is that I can't do the 3rd part, the stemmer, which according to the assignment should:
Perform stemming on the terms: Words having the same stem are usually assumed
to have similar meaning. A typical example of a stem is the word “connect”, which is
the stem for the variants “connected”, “connecting”, “connection” and “connections”.
In order to improve the recall of the search (i.e., to get relevant documents which don't
contain the exact words as specified in the query), stemming is performed to remove
the affixes. For example, the words 'rides' and 'riding' would both be stemmed to 'ride'.
In the first case this involves the removal of the end character 's'. In the second case
this involves the removal of the characters 'ing' and the addition of the character 'e'.
Porter's algorithm is a well-known stemming algorithm. You may refer to Porter's
algorithm for stemming. You need to implement your own version of the algorithm.
Here you are required to remove the end character 's'. You can certainly implement
more rules for stemming.
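One possible way to read the two rules described above (strip a plural 's'; strip 'ing' and restore 'e') is sketched here. This is only a rough sketch, not Porter's algorithm: the class name SimpleStemmer, the method stem, and the "short consonant-vowel-consonant stem" check used to decide when to add the 'e' are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical helper; the class name and rule set are illustrative, not Porter's full algorithm.
class SimpleStemmer {

    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // True for a three-letter consonant-vowel-consonant stem such as "rid" or "hop".
    private static boolean isShortCvc(String s) {
        return s.length() == 3
                && !isVowel(s.charAt(0))
                && isVowel(s.charAt(1))
                && !isVowel(s.charAt(2))
                && "wxy".indexOf(s.charAt(2)) < 0;
    }

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);

        // Rule 1: strip a plural "s", but keep "ss" ("glass"), "us" ("bus") and "is" ("this").
        if (w.length() > 3 && w.endsWith("s")
                && !w.endsWith("ss") && !w.endsWith("us") && !w.endsWith("is")) {
            return w.substring(0, w.length() - 1);
        }

        // Rule 2: strip "ing" when at least three letters remain; restore "e" for short stems.
        if (w.length() > 5 && w.endsWith("ing")) {
            String base = w.substring(0, w.length() - 3);
            return isShortCvc(base) ? base + "e" : base;
        }
        return w;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"cats", "dogs", "rides", "riding", "connecting", "glass"}) {
            System.out.println(s + " -> " + stem(s));
        }
    }
}
```

With these rules "rides" and "riding" both come out as "ride", and "connecting" comes out as "connect". Words like "connected" or "connection" are left unchanged, so further suffix rules would be needed to cover them.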
For example, a document CatDog.txt such as:
Cat Dog
The cats and dogs sat in the dog-basket.
will generate the following output:
[cat,dog,cat,dog,sat,dog,basket]
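To make the CatDog.txt example concrete, here is a minimal sketch of the three steps chained together (tokenize, remove stop words, stem). It assumes the document text is already in a String rather than read from a file, that tokens are split on anything that is not a letter, and that only the required "remove the end character 's'" rule is applied. The class name CatDogPipeline and the method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical class; names are illustrative and not part of the assignment.
class CatDogPipeline {

    // Stop words taken from the list in the code earlier in this post.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "i", "its", "with", "but", "a", "and", "be", "if", "in", "it", "of", "on",
            "or", "so", "the", "they", "there", "this", "which", "why"));

    // Minimal stemmer: only removes a trailing "s" (but not "ss") from words longer than 3 letters.
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    static List<String> analyse(String text) {
        List<String> terms = new ArrayList<>();
        // Tokenize: lowercase, then split on anything that is not a letter a-z.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty tokens and stop words
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        String document = "Cat Dog\nThe cats and dogs sat in the dog-basket.";
        System.out.println(analyse(document));
        // prints [cat, dog, cat, dog, sat, dog, basket]
    }
}
```

Using a Set for the stop words instead of one big regex avoids the partial-match problem (for example "in" inside "innocent"), because each token is compared as a whole word.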
Can anyone help, please?
-
March 26th, 2018, 03:50 AM
#2
Re: Word stemming - help please
Better format of the code above:
Code:
package searc_engine;

import javax.swing.JOptionPane;

public class TextAnalyser {

    // Input could later come from: JOptionPane.showInputDialog(null, "Type your input");
    public static void main(String[] args) {
        String myString = "I was so happpy but innocent they said ok when i asked"; // sample input

        // Stop words, written as a regex alternation
        String stopWords = "I|its|with|but|a|and|be|if|in|it|of|on|or|so|the|they|there|this|which|why";

        // (?i) ignores case; \\b matches whole words only, so "in" is not cut out of "innocent"
        String afterStopWords = myString.replaceAll("(?i)\\b(" + stopWords + ")\\b\\s*", "");

        // Lowercase once, before splitting, so every token is lowercase
        afterStopWords = afterStopWords.toLowerCase().trim();

        String delimiter = "\\s+"; // where to split the string up: at every run of whitespace
        String[] words = afterStopWords.split(delimiter); // array holding each individual word

        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
    }
}