I have a text file to read from, and I need to retrieve the ID (AAW192.21.313) and the text enclosed between the <TEXT> and </TEXT> tags. How do I go about doing this using the Scanner class?
Text file format:
<DOC>
<ID> AAW192.21.313 </ID>
<TXTTYPE> Hweet </TXTTYPE>
<TEXT>
Skeeter Bronson (Adam Sandler) is a hotel handyman who was promised by his father to be the manager of the family hotel. Barry Nottingham agreed to keep that promise when the Bronson family sold their hotel to him - then built a new hotel instead.
</TEXT>
You could go through the file character by character: when you reach a '<', check whether the next character is 't', and so on. Once you have recognized "<text>", start storing the characters that follow (maybe set a boolean like textFound) and keep looking for a '<'. When "</text>" is found, discard the last 7 characters (the "</text>" itself) from the end of your stored string. Alternatively, store the characters after each '<' in a separate string so you don't have to delete them later; if that string turns out not to be "/text", add it to the stored string.
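A rough sketch of that character-by-character idea, in case it helps. The class name CharTagScanner and the method extract are made up for illustration, and it assumes the tags are written exactly as in your file (upper case, no attributes). It checks for the closing tag before storing a '<', so nothing has to be deleted afterwards.

```java
class CharTagScanner {
    /** Returns the trimmed text between <tag> and </tag>, or null if not found. */
    static String extract(String input, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        StringBuilder stored = new StringBuilder();
        boolean found = false; // plays the role of the "textFound" flag

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!found) {
                // Still looking for the opening tag.
                if (c == '<' && input.startsWith(open, i)) {
                    found = true;
                    i += open.length() - 1; // skip the rest of the opening tag
                }
            } else if (c == '<' && input.startsWith(close, i)) {
                // Closing tag reached: everything stored so far is the content.
                return stored.toString().trim();
            } else {
                // Ordinary character (or a '<' that is not the closing tag).
                stored.append(c);
            }
        }
        return null; // opening or closing tag missing
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<ID> AAW192.21.313 </ID>\n<TEXT>\nSome text here.\n</TEXT>\n";
        System.out.println(extract(doc, "ID"));
        System.out.println(extract(doc, "TEXT"));
    }
}
```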
You could also do a Google search for working with regular expressions in Java. It is not exactly simple, but it will be helpful beyond the project you are working on now.
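To give an idea of what the regular expression route might look like, here is a minimal sketch using java.util.regex. The class and method names (TagRegexDemo, extractId, extractText) are only illustrative, and it assumes the whole file has already been read into one String and that each tag occurs once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexDemo {
    // DOTALL lets '.' match line breaks, so the TEXT body may span several lines.
    // The lazy group (.*?) stops at the first closing tag.
    private static final Pattern ID =
            Pattern.compile("<ID>\\s*(.*?)\\s*</ID>", Pattern.DOTALL);
    private static final Pattern TEXT =
            Pattern.compile("<TEXT>\\s*(.*?)\\s*</TEXT>", Pattern.DOTALL);

    /** Returns the first captured group, or null if the pattern does not match. */
    private static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    static String extractId(String input) {
        return firstMatch(ID, input);
    }

    static String extractText(String input) {
        return firstMatch(TEXT, input);
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<ID> AAW192.21.313 </ID>\n<TXTTYPE> Hweet </TXTTYPE>\n"
                + "<TEXT>\nSkeeter Bronson is a hotel handyman.\n</TEXT>\n";
        System.out.println(extractId(doc));
        System.out.println(extractText(doc));
    }
}
```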
You'd probably be better off parsing this as an XML file and handling it that way, but if you have to do it using the Scanner class, do the following:
1. Read one line at a time
2. Check the line to see if it contains either '<ID>' or '<TEXT>'. If not, go back to step 1.
3. Remove the opening tag from the line.
4. Check for the appropriate closing tag. If it isn't on this line, keep reading lines until you find it, concatenating the lines as you read them in.
5. Remove the closing tag from the line.
You should now have one string containing the text between the tags.
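The steps above could be sketched roughly like this. ScannerTagReader and readTag are hypothetical names, and the sketch assumes the ID tag comes before the TEXT tag in the file, as in your sample, because the Scanner only moves forward. It reads from a String here so it runs on its own.

```java
import java.util.Scanner;

class ScannerTagReader {
    /**
     * Reads lines from the scanner until the content between <tag> and </tag>
     * has been collected. Returns it trimmed, or null if the opening tag never appears.
     */
    static String readTag(Scanner in, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";

        while (in.hasNextLine()) {
            String line = in.nextLine();              // step 1: read one line
            int start = line.indexOf(open);
            if (start < 0) {
                continue;                             // step 2: no opening tag, next line
            }
            // step 3: drop the opening tag and anything before it
            StringBuilder content = new StringBuilder(line.substring(start + open.length()));

            // step 4: keep appending lines until the closing tag shows up
            int end = content.indexOf(close);
            while (end < 0 && in.hasNextLine()) {
                content.append('\n').append(in.nextLine());
                end = content.indexOf(close);
            }
            // step 5: drop the closing tag and anything after it
            if (end >= 0) {
                content.setLength(end);
            }
            return content.toString().trim();
        }
        return null;
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<ID> AAW192.21.313 </ID>\n<TXTTYPE> Hweet </TXTTYPE>\n"
                + "<TEXT>\nSkeeter Bronson is a hotel handyman.\n</TEXT>\n";
        Scanner in = new Scanner(doc);
        System.out.println(readTag(in, "ID"));
        System.out.println(readTag(in, "TEXT"));
    }
}
```

For a real file you would construct the Scanner from a File instead of a String and handle the FileNotFoundException.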
If you can ensure that the 'text' file will be well-formed XML (which it isn't at the moment, since the DOC element has no closing tag), then you can also use something like the SAX or DOM parsers, which will parse the elements out for you.
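For what it's worth, a DOM version might look something like the sketch below. It assumes the closing </DOC> tag has been added and that the file holds a single DOC element; DomTagReader and parse are names invented for the example. The parser classes themselves ship with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomTagReader {
    /** Returns {id, text} from a well-formed document containing ID and TEXT elements. */
    static String[] parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() returns the text inside the element; trim() drops
            // the surrounding spaces and line breaks.
            String id = doc.getElementsByTagName("ID").item(0).getTextContent().trim();
            String text = doc.getElementsByTagName("TEXT").item(0).getTextContent().trim();
            return new String[] { id, text };
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed input (for example a missing </DOC>) ends up here.
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the closing </DOC> tag has been added to make this well-formed.
        String xml = "<DOC>\n<ID> AAW192.21.313 </ID>\n<TXTTYPE> Hweet </TXTTYPE>\n"
                + "<TEXT>\nSkeeter Bronson is a hotel handyman.\n</TEXT>\n</DOC>";
        String[] result = parse(xml);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```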