  1. #1
    Join Date
    Jan 2011
    Posts
    2

    HTML table to 2D array parser

    Hello everyone!
    The problem is to parse an HTML table into a 2D array.
    The parser should handle rowspans, colspans, nested tables and so on, that is, everything the HTML standard allows. The tables I need to parse are generated automatically by another program, so they are rather complex and verbose.
    So far I have found solutions in PHP (JS_Extractor http://jacksleight.com/old/blog/2008...able-extractor) and Java (http://simbot.wordpress.com/2006/09/...-table-parser/), but it seems they don't meet my requirements.
    In addition, I found JWebUnit and HTTPUnit, but I couldn't work out how to use them for my purposes (and I am not even sure that it is possible).
    I'd be grateful for any help, as I can't believe this problem has not been solved yet!
    Thanks in advance!

  2. #2
    Join Date
    Jan 2011
    Posts
    2

    Re: HTML table to 2D array parser

    1. The program must be written in Java;
    2. If the table contains spans, the value of a spanned cell should be put into the array only once, in the top-left corner of the area the cell covers; the other array cells corresponding to the spanned cell in the HTML table should be null (this may be hard to follow, so see the example below).
    So JS_Extractor is not suitable, as it is written in PHP and doesn't handle nested tables. The Java HTML table parser Simbiosis doesn't even handle spans.
    Today I tried HTTPUnit, but the results are disappointing too. Simple tables are parsed correctly, but complex ones are not.
    For example, here is the table code:
    HTML Code:
    <html>
    	<body>
    	<table   border="2" width="20%" height="20%">
    		<tr bgcolor="red">
    			<td colspan="2" rowspan="2">
    				<span>1</span>
    			</td>
    			<td>
    				<span>2</span>
    			</td>
    			<td>
    				<span>3</span></td>
    			<td>
    				<span>4.1</span>
    			</td>
    			<td>
    				<span>5.1</span>
    			</td>
    			<td>
    				<span>6 last</span>
    			</td>
    		</tr>
    		<tr bgcolor="green">
    			<td rowspan="2">
    				<span>1</span>
    			</td>
    			<td>
    				<span>2.4x</span>
    			</td>
    			<td>
    				<span>3.3x</span>
    			</td>
    			<td>
    				<span>4</span>
    			</td>
    			<td>
    				<span>5 last</span>
    			</td>
    		</tr>
    		<tr bgcolor="ffcc00">
    			<td>
    				<span>1x</span>
    			</td>
    			<td>
    				<span>2</span>
    			</td>
    			<td>
    				<span>3</span>
    			</td>
    			<td>
    				<span>4</span>
    			</td>
    			<td>
    				<span>5.8</span>
    			</td>
    			<td>
    				<span>6 last</span>
    			</td>
    		</tr>
    		<tr bgcolor="yellow">
    			<td><span>1</span></td>
    			<td><span>2</span></td>
    			<td><span>3</span></td>
    			<td><span>4</span></td>
    			<td><span>5</span></td>
    			<td><span>6</span></td>
    			<td><span>7 last</span></td>
    		</tr>
    	</table>
    	</body>
    </html>
    This is the table as rendered in Chrome:

    The array I want to get as a result:

    Here you can see what I meant in requirement 2: the "1" from the first spanned cell and the "1" from the second are placed in the top-left corner of their spanned areas, while the remaining cells in those areas are null.
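    The placement rule from requirement 2 could be sketched in Java roughly as follows. This is only a minimal sketch under assumptions: it does not parse HTML at all, it expects some parser to have already produced each cell's text, rowspan and colspan, and it ignores nested tables. The names SpanGrid, Cell, layout, td and threadExample are made up for illustration and do not come from HTTPUnit or any other library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of HTTPUnit or any library.
final class SpanGrid {

    /** One table cell as delivered by whatever HTML parser is used (assumed). */
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = text;
            // Simplification: span values below 1 are treated as 1.
            this.rowspan = Math.max(1, rowspan);
            this.colspan = Math.max(1, colspan);
        }
    }

    /** Places every cell on a rectangular grid; covered slots stay null. */
    static String[][] layout(List<List<Cell>> rows) {
        int height = rows.size();
        int maxCols = 0; // upper bound on the width: the sum of all colspans
        for (List<Cell> row : rows) {
            for (Cell cell : row) {
                maxCols += cell.colspan;
            }
        }
        String[][] values = new String[height][maxCols];
        boolean[][] taken = new boolean[height][maxCols];
        int width = 0;

        for (int r = 0; r < height; r++) {
            int c = 0;
            for (Cell cell : rows.get(r)) {
                // Skip slots already claimed by a rowspan from a row above.
                while (taken[r][c]) {
                    c++;
                }
                values[r][c] = cell.text; // the value goes to the top-left slot only
                int rowEnd = Math.min(height, r + cell.rowspan); // clip at the last row
                for (int rr = r; rr < rowEnd; rr++) {
                    for (int cc = c; cc < c + cell.colspan; cc++) {
                        taken[rr][cc] = true;
                    }
                }
                c += cell.colspan;
                width = Math.max(width, c);
            }
        }

        // Trim every row to the width that was actually used.
        String[][] grid = new String[height][];
        for (int r = 0; r < height; r++) {
            grid[r] = Arrays.copyOf(values[r], width);
        }
        return grid;
    }

    static Cell td(String text) {
        return new Cell(text, 1, 1);
    }

    /** The four rows of the table posted in this thread, entered by hand. */
    static List<List<Cell>> threadExample() {
        List<List<Cell>> rows = new ArrayList<>();
        rows.add(Arrays.asList(new Cell("1", 2, 2), td("2"), td("3"), td("4.1"), td("5.1"), td("6 last")));
        rows.add(Arrays.asList(new Cell("1", 2, 1), td("2.4x"), td("3.3x"), td("4"), td("5 last")));
        rows.add(Arrays.asList(td("1x"), td("2"), td("3"), td("4"), td("5.8"), td("6 last")));
        rows.add(Arrays.asList(td("1"), td("2"), td("3"), td("4"), td("5"), td("6"), td("7 last")));
        return rows;
    }

    public static void main(String[] args) {
        for (String[] line : layout(threadExample())) {
            System.out.println(Arrays.toString(line));
        }
    }
}
```

    For the table above, this sketch produces a 4 x 7 grid. The third row comes out as [1x, 2, null, 3, 4, 5.8, 6 last], because the third column is still covered by the rowspan that starts in the second row.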
    The result given by HTTPUnit:

    As you can see, even if we drop requirement 2, there is an error in the third row here.
    And this is a rather simple example without nested tables; with them the result is badly wrong.
