Thread: Manipulating the Wikipedia XML Dump file with C#

January 13th, 2009, 04:09 AM #1
alnds


Hi guys,

I urgently need to process the Wikipedia database dump, which is a single large XML file of approximately 4.7 GB. I need to extract individual XML files from it. The Wikipedia XML file has the following format:

=================================

<mediawiki xmlns="xxx" xmlns:xsi="xxx" xsi:schemaLocation="xxx" version="x" xml:lang="x">
<namespaces>
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
...
...
</namespaces>
<page>
<title>xxxxxx</title>
<id>6</id>
<revision>
<id>xxxxxx</id>
<timestamp>2007-05-25T17:12:06Z</timestamp>
<contributor>
<username>xxxx</username>
<id>xxxx</id>
</contributor>
<text>xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
</text>
</revision>
</page>
<page>
...
...
...
</page>
<page>
...
...
...
</page>

=================================

What I need is to create individual XML files, one per page, selected by the page id element. I will specify the ids manually.

I have two questions regarding this.

1) Is C# the best way to go about processing such a large XML file? Is there any other way to process this file?

2) How do I create individual XML files by extracting the <page>.....</page> elements from the above file? How can I provide the list of page ids I need and extract individual XML files from this large file?

Any advice regarding these questions is much appreciated. Thanks in advance to anyone who is able to answer.

Cheers.
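
For reference, here is a minimal sketch of one way question 2 could be approached in C#. It assumes the dump is read with a forward-only XmlReader rather than loaded whole into an XmlDocument, which a 4.7 GB file would likely not survive. The class name PageExtractor, the method name Extract, and the output file name pattern page_<id>.xml are hypothetical names chosen for illustration. Elements are matched by local name because the root element declares a default namespace.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class PageExtractor
{
    // Streams the dump once and writes one file per wanted page id.
    // Returns the ids that were actually found.
    public static List<long> Extract(string dumpPath, ISet<long> wantedIds, string outputDir)
    {
        var found = new List<long>();
        Directory.CreateDirectory(outputDir);

        using (XmlReader reader = XmlReader.Create(dumpPath))
        {
            reader.MoveToContent();
            while (!reader.EOF && found.Count < wantedIds.Count)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "page")
                {
                    // Materializes only this one <page>; the reader is left
                    // on the node that follows </page>, so no extra Read() here.
                    var page = (XElement)XNode.ReadFrom(reader);

                    // First direct child named "id" is the page id
                    // (the revision id is nested one level deeper).
                    XElement idElement = page.Elements()
                        .FirstOrDefault(e => e.Name.LocalName == "id");

                    long id;
                    if (idElement != null
                        && long.TryParse(idElement.Value, out id)
                        && wantedIds.Contains(id))
                    {
                        page.Save(Path.Combine(outputDir, "page_" + id + ".xml"));
                        found.Add(id);
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return found;
    }
}
```

A call might look like PageExtractor.Extract(dumpPath, new HashSet<long> { 6 }, outputDir), where dumpPath and outputDir are whatever paths apply on your machine. Memory use stays roughly proportional to the largest single page, and the loop stops early once every requested id has been written.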
