January 13th, 2009, 04:09 AM
Manipulating the Wikipedia XML Dump file with C#
Hi guys,
I urgently need to process the Wikipedia database dump, which is a single large XML file of approximately 4.7 GB, and extract individual XML files from it. The Wikipedia XML file has the following format:
=================================
<mediawiki xmlns="xxx" xmlns:xsi="xxx" xsi:schemaLocation="xxx" version="x" xml:lang="x">
<namespaces>
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
...
...
</namespaces>
<page>
<title>xxxxxx</title>
<id>6</id>
<revision>
<id>xxxxxx</id>
<timestamp>2007-05-25T17:12:06Z</timestamp>
<contributor>
<username>xxxx</username>
<id>xxxx</id>
</contributor>
<text>xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
</text>
</revision>
</page>
<page>
...
...
...
</page>
<page>
...
...
...
</page>
...
</mediawiki>
=================================
What I need is to create an individual XML file for each page whose <id> element matches an id that I specify manually.
I have two questions regarding this.
1) Is C# the best way to process such a large XML file, or is there a better way to do it?
2) How do I create individual XML files by extracting the <page>...</page> elements from the above file? How can I supply a list of the page ids I need and have one XML file written for each of them from this large file?
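For question 2, one approach that might work is to stream the file with XmlReader instead of loading all 4.7 GB into memory, and to materialise only one <page> at a time. Below is a rough sketch of that idea, not something I have run against the real dump. The names PageExtractor and ExtractPages and the "<id>.xml" output naming are placeholders of my own, and it assumes that every <page> has its <id> as a direct child, as in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class PageExtractor
{
    // Streams the dump once and writes each wanted <page> to "<outputDir>/<id>.xml".
    // Returns the ids that were actually found, in document order.
    public static List<string> ExtractPages(
        string dumpPath, ICollection<string> wantedIds, string outputDir)
    {
        var found = new List<string>();
        Directory.CreateDirectory(outputDir);

        using (XmlReader reader = XmlReader.Create(dumpPath))
        {
            reader.MoveToContent(); // position on the root <mediawiki> element
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "page")
                {
                    // Loads only this <page> subtree. Afterwards the reader is
                    // already on the next node, so no extra Read() here,
                    // otherwise an adjacent <page> would be skipped.
                    var page = (XElement)XNode.ReadFrom(reader);

                    // The dump declares a default namespace, so reuse it for child lookups.
                    XNamespace ns = page.Name.Namespace;

                    // Element() looks at direct children only, so this is the
                    // page id and not the nested revision id.
                    string id = (string)page.Element(ns + "id");

                    if (id != null && wantedIds.Contains(id))
                    {
                        page.Save(Path.Combine(outputDir, id + ".xml"));
                        found.Add(id);
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return found;
    }
}
```

The wanted ids could be passed as a HashSet<string>, for example one built from a text file with one id per line via File.ReadAllLines, so that each lookup stays cheap however many ids are requested. Each page is loaded into memory on its own, so memory use should depend on the largest single page rather than on the size of the whole dump.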
Any advice regarding these questions is much appreciated. Thanks in advance to anyone who is able to answer.
Cheers.