alnds
January 13th, 2009, 03:09 AM
Hi guys,
I urgently need to process the Wikipedia database dump, which is a single large XML file of approximately 4.7 GB, and extract individual XML files from it. The Wikipedia XML file has the following format:
=================================
<mediawiki xmlns="xxx" xmlns:xsi="xxx" xsi:schemaLocation="xxx" version="x" xml:lang="x">
<namespaces>
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
...
...
</namespaces>
<page>
<title>xxxxxx</title>
<id>6</id>
<revision>
<id>xxxxxx</id>
<timestamp>2007-05-25T17:12:06Z</timestamp>
<contributor>
<username>xxxx</username>
<id>xxxx</id>
</contributor>
<text>xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
</text>
</revision>
</page>
<page>
...
...
...
</page>
<page>
...
...
...
</page>
=================================
What I need is to create an individual XML file for each page, selected by its page <id> element, for the ids that I specify manually.
I have two questions regarding this.
1) Is C# the best way to go about processing such a large XML file? Is there any other way to process this file?
2) How do I create individual XML files by extracting <page>.....</page> elements from the above file? How can I provide the list of page ids I need and extract individual XML files from this large file?
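To make question 2 concrete, here is a rough sketch of the kind of thing I am after. It is written in Python rather than C# only because it is quick to try. It assumes a streaming parse with the standard-library xml.etree.ElementTree.iterparse, so the 4.7 GB file is never loaded whole. The names extract_pages and local_name and the <id>.xml output naming are my own placeholders, and I have not run this against the real dump.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_pages(dump_path, wanted_ids, out_dir):
    """Stream through dump_path and write each wanted <page> to out_dir/<id>.xml.

    Returns the set of page ids (as strings) that were actually found.
    """
    wanted = {str(i) for i in wanted_ids}
    found = set()
    os.makedirs(out_dir, exist_ok=True)

    with open(dump_path, "rb") as source:
        context = ET.iterparse(source, events=("start", "end"))
        _, root = next(context)  # first event is the start of <mediawiki>

        for event, elem in context:
            if event != "end" or local_name(elem.tag) != "page":
                continue
            # The page id is the <id> that is a direct child of <page>;
            # <revision> and <contributor> carry their own <id> elements.
            page_id = None
            for child in elem:
                if local_name(child.tag) == "id":
                    page_id = (child.text or "").strip()
                    break
            if page_id in wanted:
                out_path = os.path.join(out_dir, page_id + ".xml")
                ET.ElementTree(elem).write(
                    out_path, encoding="utf-8", xml_declaration=True
                )
                found.add(page_id)
            root.clear()  # drop finished pages so memory stays bounded
            if found == wanted:
                break  # nothing left to look for
    return found
```

Usage would be something like extract_pages("dump.xml", [6, 25, 39], "pages"), where the file name and ids are made up. Because the dump declares a default namespace, ElementTree will write the extracted elements with a generated prefix such as ns0:, which is still valid XML but may not be what I want in the end.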
Any advice regarding these questions is much appreciated. Thanks in advance to anyone who is able to answer.
Cheers.