Manipulating the Wikipedia XML Dump file with C#

**alnds** · January 13th, 2009, 04:09 AM

Hi guys,

I urgently need to process the Wikipedia database dump, which is a single large XML file of approximately 4.7GB. I need to extract individual XML files from it. The Wikipedia XML file has the following format.

=================================

<mediawiki xmlns="xxx" xmlns:xsi="xxx" xsi:schemaLocation="xxx" version="x" xml:lang="x">
<namespaces>
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
...
...
</namespaces>
<page>
<title>xxxxxx</title>
<id>6</id>
<revision>
<id>xxxxxx</id>
<timestamp>2007-05-25T17:12:06Z</timestamp>
<contributor>
<username>xxxx</username>
<id>xxxx</id>
</contributor>
<text>xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
</text>
</revision>
</page>
<page>
...
...
...
</page>
<page>
...
...
...
</page>

=================================

What I need is to create an individual XML file for each page I am interested in, selected by its page id element, where I specify the ids manually.

I have two questions regarding this.

1) Is C# the best way to go about processing such a large XML file? Is there any other way to process this file?

2) How do I create individual XML files by extracting <page>.....</page> elements from the above file? How can I provide the list of page ids I need and extract individual XML files from this large file?

Any advice regarding these questions is much appreciated. Thanks in advance to anyone who is able to answer.

Cheers.

**darwen** · January 13th, 2009, 07:07 AM

1) You can read the XML file with a SAX-style streaming parser rather than DOM, so the whole 4.7GB document is never held in memory at once. See the XmlTextReader family of classes. I'd say C# is as good as any other language for this problem.

2) Once you've identified the XML you want to extract, you can use XmlTextWriter to output the data to another XML file without the memory overhead you would get if you created a separate XmlDocument, added nodes to it and then saved it at the end.
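
As a rough illustration of the two points above, here is a minimal sketch, assuming the dump has the layout shown in the first post (each <page> has a direct <id> child). The class name PageExtractor, the method ExtractPages and its parameters are hypothetical names made up for this example. It also swaps in XNode.ReadFrom from System.Xml.Linq (available in .NET 3.5) to hold one <page> at a time in memory, because the page <id> comes after <title> and a forward-only reader cannot go back; a pure XmlTextWriter version would need some other way to buffer the start of each page.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of any library.
public static class PageExtractor
{
    // Streams through the dump and saves each requested <page> as
    // "<outputDir>/<id>.xml". Returns the ids that were actually found.
    public static List<string> ExtractPages(
        string dumpPath, IEnumerable<string> wantedIds, string outputDir)
    {
        var remaining = new HashSet<string>(wantedIds);
        var found = new List<string>();
        Directory.CreateDirectory(outputDir);

        using (XmlReader reader = XmlReader.Create(dumpPath))
        {
            // Stop early once every requested page has been written.
            while (!reader.EOF && remaining.Count > 0)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.LocalName == "page")
                {
                    // ReadFrom loads only this one <page> and leaves the
                    // reader on the node after </page>, so we must not
                    // call Read() again here or we would skip a node.
                    var page = (XElement)XNode.ReadFrom(reader);

                    // Direct child <id> is the page id; the revision and
                    // contributor ids are nested deeper and are ignored.
                    XElement idElement =
                        page.Element(page.Name.Namespace + "id");
                    string id = idElement == null
                        ? null
                        : idElement.Value.Trim();

                    if (id != null && remaining.Remove(id))
                    {
                        page.Save(Path.Combine(outputDir, id + ".xml"));
                        found.Add(id);
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return found;
    }
}
```

A call might look like PageExtractor.ExtractPages(@"C:\dumps\wiki.xml", new[] { "6", "12" }, "pages"), where the dump path and the ids are placeholders. Memory use stays at roughly the size of the largest single page rather than the size of the whole dump.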

Darwen.

**alnds** · January 13th, 2009, 07:46 AM

Thanks darwen,

I will try them out and post the results here.

Update:
I just saw that SAX has not been released for .NET 3.5. I am running .NET 3.5 on my machine. So would it be a problem if I go ahead with the 2.0 version?

**darwen** · January 13th, 2009, 08:18 AM

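Where did you see that? The XmlTextReader/XmlTextWriter classes are built into the .NET Framework and do SAX-style streaming as far as I'm aware. Strictly speaking XmlTextReader is a forward-only pull parser rather than SAX's push callbacks, so you don't need a separate SAX library.

Or at least they don't read the whole file into memory the way the DOM model does.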

And since .NET 3.5 is backward compatible with .NET 2.0, there's no problem with you using these classes.

Darwen.
