-
January 13th, 2009, 04:09 AM
#1
Manipulating the Wikipedia XML Dump file with C#
Hi guys,
I urgently need to process the Wikipedia database dump, which is a single large XML file of approximately 4.7 GB, and extract individual XML files from it. The Wikipedia XML file has the following format:
=================================
<mediawiki xmlns="xxx" xmlns:xsi="xxx" xsi:schemaLocation="xxx" version="x" xml:lang="x">
<namespaces>
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
...
...
</namespaces>
<page>
<title>xxxxxx</title>
<id>6</id>
<revision>
<id>xxxxxx</id>
<timestamp>2007-05-25T17:12:06Z</timestamp>
<contributor>
<username>xxxx</username>
<id>xxxx</id>
</contributor>
<text>xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
</text>
</revision>
</page>
<page>
...
...
...
</page>
<page>
...
...
...
</page>
</mediawiki>
=================================
What I need is to create an individual XML file for each page whose <id> element matches an id I specify manually.
I have two questions regarding this.
1) Is C# the best way to process such a large XML file? Is there any other way to process this file?
2) How do I create individual XML files by extracting <page>.....</page> elements from the above file? How can I provide the list of page ids I need and extract individual XML files from this large file?
Any advice regarding these questions is much appreciated. Thanks in advance to anyone who is able to answer.
Cheers.
-
January 13th, 2009, 07:07 AM
#2
Re: Manipulating the Wikipedia XML Dump file with C#
1) You can use a streaming, SAX-style parser rather than DOM to read the XML file. See the XmlTextReader family of classes (strictly speaking these are forward-only pull parsers rather than SAX, but they likewise avoid loading the whole document into memory). I'd say C# is as good as any other language for this problem.
2) Once you've identified the XML you want to extract, you can use XmlTextWriter to output the data to another XML file without the memory overhead you would get if you created a separate XmlDocument, added nodes to it and then saved it at the end.
Darwen.
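To make the two points above concrete, here is a minimal sketch of one way to do it. It is not the only approach, and it swaps in a small variation: rather than piping nodes straight into an XmlTextWriter, it buffers one <page> at a time as an XElement (System.Xml.Linq, available from .NET 3.5), because the page <id> has to be read before deciding whether to write the page. The names PageExtractor, ExtractPages and outputDir, and the <id>.xml file naming, are assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the class and method names are illustrative only.
public static class PageExtractor
{
    // Streams through the dump and saves every <page> whose own <id>
    // (the direct child, not the revision or contributor id) is in wantedIds.
    // Each match is written to outputDir/<id>.xml. Returns the number written.
    public static int ExtractPages(string dumpPath, ICollection<string> wantedIds, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        int written = 0;

        using (XmlReader reader = XmlReader.Create(dumpPath))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "page")
                {
                    // ReadFrom loads just this one <page> and leaves the reader
                    // on the node that follows it, so we must not call Read()
                    // here or an adjacent <page> would be skipped.
                    XElement page = (XElement)XNode.ReadFrom(reader);

                    string id = null;
                    foreach (XElement child in page.Elements())
                    {
                        // Compare local names so the dump's default namespace
                        // does not need to be hard-coded.
                        if (child.Name.LocalName == "id")
                        {
                            id = child.Value.Trim();
                            break;
                        }
                    }

                    if (id != null && wantedIds.Contains(id))
                    {
                        page.Save(Path.Combine(outputDir, id + ".xml"));
                        written++;
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return written;
    }
}
```

You would call it with something like PageExtractor.ExtractPages(dumpPath, new HashSet<string> { "6" }, outputDir). Only one page is held in memory at any moment, so the 4.7 GB file size should not matter much, although a single pass over the whole file will still take a while.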
-
January 13th, 2009, 07:46 AM
#3
Re: Manipulating the Wikipedia XML Dump file with C#
Thanks darwen,
I will try them out and post the results here.
Update:
I just saw that SAX has not been released for .NET 3.5, and I am running .NET 3.5 on my machine. Would it be a problem if I go ahead with the 2.0 version?
Last edited by alnds; January 13th, 2009 at 07:58 AM.
Reason: Update
-
January 13th, 2009, 08:18 AM
#4
Re: Manipulating the Wikipedia XML Dump file with C#
Where did you see that? The XmlTextReader/XmlTextWriter classes work in a streaming, SAX-like way as far as I'm aware.
Or at least they don't read the whole file into memory the way the DOM model does.
And since .NET 3.5 is backward compatible with .NET 2.0, there's no problem with you using these classes.
Darwen.