  1. #1
    Join Date
    Jan 2009
    Posts
    3

    Manipulating the Wikipedia XML Dump file with C#

    Hi guys,

    I urgently need to process the Wikipedia database dump, which is a single large XML file of approximately 4.7GB. I need to extract individual XML files from it. The Wikipedia XML file has the following format.

    =================================

    <mediawiki xmlns="xxx" xmlns:xsi="xxx" xsi:schemaLocation="xxx" version="x" xml:lang="x">
    <namespaces>
    <namespace key="1">Talk</namespace>
    <namespace key="2">User</namespace>
    ...
    ...
    </namespaces>
    <page>
    <title>xxxxxx</title>
    <id>6</id>
    <revision>
    <id>xxxxxx</id>
    <timestamp>2007-05-25T17:12:06Z</timestamp>
    <contributor>
    <username>xxxx</username>
    <id>xxxx</id>
    </contributor>
    <text>xxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxx
    </text>
    </revision>
    </page>
    <page>
    ...
    ...
    ...
    </page>
    <page>
    ...
    ...
    ...
    </page>

    =================================

    What I need is to create an individual XML file for each page whose page id element matches an id I specify manually.

    I have two questions regarding this.

    1) Is C# the best way to process such a large XML file? Is there any other way to process this file?

    2) How do I create individual XML files by extracting <page>.....</page> elements from the above file? How can I provide the list of page ids I need and extract individual XML files from this large file?

    Any advice regarding these questions is much appreciated. Thanks in advance to anyone who is able to answer.

    Cheers.

  2. #2
    Join Date
    Jan 2002
    Location
    Scaro, UK
    Posts
    5,940

    Re: Manipulating the Wikipedia XML Dump file with C#

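    1) You can use a streaming, SAX-style parser rather than DOM to read the XML file. See the XmlTextReader family of classes: strictly speaking they are forward-only pull parsers rather than SAX, but like SAX they never load the whole document into memory. I'd say C# is as good as any other language for this problem.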

    2) Once you've identified the XML you want to extract, you can use XmlTextWriter to output the data to another XML file without the memory overhead you would get from creating a separate XmlDocument, adding nodes to it and saving it at the end.
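
    A minimal sketch of the two steps above, not a definitive implementation. The names PageExtractor, ExtractPages, wantedIds and outputDir are hypothetical. It uses XmlReader.Create, the factory that wraps the same streaming reader as XmlTextReader, and it assumes <id> is a direct child of <page> as in the format posted. Because <id> comes after <title>, the sketch buffers one <page> at a time with XmlDocument.ReadNode instead of writing in a pure stream, so memory use is bounded by the largest single page rather than by the 4.7GB file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PageExtractor
{
    // Streams the dump at dumpPath and writes every <page> whose direct
    // child <id> is listed in wantedIds to outputDir/<id>.xml.
    // Returns the number of pages written.
    public static int ExtractPages(string dumpPath, ICollection<string> wantedIds, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        int written = 0;

        // Used only as a factory for per-page nodes; nothing is appended to
        // it, so each page can be garbage collected after it is handled.
        XmlDocument scratch = new XmlDocument();

        using (XmlReader reader = XmlReader.Create(dumpPath))
        {
            reader.MoveToContent();

            // Stop early once every requested id has been found.
            while (!reader.EOF && written < wantedIds.Count)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "page")
                {
                    // ReadNode consumes the whole <page> subtree and leaves the
                    // reader on the node after </page>, so no Read() call here.
                    XmlNode page = scratch.ReadNode(reader);

                    // The indexer looks at direct children only, so this is the
                    // page id and not the id inside <revision> or <contributor>.
                    XmlElement id = page["id"];
                    if (id != null && wantedIds.Contains(id.InnerText.Trim()))
                    {
                        string outPath = Path.Combine(outputDir, id.InnerText.Trim() + ".xml");
                        XmlWriterSettings settings = new XmlWriterSettings();
                        settings.Indent = true;
                        using (XmlWriter writer = XmlWriter.Create(outPath, settings))
                        {
                            page.WriteTo(writer);
                        }
                        written++;
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return written;
    }
}
```

    Calling PageExtractor.ExtractPages with the path to the dump, a list such as new List<string> { "6" } and an output folder should leave a 6.xml file in that folder. For a long list of ids, a HashSet<string> (available from .NET 3.5) would make the Contains lookup cheaper.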

    Darwen.
    www.pinvoker.com - PInvoker - the .NET PInvoke Interface Exporter for C++ Dlls.

  3. #3
    Join Date
    Jan 2009
    Posts
    3

    Re: Manipulating the Wikipedia XML Dump file with C#

    Thanks darwen,

    I will try them out and post the results here.

    Update:
    I just saw that SAX has not been released for .NET 3.5. I am running .NET 3.5 on my machine. Would it be a problem if I went ahead with the 2.0 version?
    Last edited by alnds; January 13th, 2009 at 07:58 AM. Reason: Update

  4. #4
    Join Date
    Jan 2002
    Location
    Scaro, UK
    Posts
    5,940

    Re: Manipulating the Wikipedia XML Dump file with C#

    Where did you see that? The XmlTextReader/XmlTextWriter classes do SAX-style streaming as far as I'm aware.

    Or at least they don't read the whole file into memory the way the DOM model does.

    And since .NET 3.5 is backward compatible with .NET 2.0, there's no problem with you using these classes.

    Darwen.
