
Thread: How to build a Wikipedia category graph without downloading everything

  1. #1
    Join Date
    Mar 2010
    Posts
    11

    How to build a Wikipedia category graph without downloading everything

    I would like to make a list of Wikipedia categories, and for every category the list of subcategories and page titles, like:

    ("Food" ("Fruit" "Grain" "Meat" "Vegetable"))
    ("Fruit" ("Apple" "Pear"))
    ("Grain" ("Wheat"))
    ("Meat" ("Sausage" "Steak"))
    ("Vegetable" ("Potato" "Sprout"))

    How can I make such a list?
    I've been looking at the data dumps, but the complete dump is too big for me, and the smaller dumps don't seem to contain enough information.
    There's a list of page titles and a list of category titles, but I can't see which page belongs to which category, or which category is a subcategory of which.
    Is there a data dump of a few MB that I can use to build this list?
    Or is such a list or graph available for download somewhere on the internet?

  2. #2
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: How to build a Wikipedia category graph without downloading everything

    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  3. #3
    Join Date
    Mar 2010
    Posts
    11

    Re: How to build a Wikipedia category graph without downloading everything

    I saw that already.
    It doesn't give the entire graph.
    But I searched for "largest connected component Wikipedia" and I found a 36 MB download.
    I hope this one will be OK.

  4. #4
    Join Date
    Mar 2010
    Posts
    11

    Re: How to build a Wikipedia category graph without downloading everything

    No, that didn't work.

    The first thing I tried was downloading enwiki-20100130-categorylinks.sql.gz.
    It couldn't be decompressed because the file was reported as 'corrupt'.
    I calculated its MD5: bde58e4f0b587628ec415d5aed8927aa.
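
    For anyone who wants to repeat this check, here is a minimal Python sketch. It hashes the download in chunks (so a large dump never has to fit in memory) and looks up the expected value in text laid out like the md5sums file, one "<hash> <filename>" pair per line. The function names md5_of_file and expected_md5 are my own, not part of any Wikimedia tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sums_text, filename):
    """Find filename in md5sums-style text; return its hash, or None."""
    for line in md5sums_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == filename:
            return parts[0]
    return None
```

    If md5_of_file("enwiki-20100130-categorylinks.sql.gz") differs from what expected_md5 returns for the same name, the local copy is not the file the checksum list describes.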

    When I compared it with the MD5 list on Wikipedia:
    download.wikimedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
    f20fee309f8a24990ebd1b44365756af enwiki-20100130-site_stats.sql.gz
    67306b6c4fcb212986dc370839052d19 enwiki-20100130-image.sql.gz
    4e08666f590d366513a671095666880c enwiki-20100130-oldimage.sql.gz
    85aa9c506312777e76e328c0616a3a47 enwiki-20100130-pagelinks.sql.gz
    8b5055115adbb19d96865113cc24c230 enwiki-20100130-categorylinks.sql.gz
    25f3220ea3e49c260d6d3879901a3306 enwiki-20100130-imagelinks.sql.gz
    8bb4ffafe5a1358034e89bb3127c18fc enwiki-20100130-templatelinks.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-externallinks.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-langlinks.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-interwiki.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-user_groups.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-category.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-page.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-page_restrictions.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-page_props.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-protected_titles.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-redirect.sql.gz
    e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-all-titles-in-ns0.gz
    7f81caa9975c8a4b3bb88e6a1a5cdfbf enwiki-20100130-abstract.xml
    788f875bc8f0e46fbdd7743bc30d90be enwiki-20100130-stub-meta-history.xml.gz
    c5a7c03c4e40639ef7eeca5aed298cc8 enwiki-20100130-stub-meta-current.xml.gz
    45e63b0380c6b411e3a902dc68905ed7 enwiki-20100130-stub-articles.xml.gz
    f078255570c07330edf2e2b68ef229cf enwiki-20100130-pages-articles.xml.bz2
    340e7233bcfa705bd0518a7d039b4fe2 enwiki-20100130-pages-meta-current.xml.bz2
    790e17f26f0a1101221b8fefff670abc enwiki-20100130-pages-logging.xml.gz
    65677bc275442c7579857cc26b355ded enwiki-20100130-pages-meta-history.xml.bz2
    da705eaf7a7e91bda803b256f3a8bf1b enwiki-20100130-pages-meta-history.xml.7z

    my checksum didn't match the one listed for categorylinks.sql.gz, and there also seemed to be something wrong with the list itself:
    many files have the same MD5: e32d799e86f8d2be5ebb3eed048238c7.
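
    Spotting the repeated checksum by eye is easy to miss in a long list. A small Python sketch like the one below groups file names by hash and reports every hash that occurs more than once. It assumes the same "<hash> <filename>" layout as the list above; duplicated_checksums is a name I made up for illustration.

```python
from collections import defaultdict


def duplicated_checksums(md5sums_text):
    """Map each hash that appears more than once to the files sharing it."""
    groups = defaultdict(list)
    for line in md5sums_text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            groups[parts[0]].append(parts[1])
    return {md5: names for md5, names in groups.items() if len(names) > 1}


sample = """\
8b5055115adbb19d96865113cc24c230 enwiki-20100130-categorylinks.sql.gz
e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-category.sql.gz
e32d799e86f8d2be5ebb3eed048238c7 enwiki-20100130-page.sql.gz
"""

for md5, names in duplicated_checksums(sample).items():
    print(md5, "is shared by", len(names), "files")
```

    Two different dump files having the same MD5 is not plausible, so any hash reported here points at a problem with the checksum list or with the files it was computed from.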

    I'm now trying to download a newer categorylinks.sql.gz.

  5. #5
    Join Date
    Mar 2010
    Posts
    11

    Re: How to build a Wikipedia category graph without downloading everything

    wiki.dbpedia.org/Downloads
    -> Articles Categories = 115 MB -> 1.7 GB = 12,000,000 records.
    -> Categories (Labels) = 7 MB -> 89 MB = 632,600 records.
    -> Categories (SKOS) = 18 MB -> 408 MB = 2,500,000 records.
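
    A rough Python sketch of how these files could be turned into the list from the first post. It rests on assumptions I have not verified against these exact downloads: that the files are N-Triples with one "<subject> <predicate> <object> ." per line, that Articles Categories links a page to its category with the dcterms "subject" predicate, and that Categories (SKOS) links a category to its parent with skos:broader. The names build_graph, to_sexpr and label are my own.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Assumed predicates and URI prefix; check them against the actual files.
SUBJECT = "http://purl.org/dc/terms/subject"
BROADER = "http://www.w3.org/2004/02/skos/core#broader"
RESOURCE = "http://dbpedia.org/resource/"

# Matches only triples whose object is a URI, so label literals are skipped.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def label(uri):
    """Turn a resource URI into a readable page or category title."""
    name = uri[len(RESOURCE):] if uri.startswith(RESOURCE) else uri
    if name.startswith("Category:"):
        name = name[len("Category:"):]
    return unquote(name).replace("_", " ")


def build_graph(lines):
    """Map each category to the set of its subcategories and pages."""
    children = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line)
        if match is None:
            continue
        subj, pred, obj = match.groups()
        if pred in (SUBJECT, BROADER):
            # In both cases the object is the containing category.
            children[label(obj)].add(label(subj))
    return children


def to_sexpr(children):
    """Format the graph like the example in the first post."""
    rows = []
    for parent in sorted(children):
        members = " ".join('"%s"' % m for m in sorted(children[parent]))
        rows.append('("%s" (%s))' % (parent, members))
    return rows
```

    build_graph accepts any iterable of lines, so a compressed download can be streamed with something like bz2.open(path, "rt", encoding="utf-8") instead of unpacking it first. The whole graph is still held in memory, which may be heavy for the 12,000,000-record file.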

  6. #6
    Join Date
    Mar 2011
    Posts
    1

    Re: How to build a Wikipedia category graph without downloading everything

    Thank you so much, you saved my life. I was struggling with enwiki-20110115-categorylinks.sql....

    Originally posted by Cliff Huylebroeck:
    wiki.dbpedia.org/Downloads
    -> Articles Categories = 115 MB -> 1.7 GB = 12,000,000 records.
    -> Categories (Labels) = 7 MB -> 89 MB = 632,600 records.
    -> Categories (SKOS) = 18 MB -> 408 MB = 2,500,000 records.
