MWM/Getting SpecialExport data with LWP

From Bjoern Hassler

This page shows how to fetch Special:Export pages with Perl and LWP.

1 Working out which pages to get

If you already have a list of pages, go straight to 'Getting the pages' below.

1.1 By Category

The UNESCO OER wiki runs an older MediaWiki install, so the pages in a category cannot be retrieved via the API and have to be supplied manually. One option is to use Perl with WWW::Mechanize to click the 'Add Pages in Category' button and then retrieve the result. I might add a recipe for this.

With more recent installs of MediaWiki, the list of pages within a category can be obtained via the API:

api.php?action=query&list=categorymembers&cmtitle=Category:Access2OER

This is easily accomplished using MediaWiki::API:

use strict;
use warnings;
use MediaWiki::API;

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = 'http://.../api.php';

# Report API errors and stop
$mw->{config}->{on_error} = \&on_error;

sub on_error {
    print "Error code: " . $mw->{error}->{code} . "\n";
    print $mw->{error}->{stacktrace} . "\n";
    die;
}

# Get a list of articles in the category
my $articles = $mw->list( {
    action  => 'query',
    list    => 'categorymembers',
    cmtitle => 'Category:Access2OER',
    cmlimit => 'max' } )
    || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

# and print the article titles
foreach (@{$articles}) {
    print "$_->{title}\n";
}

1.2 Page with dependencies

You can determine the templates in use on a particular page as follows:

api.php?action=query&prop=templates&titles=Main%20Page
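The same query can be made from Perl. The following is a minimal sketch, not a tested recipe: the helper name templates_on_page and the command-line handling are hypothetical, and it assumes the response layout that MediaWiki::API's api() method returns for this query (a 'pages' hash keyed by page ID, each entry carrying a 'templates' list). It does not follow query continuation, so a page using more templates than the server limit would be cut short.

```perl
use strict;
use warnings;

# Return the sorted, de-duplicated titles of all templates used on $title.
# $mw is a MediaWiki::API object (or anything with a compatible api() method).
sub templates_on_page {
    my ($mw, $title) = @_;
    my $result = $mw->api( {
        action  => 'query',
        prop    => 'templates',
        titles  => $title,
        tllimit => 'max' } )
        || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

    my %seen;
    for my $page ( values %{ $result->{query}->{pages} } ) {
        # Pages without templates have no 'templates' key
        $seen{ $_->{title} } = 1 for @{ $page->{templates} || [] };
    }
    return sort keys %seen;
}

# Hypothetical usage: perl templates.pl <api_url> <page title>
if (@ARGV == 2) {
    my ($api_url, $title) = @ARGV;
    require MediaWiki::API;    # loaded only when a query is actually made
    my $mw = MediaWiki::API->new();
    $mw->{config}->{api_url} = $api_url;
    print "$_\n" for templates_on_page($mw, $title);
}
```

The resulting titles can be appended to the list of pages to export, so that the templates travel with the page.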

1.3 Subpages

Subpages can be found via the API using apprefix. For example, to get all pages starting with 'Tutorials/' (i.e. proper subpages of Tutorials):

api.php?action=query&list=allpages&aplimit=100&apprefix=Tutorials/

The above query does not return the 'Tutorials' page itself, so you need to add it to the list separately.
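As a rough sketch, the subpage lookup could be done with MediaWiki::API in the same way as the category listing. The helper name page_with_subpages and the command-line handling are hypothetical, and the sketch assumes the pages live in the main namespace (the default for list=allpages).

```perl
use strict;
use warnings;

# Return the parent page title followed by the titles of all its subpages.
# $mw is a MediaWiki::API object (or anything with a compatible list() method).
sub page_with_subpages {
    my ($mw, $parent) = @_;
    my $subpages = $mw->list( {
        action   => 'query',
        list     => 'allpages',
        apprefix => "$parent/",
        aplimit  => 'max' } )
        || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

    # The allpages query does not return the parent itself, so add it here
    return ( $parent, map { $_->{title} } @{$subpages} );
}

# Hypothetical usage: perl subpages.pl <api_url> <parent page>
if (@ARGV == 2) {
    my ($api_url, $parent) = @ARGV;
    require MediaWiki::API;    # loaded only when a query is actually made
    my $mw = MediaWiki::API->new();
    $mw->{config}->{api_url} = $api_url;
    print "$_\n" for page_with_subpages($mw, $parent);
}
```

The output is one title per line, which is the form a Special:Export 'pages' field expects.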

2 Getting the pages

When you have your list of pages, the following script reads the page titles from standard input, one per line, and posts them to Special:Export:

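#!/path/to/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;

my $myurl = "http://oerwiki.iiep-unesco.org/index.php?title=Special:Export";

# Optional label for the output file name, taken from the first
# command-line argument (defaults to 'pages').
my $category = @ARGV ? $ARGV[0] : 'pages';

# Read the page titles, one per line, from STDIN
my $pages = '';
while (<STDIN>) {
    $pages .= $_;
}

my %formfields = (
    "pages"   => $pages,
    "curonly" => "true",
    "action"  => "submit",
    "submit"  => "Export"
    );

my $ua = LWP::UserAgent->new;

$ua->protocols_allowed( ['http'] );
my $page = $ua->request(POST $myurl, \%formfields);

# Timestamp for the file name, with slashes and the trailing newline removed
(my $date = `date`) =~ s/[\/\n]//g;

if ($page->is_success) {
    open(my $fh, '>', "Special Export $category $date.xml")
        or die "Cannot open output file: $!";
    print $fh $page->content;
    close $fh;
    print "Done.\n";
} else {
    print $page->message, "\n";
}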