GLAM/Newsletter/November 2012/Contents/Open Access report
The Open Access Media Importer at full speed; Publishers deliver inconsistent XML to PubMed Central; Importing from other sources
Open Access Media Importer
After being approved as a bot on Commons in late October, the Open Access Media Importer ran almost continuously throughout November. It scanned PubMed Central for suitably licensed scholarly articles with supplementary multimedia and imported these files into Wikimedia Commons, for a current total of well over 9,000. This raised the number of files under Category:Open access (publishing) and its subcategories to almost 14,000 by the end of the month.
The bot attempts to provide the files with categories based on keywords, subject categories or MeSH terms supplied by the journal, by PubMed Central or by PubMed for the corresponding article. This sometimes leads to miscategorization, often to overcategorization, and occasionally to no categories at all. At present, the files are spread over more than 20,000 categories, almost 10% of which had to be created for the occasion (e.g. for well over a hundred journals). Some of these categories (e.g. Caenorhabditis elegans or Green fluorescent proteins) are now filled with hundreds of files, which will eventually have to be distributed across more fine-grained categories. For many topics covered by PubMed Central (mostly biomedicine), there are thus now far more multimedia files available than the current Wikipedia entries (if they exist) can accommodate. For further examples, see actin cytoskeleton (on the English Wikipedia and on Commons), Gap junction (Wikipedia; Commons) or Woronin body (Wikipedia; Commons).
The review of the categorization of the files and of the new categories themselves continues. You can facilitate this process by checking out (thanks to the overburdened Toolserver) a few of them and adding or removing categories as appropriate. If you can think of wiki pages where these files could be useful, please add them there.
Bug fixes continued but shifted in focus from providing functionality to minimizing the effects of inconsistent and incorrect metadata available from PubMed Central.
Metadata at PubMed Central
The most prominent issue with the XML is that of incorrect or self-contradicting licensing statements. While this had been noticed already in spring (e.g. "licensed under a Creative Commons Attr0ibution 3.0 License (by-nc 3.0)" - yes, with a typo on top of it, in an article from Orthopedic Reviews, published by PagePress), actually deploying the bot to larger parts of the database made it clear that the phenomenon is rather common and not restricted to small and lesser known publishers.
The Open Access Media Importer analyzes the XML of articles stored in PubMed Central's Open Access subset. That XML is supplied by the individual journals or publishers, which results in a plethora of individual styles that may or may not adhere closely to the specifications of the National Library of Medicine's Document Type Definition (now named JATS).
Besides the two journals highlighted in the figures, other journals affected by contradictory license statements include Evolutionary Applications (Wiley-Blackwell), Traffic (Copenhagen, Denmark) (Wiley-Blackwell), Cellular Microbiology (Wiley-Blackwell), Cytotherapy (Informa), The American Journal of Tropical Medicine and Hygiene (American Society of Tropical Medicine and Hygiene), The FEBS Journal (Wiley-Blackwell), Hepatology (Baltimore, Md.) (Wiley-Blackwell), Journal of Cellular Physiology (Wiley-Blackwell) and Database: The Journal of Biological Databases and Curation (Oxford University Press). At the Journal of Neurochemistry (Wiley-Blackwell), the self-contradictory notice "Re-use of this article is permitted in accordance with the Creative Commons Deed, Attribution 2.5, which does not permit commercial exploitation." is even displayed directly on the article's page. The closest match to the term "Creative Commons Deed, Attribution 2.5" would be the Creative Commons Attribution 2.5 Generic License (CC BY 2.5), which is indeed linked from the XML and does permit commercial exploitation.
Contradictions between machine-readable and human-readable license statements are one sort of problem. Another is that many journals (including those published by PLOS, which account for the majority of the bot's uploads so far) do not provide a license link at all or mix up the license and copyright tags in other ways. On a related note, even articles clearly and unambiguously labeled CC BY may occasionally contain materials incompatible with such licensing, and journals that otherwise use Creative Commons licenses occasionally publish an article under Crown copyright or similar conditions, causing the bot to skip it.
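To illustrate the kind of check this implies, here is a minimal Python sketch that compares the license link in a JATS-style `<license>` element with the wording of the accompanying notice. It is not the bot's actual code: the function names, the status labels and the small list of "non-commercial" phrases are assumptions made for illustration, and the heuristic will miss many real-world variants.

```python
import re
import xml.etree.ElementTree as ET

# Attribute name of xlink:href as ElementTree sees it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def license_terms_from_url(url):
    """Return terms such as {'by', 'nc'} from a Creative Commons URL, or None."""
    match = re.search(r"creativecommons\.org/licenses/([a-z-]+)/", url or "")
    return set(match.group(1).split("-")) if match else None


def text_restricts_commercial_use(text):
    """Crude heuristic (assumed phrase list): does the notice restrict commercial use?"""
    lowered = text.lower()
    patterns = (r"non-?\s?commercial", r"\bby-nc\b", r"not permit commercial")
    return any(re.search(pattern, lowered) for pattern in patterns)


def check_license(xml_fragment):
    """Classify a fragment as 'no-link', 'unknown-link', 'contradictory' or 'consistent'."""
    root = ET.fromstring(xml_fragment)
    lic = root if root.tag == "license" else root.find(".//license")
    if lic is None or not lic.get(XLINK_HREF):
        return "no-link"
    terms = license_terms_from_url(lic.get(XLINK_HREF))
    if terms is None:
        return "unknown-link"
    notice = " ".join("".join(lic.itertext()).split())
    # A mismatch in either direction between link and wording is flagged.
    if text_restricts_commercial_use(notice) != ("nc" in terms):
        return "contradictory"
    return "consistent"


# The notice quoted above, paired with the CC BY 2.5 link it ships with.
EXAMPLE = (
    '<license xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="http://creativecommons.org/licenses/by/2.5/">'
    "<license-p>Re-use of this article is permitted in accordance with the "
    "Creative Commons Deed, Attribution 2.5, which does not permit "
    "commercial exploitation.</license-p></license>"
)
print(check_license(EXAMPLE))
```

A check like this can only flag candidates for human review; it cannot tell which of the two conflicting statements reflects the publisher's intent.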
Such a licensing mess raises a number of questions: if the licensing for a given article agrees in its human- and machine-readable version on PubMed Central, can we then be sure that this information is correct? This is the case, for instance, with the journal Molecular Vision. What if the same article does not have any licensing statement at the journal's site (or in the XML there), or if the journal's copyright policy states CC BY-NC-ND as the only option? What if Google finds several articles from the same journal that are also labeled as CC BY?
The mismatch between stated licenses and actual licensing conditions also makes it difficult to assess, in an automated fashion, what amount of audio, video or other materials is available from PubMed Central under Wikimedia-compatible licenses. For some plots on the matter, see this blog post, which also highlights another frequent issue: that of a mismatch between the actual MIME type and that stated in the XML, as in the following example:
As a rough estimate, MIME type mismatches of this kind affect on the order of 10% of the supplementary files in the database. Since this translates to hundreds of multimedia files, the bot now attempts to determine the MIME type of all supplementary materials and chooses those that are, in fact, audio or video, irrespective of what the XML states about them, thereby even covering cases in which the XML makes no statement about the MIME types. The bot naturally fails, however, in cases where suitably licensed articles do have supplementary multimedia files but these are not mentioned in the XML available from PMC (another case: journal; PMC).
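One common way to determine a file's type independently of its metadata is to look at the first few bytes of the file. The Python sketch below shows the idea for a handful of well-known container signatures. It is an illustration only: the report does not say how the bot performs this step, the function names are hypothetical, and the signature list is far from complete.

```python
def sniff_media_type(head):
    """Guess a MIME type from the first bytes of a file (partial signature list)."""
    if head.startswith(b"OggS"):
        return "application/ogg"
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio/x-wav"
    if head[:4] == b"RIFF" and head[8:12] == b"AVI ":
        return "video/x-msvideo"
    if head[4:8] == b"ftyp":
        # ISO base media files; the brand "qt  " marks QuickTime movies.
        # Audio-only files in this container are also reported as video here.
        return "video/quicktime" if head[8:12] == b"qt  " else "video/mp4"
    if head.startswith(b"\x1a\x45\xdf\xa3"):
        return "video/x-matroska"  # EBML header, used by Matroska and WebM
    if head.startswith(b"ID3"):
        return "audio/mpeg"  # MP3 file with an ID3v2 tag
    if head.startswith(b"PK\x03\x04"):
        return "application/zip"
    if head.startswith(b"%PDF"):
        return "application/pdf"
    return "application/octet-stream"


def is_multimedia(head):
    """True if the sniffed type is audio or video, whatever the XML claims."""
    sniffed = sniff_media_type(head)
    return sniffed == "application/ogg" or sniffed.split("/")[0] in ("audio", "video")


# A file declared as "application/octet-stream" in the XML is still selected
# if its first bytes identify it as a QuickTime movie.
declared = "application/octet-stream"
head = b"\x00\x00\x00\x14ftypqt  "
print(declared, sniff_media_type(head), is_multimedia(head))
```

Because the decision relies on the file content alone, the same test also works when the XML states no MIME type at all.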
Another reason preventing the import of some suitably licensed materials is that files are frequently hidden in zip archives, which the bot ignores for the time being.
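For illustration, looking inside such archives could start with something like the following Python sketch, which lists archive members whose file extension suggests audio or video. This is not something the bot does at present; the function name and the extension list are assumptions, and a real implementation would also need to inspect the file contents, since extensions can be as unreliable as declared MIME types.

```python
import io
import os
import zipfile

# Assumed, incomplete list of extensions that hint at audio or video content.
MEDIA_EXTENSIONS = {
    ".avi", ".mov", ".mp4", ".mpg", ".mpeg", ".wmv",
    ".mp3", ".wav", ".ogg", ".oga", ".ogv",
}


def list_media_in_zip(data):
    """Return the names of archive members that look like audio or video files."""
    names = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            extension = os.path.splitext(info.filename)[1].lower()
            if extension in MEDIA_EXTENSIONS:
                names.append(info.filename)
    return names


# Build a small archive in memory to try the function out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("supplement/Movie_S1.MOV", b"placeholder")
    archive.writestr("supplement/Table_S1.xls", b"placeholder")
print(list_media_in_zip(buffer.getvalue()))
```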
Once suitably licensed multimedia files have been identified as such, they have to be converted to a format accepted at Wikimedia Commons, i.e. OGG. This does not always work, since some authors use rather unusual file formats, or the metadata about the files (e.g. the length of a video) is incorrect or not stated at all. Most journals have a disclaimer that proper functioning of supplementary materials is the authors' responsibility, but it would be nice to establish a standard for testing that supplementary files submitted to journals actually convert properly to common standard formats.
Further issues arise when the files are converted and need to be associated with their metadata in Commons style: sometimes, there is no description whatsoever of supplementary files, or the descriptions of several files are lumped together in a way that the bot cannot parse. Some minor issues include line breaks in article titles or typos in categories or keywords provided by the journal, which the bot uses for the initial categorization of the files.
A problem not really solved so far is that of duplicate detection. While this works well for images, it does not for multimedia files, since multiple copies of a file will normally have different hashes.
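The limitation is easy to see in a minimal Python sketch of hash-based duplicate detection. The choice of SHA-1 and the function names are assumptions for illustration, and the byte strings merely stand in for real files: only byte-identical copies end up in the same group, while a copy that differs by even a single byte (as any re-encoded video would) gets an unrelated digest.

```python
import hashlib


def sha1_hex(data):
    """Hex digest of the raw file content."""
    return hashlib.sha1(data).hexdigest()


def find_exact_duplicates(files):
    """Group file names by content hash and keep only groups with several members."""
    groups = {}
    for name, data in files.items():
        groups.setdefault(sha1_hex(data), []).append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}


# Placeholder contents: two identical copies and one that differs slightly,
# standing in for the same video after a second conversion.
files = {
    "Video_S1.ogv": b"same bytes",
    "Video_S1_copy.ogv": b"same bytes",
    "Video_S1_reencoded.ogv": b"same bytes, different encoding",
}
for digest, names in find_exact_duplicates(files).items():
    print(digest, names)
```

Catching the third file would require comparing the decoded audio or video rather than the stored bytes, which is a considerably harder problem.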
The following files represent a selection of what has been uploaded by the Open Access Media Importer this month. If you can think of wiki pages where these files could be useful, please put them in there or let us know. For metadata about the files, please click on the Menu button.
Can you guess the research question addressed in the corresponding scholarly article?
Can you guess what these sounds represent?