The bot attempts to provide the files with categories based on keywords, subject categories or MeSH terms supplied for the corresponding article by the journal, by PubMed Central or by PubMed. This sometimes leads to miscategorization, often to overcategorization, and occasionally to no categories at all. At present, the files are spread over more than 20,000 categories, almost 10% of which had to be created for the occasion (e.g. for well over a hundred journals). Some of these categories (e.g. Caenorhabditis elegans or Green fluorescent proteins) are now filled with hundreds of files, which will eventually have to be distributed across more fine-grained categories. For many topics covered by PubMed Central (mostly biomedicine), there are thus now far more multimedia files available than the corresponding Wikipedia entries (if they exist) can accommodate. For further examples, see actin cytoskeleton (on the English Wikipedia and on Commons), Gap junction (Wikipedia; Commons) or Woronin body (Wikipedia; Commons).
The review of the categorization of the files, and of these new categories themselves, continues - a process you can help with by checking out a few of them (bearing with the overburdened Toolserver) and adding or removing categories as appropriate. If you can think of wiki pages where these files could be useful, please add them there.
Bug fixes continued but shifted in focus from providing functionality to minimizing the effects of inconsistent and incorrect metadata available from PubMed Central.
Metadata at PubMed Central
The most prominent issue with the XML is that of incorrect or self-contradictory licensing statements. While this had already been noticed in spring (e.g. "licensed under a Creative Commons Attr0ibution 3.0 License (by-nc 3.0)" - yes, with a typo on top of it - in an article from Orthopedic Reviews, published by PagePress), actually deploying the bot to larger parts of the database made it clear that the phenomenon is rather common and not restricted to small or lesser-known publishers.
While contradictions between machine-readable and human-readable license statements are one sort of problem, many journals - including those published by PLOS, which account for the majority of the bot's uploads so far - do not provide a license link at all, or mix up the license and copyright tags in other ways. On a related note, even articles clearly and unambiguously labeled CC BY may occasionally contain materials incompatible with such licensing, and journals that otherwise use Creative Commons licenses occasionally publish individual articles under Crown copyright or similar conditions, causing the bot to skip those articles.
Such a licensing mess raises a number of questions: if the human- and machine-readable versions of an article's license agree on PubMed Central, can we then be sure that this information is correct? This is the case, for instance, with the journal Molecular Vision. What if the same article does not have any licensing statement at the journal's site (or in the XML there), or if the journal's copyright policy states CC BY-NC-ND as the only option? What if Google finds several articles from the same journal that are also labeled as CC BY?
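To illustrate what the machine-readable side of this check involves, here is a minimal sketch of pulling the license out of a JATS-style article XML and testing it against Wikimedia-compatible Creative Commons URLs. The element and attribute names follow the JATS convention (a <license> element with an xlink:href attribute and <license-p> text); the list of accepted license prefixes is an illustrative assumption, not the bot's actual whitelist.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Illustrative whitelist: CC BY and CC BY-SA are Wikimedia-compatible;
# -NC and -ND variants are not. Trailing slash keeps "by-nc" from
# matching the "by" prefix.
FREE_LICENSE_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "http://creativecommons.org/licenses/by-sa/",
)

def extract_license(xml_string):
    """Return (machine_readable_url, human_readable_text) from the first
    JATS <license> element; either component may be None."""
    root = ET.fromstring(xml_string)
    for lic in root.iter("license"):
        url = lic.get("{%s}href" % XLINK)
        text = " ".join(
            "".join(p.itertext()).strip() for p in lic.iter("license-p")
        )
        return url, (text or None)
    return None, None

def is_free_license(url):
    """True if the machine-readable license URL is on the whitelist."""
    return url is not None and url.startswith(FREE_LICENSE_PREFIXES)
```

In practice the human-readable text would also have to be parsed and compared against the URL, which is exactly where the contradictions described above surface.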
The mismatch between stated licenses and actual licensing conditions also makes it difficult to assess, in an automated fashion, how much audio, video or other material is available from PubMed Central under Wikimedia-compatible licenses. For some plots on the matter, see this blog post, which also highlights another frequent issue: a mismatch between a file's actual MIME type and the one stated in the XML, as in the following example:
As a rough estimate, MIME type mismatches of this kind affect on the order of 10% of the supplementary files in the database. Since this translates into hundreds of multimedia files, the bot now attempts to determine the MIME type of all supplementary materials and selects those that are, in fact, audio or video, irrespective of what the XML states about them - thereby also covering cases in which the XML makes no statement about MIME types at all. The bot still fails, however, in cases where suitably licensed articles do have supplementary multimedia files but these are not mentioned in the XML available from PMC (another case: journal; PMC).
Once suitably licensed multimedia files have been identified as such, they have to be converted to a format accepted at Wikimedia Commons, i.e. Ogg. This does not always work, since some authors use rather unusual file formats, or the metadata about the files (e.g. the length of a video) is incorrect or not stated at all. Most journals have a disclaimer that the proper functioning of supplementary materials is the authors' responsibility, but it would be nice to establish a standard for testing that supplementary files submitted to journals actually convert properly to common standard formats.
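The conversion step typically means handing the file to an external transcoder. As a hedged sketch only - the importer's actual pipeline and quality settings may differ - here is a helper that assembles an ffmpeg2theora command line (using its real --videoquality, --audioquality and --output options) without executing it:

```python
def theora_command(infile, outfile, video_quality=5, audio_quality=1):
    """Build (but do not run) an ffmpeg2theora invocation converting
    `infile` to Ogg Theora/Vorbis at `outfile`. Quality values are
    illustrative defaults, not the bot's actual settings."""
    return [
        "ffmpeg2theora",
        "--videoquality", str(video_quality),
        "--audioquality", str(audio_quality),
        "--output", outfile,
        infile,
    ]
```

The resulting list can be passed to subprocess.check_call; a nonzero exit status or a zero-length output file is exactly the kind of conversion failure described above, triggered by exotic codecs or broken metadata in the source file.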
A problem not really solved so far is that of duplicate detection - while this works well for images, it does not for multimedia files, since two encodings of the same video will normally have different hashes.
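The underlying reason is that hash-based duplicate detection only catches bit-identical files. A minimal illustration (with stand-in byte strings rather than real video data):

```python
import hashlib

def sha1_of(data):
    """Content hash of the kind used for exact-duplicate detection."""
    return hashlib.sha1(data).hexdigest()

# A byte-identical copy hashes to the same value, so it is caught...
original = b"stand-in for raw video bytes"
exact_copy = bytes(original)

# ...but a re-encoded version of the same footage shares essentially no
# bytes with the original, so its hash differs and the duplicate goes
# undetected. (Here the "transcode" is simulated by altering the bytes.)
reencoded = b"STAND-IN FOR RAW VIDEO BYTES"
```

Catching such duplicates would require comparing the decoded content (e.g. perceptual fingerprints) rather than the container bytes, which is a much harder problem.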
The following files represent a selection of what has been uploaded by the Open Access Media Importer this month. If you can think of wiki pages where these files could be useful, please put them in there or let us know. For metadata about the files, please click on the Menu button.
Can you guess the research question addressed in the corresponding scholarly article?