GLAM/Newsletter/November 2012/Contents/Open Access report
The Open Access Media Importer at full speed; Publishers deliver inconsistent XML to PubMed Central; Importing from other sources
Open Access Media Importer
After being approved as a bot on Commons in late October, the Open Access Media Importer ran almost continuously throughout November. It scanned PubMed Central for suitably licensed scholarly articles with supplementary multimedia and imported these into Wikimedia Commons, to a current total of well over 9,000 files. This raised the number of files under Category:Open access (publishing) and its subcategories to almost 14,000 by the end of the month.
The bot attempts to provide the files with categories based on keywords, subject categories or MeSH terms supplied by the journal, by PubMed Central or by PubMed for the corresponding article. This sometimes leads to miscategorizations, often to overcategorization, and occasionally to no categories at all. At present, the files are spread over more than 20,000 categories, almost 10% of which had to be created for the occasion (e.g. for well over a hundred journals). Some of these categories (e.g. Caenorhabditis elegans or Green fluorescent proteins) are now filled with hundreds of files, which will eventually have to be distributed across more fine-grained categories. For many topics covered by PubMed Central (mostly biomedicine), there are thus now far more multimedia files available than the current Wikipedia entries (if they exist) can accommodate. For further examples, see actin cytoskeleton (on the English Wikipedia and on Commons), Gap junction (Wikipedia; Commons) or Woronin body (Wikipedia; Commons).
The review of the categorization of the files and of these new categories themselves continues - a process that you can facilitate by checking out (thanks to the overburdened Toolserver) a few of them and adding or removing categories as appropriate. If you can think of wiki pages where these files could be useful, please put them in there.
Bug fixes continued but shifted in focus from providing functionality to minimizing the effects of inconsistent and incorrect metadata available from PubMed Central.
Metadata at PubMed Central
The most prominent issue with the XML is that of incorrect or self-contradicting licensing statements. While this had been noticed already in spring (e.g. "licensed under a Creative Commons Attr0ibution 3.0 License (by-nc 3.0)" - yes, with a typo on top of it, in an article from Orthopedic Reviews, published by PagePress), actually deploying the bot to larger parts of the database made it clear that the phenomenon is rather common and not restricted to small and lesser known publishers.
The Open Access Media Importer analyzes the XML of articles stored in PubMed Central's Open Access subset. That XML is delivered by the individual journals or publishers, which gives rise to a plethora of individual styles that may or may not be close to the actual specifications of the National Library of Medicine's Document Type Definition (now named JATS).
Besides the two journals highlighted in the figures, other journals affected by contradictory license statements include Evolutionary Applications (Wiley-Blackwell), Traffic (Copenhagen, Denmark) (Wiley-Blackwell), Cellular Microbiology (Wiley-Blackwell), Cytotherapy (Informa), The American Journal of Tropical Medicine and Hygiene (American Society of Tropical Medicine and Hygiene), The FEBS Journal (Wiley-Blackwell), Hepatology (Baltimore, Md.) (Wiley-Blackwell), Journal of Cellular Physiology (Wiley-Blackwell) and Database: The Journal of Biological Databases and Curation (Oxford University Press). At the Journal of Neurochemistry (Wiley-Blackwell), the self-contradictory notice "Re-use of this article is permitted in accordance with the Creative Commons Deed, Attribution 2.5, which does not permit commercial exploitation." is even displayed directly on the article's page. The closest match to the term "Creative Commons Deed, Attribution 2.5" would be the Creative Commons Attribution 2.5 Generic License (CC BY 2.5), which is indeed linked from the XML and does permit commercial exploitation.
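To illustrate what such a contradiction looks like to a program, here is a minimal Python sketch. It is not the bot's actual code. It assumes that the license is given in a JATS-style license element with an xlink:href attribute and some prose inside, and the list of hint phrases (NC_HINTS) is a made-up, deliberately crude heuristic. The sample XML is constructed around the Journal of Neurochemistry notice quoted above.

```python
import xml.etree.ElementTree as ET

# Assumption: the license link is stored as an xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical phrases that suggest a non-commercial restriction in the prose.
NC_HINTS = ("non-commercial", "noncommercial", "by-nc", "does not permit commercial")


def license_statements(article_xml):
    """Return a (link, text) pair for every <license> element in the article."""
    root = ET.fromstring(article_xml)
    pairs = []
    for lic in root.iter("license"):
        link = lic.get(XLINK_HREF)  # None if no license link was supplied
        # Collapse line breaks and indentation so phrases can be matched.
        text = " ".join("".join(lic.itertext()).split())
        pairs.append((link, text))
    return pairs


def is_contradictory(link, text):
    """True if link and prose disagree about a non-commercial restriction."""
    if link is None:
        return False  # nothing machine-readable to compare against
    link_is_nc = "-nc" in link.lower()
    text_is_nc = any(hint in text.lower() for hint in NC_HINTS)
    return link_is_nc != text_is_nc


# Constructed sample: CC BY 2.5 link combined with a non-commercial notice.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license xlink:href="http://creativecommons.org/licenses/by/2.5/">
      <license-p>Re-use of this article is permitted in accordance with the
      Creative Commons Deed, Attribution 2.5, which does not permit
      commercial exploitation.</license-p>
    </license>
  </permissions></article-meta></front>
</article>"""

for link, text in license_statements(SAMPLE):
    print(link, is_contradictory(link, text))
```

A check of this kind can only flag candidates for human review. It cannot tell which of the two statements reflects the actual licensing conditions.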
While contradictions between machine-readable and human-readable license statements are one sort of problem, many journals - including those published by PLOS, which account for the majority of the bot's uploads so far - do not provide a license link at all or mix up the license and copyright tags in other ways. On a related note, even articles clearly and unambiguously labeled CC BY may occasionally contain materials incompatible with such licensing, and some journals otherwise using Creative Commons licenses occasionally publish articles under Crown copyright or similar conditions, causing the bot to skip those articles.
Such a licensing mess raises a number of questions: if the licensing for a given article agrees in its human- and machine-readable version on PubMed Central, can we then be sure that this information is correct? This is the case, for instance, with the journal Molecular Vision. What if the same article does not have any licensing statement at the journal's site (or in the XML there), or if the journal's copyright policy states CC BY-NC-ND as the only option? What if Google finds several articles from the same journal that are also labeled as CC BY?
The mismatch between stated licenses and actual licensing conditions also makes it difficult to assess, in an automated fashion, what amount of audio, video or other materials is available from PubMed Central under Wikimedia-compatible licenses. For some plots on the matter, see this blog post, which also highlights another frequent issue: that of a mismatch between the actual MIME type and that stated in the XML, as in the following example:
As a rough estimate, MIME type mismatches of this kind affect on the order of 10% of the supplementary files in the database. Since this translates to hundreds of multimedia files, the bot now attempts to determine the MIME type of all supplementary materials and chooses those that are, in fact, audio or video, irrespective of what the XML states about them, thereby even covering cases in which the XML makes no statement about the MIME types. The bot naturally fails, however, in cases where suitably licensed articles do have supplementary multimedia files but these are not mentioned in the XML available from PMC (another case: journal; PMC).
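The idea of trusting the file content rather than the XML can be sketched in a few lines of Python. This is an illustration only and not how the bot itself determines MIME types. The function names are hypothetical, and the small signature table covers just a handful of common audio and video containers, where a real tool would consult a far larger database.

```python
def sniff_media_type(data):
    """Guess a MIME type from the leading bytes of a file, or return None."""
    if data.startswith(b"OggS"):
        return "application/ogg"
    if data.startswith(b"\x1a\x45\xdf\xa3"):
        return "video/x-matroska"  # EBML header, also used by WebM
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/x-wav"
    if data[:4] == b"RIFF" and data[8:12] == b"AVI ":
        return "video/x-msvideo"
    if data[4:8] == b"ftyp":
        return "video/mp4"  # MP4/QuickTime family
    if data.startswith(b"ID3"):
        return "audio/mpeg"  # MP3 file starting with an ID3 tag
    return None


def declared_type_is_wrong(declared, data):
    """True if the content is recognizably multimedia but declared otherwise.

    `declared` is the MIME type stated in the XML, or None if none is given.
    """
    actual = sniff_media_type(data)
    return actual is not None and actual != declared


# A file announced as PDF in the XML that actually starts like an Ogg stream.
print(declared_type_is_wrong("application/pdf", b"OggS\x00\x02" + b"\x00" * 20))
```

Because the decision depends only on the bytes, the same check also covers supplementary files for which the XML states no MIME type at all.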
Another reason preventing the import of some suitably licensed materials is that files are frequently hidden in zip archives, which the bot ignores for the time being.
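Looking inside such archives is not difficult in principle, as the following Python sketch suggests. It is a possible starting point and not something the bot currently does. The extension list MEDIA_EXTENSIONS is a hypothetical choice, and since extensions can be as misleading as declared MIME types, it would only serve as a first filter.

```python
import io
import zipfile

# Hypothetical set of file extensions treated as multimedia in this sketch.
MEDIA_EXTENSIONS = (".avi", ".mov", ".mp4", ".mpg", ".ogg", ".wav", ".mp3", ".wmv")


def media_members(zip_bytes):
    """List archive members whose file extension looks like audio or video."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith(MEDIA_EXTENSIONS)
        ]


# Build a small archive in memory to try the function out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("movie_S1.AVI", b"placeholder")
    archive.writestr("readme.txt", b"placeholder")

print(media_members(buffer.getvalue()))
```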
Once suitably licensed multimedia have been identified as such, they have to be converted to a format accepted at Wikimedia Commons, i.e. OGG. This does not always work, since some authors use rather unusual file formats, or the metadata about the files (e.g. the length of a video) is incorrect or not stated at all. Most journals have a disclaimer that proper functioning of supplementary materials is within the authors' responsibility, but it would be nice to establish a standard for testing that supplementary files submitted to journals actually convert properly to common standard formats.
Further issues arise when the files are converted and need to be associated with their metadata in Commons style: Sometimes, there is no description whatsoever of supplementary files, or the description of several files is lumped together in a way that the bot cannot parse. Some minor issues include line breaks in article titles or typos in categories or keywords provided by the journal, which the bot uses for the initial categorization of the files.
A problem not really solved so far is that of duplicate detection - while this works well for images, this is not the case for multimedia files, since multiple copies of a file will normally have different hashes.
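Why exact hashing falls short here can be shown with a short Python sketch. SHA-1 is used purely for illustration, and the file names and contents are invented. Two byte-identical copies end up in the same group, while a copy that differs in a single byte, as can happen after a separate conversion run, gets a different hash and goes undetected.

```python
import hashlib


def sha1_hex(data):
    """Hex digest of the raw file content."""
    return hashlib.sha1(data).hexdigest()


def exact_duplicates(files):
    """Group file names by content hash and keep groups with 2+ members."""
    groups = {}
    for name, data in files.items():
        groups.setdefault(sha1_hex(data), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Invented example: a.ogv and b.ogv are identical, c.ogv differs in one byte.
files = {
    "a.ogv": b"OggS-frame-data",
    "b.ogv": b"OggS-frame-data",
    "c.ogv": b"OggS-frame-datA",
}
print(exact_duplicates(files))
```

Catching the third file would require comparing the decoded audio or video rather than the stored bytes, which is the part that remains unsolved.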
Gallery
The following files represent a selection of what has been uploaded by the Open Access Media Importer this month. If you can think of wiki pages where these files could be useful, please put them in there or let us know. For metadata about the files, please click on the Menu button.
Videos
Can you guess the research question addressed in the corresponding scholarly article?
Sound files
Can you guess what these sounds represent?
Beyond PubMed Central
While PubMed Central is the only database currently spidered by the Open Access Media Importer, it is designed in a modular fashion, such that other sources could easily be plugged in. To lay the ground for such future work, a number of (manual) test uploads from such potential sources have been made this month.
- The metamorphosis of the protochordate Branchiostoma japonicum - the first file on Wikimedia Commons originating from a data paper (published in Dataset Papers in Biology).
- A "singing" iceberg - the first file from a data repository (PANGAEA).
- Bugula flabellata - another file from a data repository (Dryad).
WikiProject Open Access
The following news items from WikiProject Open Access have been posted this month:
- November 2: A video of a juvenile tentacled snake attacking a fish (a fathead minnow) is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in 2010 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Tentacled snake article, fathead minnow article.
- November 4: An X-ray video of an American alligator while breathing is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in 2009 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, American alligator article, X-ray article.
- November 6: A video of a raspy cricket fabricating silk is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in February and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Raspy cricket article, Silk article.
- November 8: A video of chimpanzees sharing a papaya fruit is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in 2007 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Chimpanzee article, Papaya article.
- November 10: An audio recording of the giggling call of a spotted hyena is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in BMC Ecology in 2010 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, sound file, Giggling article, Spotted hyena article.
- November 11: A video of Emperor Penguins producing traveling waves is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in 2011 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Emperor penguin article, Traveling wave article.
- November 11: Open Access report in the October 2012 issue of This month in GLAM. Traffic stats.
- November 13: A video of a pelagic thresher shark and a giant manta ray interacting in the presence of cleaner fish is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in 2011 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Pelagic thresher shark article, Giant manta ray article, Cleaner fish article.
- November 15: A video of a moth caterpillar defending itself and parasitoid wasp pupae against a stink bug is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE in 2008 and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Thyrinteina article, Glyptapanteles article, Pentatomoidea article.
- November 16: A video of a Meioglossus psammophilus acorn worm is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE last week and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Acorn worm article.
- November 16: The Open Access Media Importer Bot passed the mark of 5000 imported files.
- November 17: An audio recording of an iceberg is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in 2005 in PANGAEA as part of the supplement to a paper published in Science. Traffic stats: Main Page, sound file, iceberg article.
- November 19: An audio recording of the calling song of the 17-year periodical cicada Magicicada septendecula is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in 2007 in PLOS ONE and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, sound file, Magicicada septendecula article.
- November 20: A video - recorded with an acoustic camera - of an African bush elephant rumbling with its nose is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in PLOS ONE last week and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, African bush elephant article.
- November 22: A video of a Manduca sexta larva on Nicotiana attenuata, reacting to experimental stimulation, is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in eLife last month and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Manduca sexta article, Nicotiana attenuata article.
- November 26: An audio recording of a territorial call of a male toad Atelopus franciscus is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in 2011 in PLOS ONE and uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, sound file, Atelopus franciscus article.
- November 27: A video of astronaut Charles Duke trying to recover a hammer he had dropped on the surface of the moon is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally recorded during the Apollo 16 mission, then republished in PLOS ONE in 2009 and recently uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Charles Duke article, Apollo 16 article.
- November 28: A video of the solitary bee Colletes cunicularius pseudocopulating on the flower labellum of the orchid Ophrys lupercalis is featured on the Main Page of Wikimedia Commons under Media of the Day. It was originally published in BMC Evolutionary Biology in 2010 and recently uploaded to Commons by the Open Access Media Importer Bot. Traffic stats: Main Page, video, Colletes article, Pseudocopulation article, Ophrys article.
- November 30: The Open Access Media Importer is featured in the Wikimedia Highlights of October 2012.
Open Access File of the Day
The following files have been featured as Open Access File of the Day this month:
- November 30: A Magicicada cassini female laying its eggs.
- November 29: A hydrophilic termite (Schedorhinotermes sp.) attached to the surface of a wetted citrus leaf
- November 28: A range of putative disease-causing mechanisms for the case of the disease progeria
- November 27: A large male Bolitotherus cornutus undergoing a grip strength trial.
- November 26: A female Euglossa hyacinthina working on the construction of her nest envelope.
- November 25: Reconstruction of a Nimbadon lavarackorum mother with a juvenile.
- November 24: SEM image of the tardigrade Milnesium tardigradum in its active state.
- November 23: The brown lacewing Micromus variegatus.
- November 22: Carpodacus mexicanus (finch) vocalizations
- November 21: The fish Alticus arnoldorum performing a jump (slow motion).
- November 20: Larsenianthus arunachalensis, a member of the ginger family.
- November 19: An Atlantic Yellow-nosed Albatross.
- November 18: A Xenopus laevis female with egg batch and Xenopus tropicalis male.
- November 17: Acoustic display of the hummingbird Archilochus alexandri.
- November 16: Thyrinteina caterpillar parasitized by Glyptapanteles pupae defends against Supputius cincticeps
- November 15: Anodonthyla theoi, dorsolateral and ventral views
- November 14: A female Illacme plenipes with 662 legs.
- November 13: Aerosteon riocoloradensis
- November 12: maxilla from Fruitadens
- November 11: The Garden centipede Scutigerella immaculata.
- November 10: A female tarantula Acanthoscurria gomesiana.
- November 9: drawing of female of Glomeris marginata
- November 8: Centropyge bicolor feeding on Palaemonetes
- November 7: Archerfish shooting at prey
- November 6: The butterfly Erebia calcaria.
- November 5: gastrulation in cybrid embryos
- November 4: The sea slug Spurilla major.
- November 3: Major metabolic interactions between astrocytes and neurons.
- November 2: Venogram showing cerebral venous sinus thrombosis in a patient with Behçet's disease
- November 1: MRI scans of a microcephalic patient (right) and a healthy control (left).