Identifying images from Commons found in the wild
By Jonasoberg

Introduction
Commons Machinery is an initiative by the Shuttleworth Foundation that works to persistently associate an image's context with the image itself, for images such as those found on Commons. An image's context, such as who authored it, the license attached to it, and where it originally comes from, is important for establishing the meaning or value of an image that would otherwise be just one more random work of art spread across the vast Internet.
Without any additional information, this image (right) of some cats is just that: a picture of cats. And we all know how many cats there are on the Internet today! But knowing that these are the ship's cats of HMS Hawkins peering out of the muzzle of a 7.5-inch gun, that the image comes from the National Maritime Museum collection, and that it has no known copyright restrictions means that we can place the image in the cultural context it belongs to. We relate to it differently when we know this and, more importantly, we have enough information to know that we can re-use it on our own blog if we wish, and enough details to attribute it correctly. What's more, if our computers can read and interpret this information for us, as they mostly can for Commons and Flickr, we can even get some help attributing automatically. And this is where the CommonsHasher and Elog.io come in.
The CommonsHasher & Elog.io
The CommonsHasher is a bot that is currently working its way through all images on Commons. Every hour it takes about 9,000 images (at current speed), fetches the basic metadata about those images, and computes what is known as a perceptual hash. A perceptual hash can be seen as a very large number that uniquely identifies an image: if you run the same computation on an image after it has been resized, or converted from a PNG to a JPEG, it should produce the same big number, or one that is very close to it. Commons Machinery uses a very simple blockhash algorithm for this, which can run both in a browser and in standalone applications.
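To give a feel for how this works, here is a minimal sketch of a blockhash-style computation in Python, assuming the Pillow imaging library. It is a simplification of the actual blockhash algorithm (the real one also handles colour channels and blocks that do not align with pixel boundaries), but it illustrates the idea: divide the image into equal blocks, compare each block's brightness to the median, and pack the resulting bits into one big number.

<syntaxhighlight lang="python">
# Simplified blockhash-style perceptual hash, for illustration only.
from PIL import Image

def simple_blockhash(path, bits=16):
    # Grayscale and resize so that each block becomes a single pixel.
    img = Image.open(path).convert("L").resize((bits, bits))
    values = list(img.getdata())
    median = sorted(values)[len(values) // 2]
    # One bit per block: 1 if the block is brighter than the median.
    result = 0
    for value in values:
        result = (result << 1) | (1 if value > median else 0)
    return result  # a 256-bit integer when bits=16
</syntaxhighlight>

Because resizing or re-encoding barely changes the relative brightness of the blocks, the hash of a scaled-down JPEG copy comes out identical, or nearly identical, to the hash of the PNG original.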
Having a database of such hashes means that we can match an image to an image from Commons even if it is found on another web site, and even if it has been resized or re-encoded in a different image format. Over time, the hashes we calculate will be contributed back to Wikidata, once Wikidata starts including more and more information about images from Commons. In the meantime, we at Commons Machinery are hosting our own API to that data, which will soon be published on Elog.io. Elog.io will also include a browser plugin for Firefox and Chrome that looks at the images you encounter while browsing, places a visual mark on those that come from Wikimedia Commons, and offers a way back to Wikimedia Commons to get more information about them.
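As a sketch of the intended workflow, a client-side lookup against such a service could look something like the following. Note that the endpoint URL, parameter name, and response shape here are invented for illustration; the actual Elog.io API had not been published at the time of writing.

<syntaxhighlight lang="python">
import requests

# Hypothetical endpoint and parameters, purely for illustration:
# the real Elog.io API was not yet published when this was written.
def lookup_image(hash_value):
    response = requests.get(
        "https://api.elog.io/lookup/blockhash",
        params={"hash": format(hash_value, "064x")},  # 256 bits as hex
    )
    response.raise_for_status()
    # Expected to return matching Commons works with their metadata
    # (author, license, source), so a re-user can attribute correctly.
    return response.json()
</syntaxhighlight>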
A sneak preview of the plugin is available, and the current roadmap is to publish a first version sometime in November. Feel free to join the discussion by commenting below, joining us on IRC in #commonsmachinery on Freenode, or simply signing up at Elog.io to get more information as we make it available.
Hm, almost 6 months to hash the whole of Commons? And what about the rest of the images on the Internet? :) I guess it's useful for internal comparisons, or to compare one file at a time with the whole Commons dataset (i.e. to produce attribution), but not much for chasing copyvios etc., right? --Nemo 21:50, 11 October 2014 (UTC)
- We're cautious not to overload the servers, but we'll start adding more processing nodes soon enough. Our aim is to do about a million images per day, which will allow us to tackle other open resources aside from Commons too - thanks very much for noticing, and for pointing out that we can't settle on just Commons :) And yes, this is definitely not for chasing copyright violations; we're not particularly interested in that aspect. We feel that if we can help ensure that re-used images are attributed accurately, the need to chase down copyright violations might also lessen. --Jonasoberg (talk) 22:52, 11 October 2014 (UTC)
An immediate use case that comes to mind is having a version of the "Duplicates of this file" section of the image page, but actually giving what the user expects instead of "Exact" duplicates. Such a db could also be useful to tag images that are similar to previously deleted images as needing review. Bawolff (talk) 22:41, 11 October 2014 (UTC)
- That wouldn't be terribly difficult, it seems. Remember, though, that we're talking about more-or-less exact copies, where nothing has changed except the image size and perhaps the file format. For more advanced matching, also capturing when images have been cropped or otherwise changed, it may be easy enough to just direct the user to a Google reverse image search with the parameter "site:commons.wikimedia.org". As long as we stay in the more-or-less exact copies space, our db has a public API, so once we have a bit more of Commons hashed, it wouldn't be a problem to send queries through it. (For the moment, we'd have to make do with our API/service. We'll eventually dedicate the hashes back to Commons -- we're carefully following the Structured Data discussions -- but searching them requires a few more tricks, since we define matches as images whose hashes have a Hamming distance of at most 10 bits. It's a near-match search rather than an exact search, so just having the hashes on Wikidata or Commons doesn't mean that you can easily search them.) --Jonasoberg (talk) 22:52, 11 October 2014 (UTC)
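To make the near-match point above concrete, here is a small Python sketch of such a comparison, with a hypothetical in-memory table standing in for the real database. A plain exact-match index only finds hashes that are equal, whereas a match here is any hash within 10 differing bits.

<syntaxhighlight lang="python">
def hamming_distance(h1, h2):
    # Count the bits in which the two hashes differ.
    return bin(h1 ^ h2).count("1")

def find_near_matches(query_hash, hash_db, max_distance=10):
    # hash_db maps Commons file titles to their stored hashes
    # (a hypothetical stand-in for the real backing database).
    return [title for title, h in hash_db.items()
            if hamming_distance(query_hash, h) <= max_distance]
</syntaxhighlight>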