This page in a nutshell: This page summarizes the basic steps in data and media partnerships (for Wikidata and structured data on Wikimedia Commons) between Wikimedians and cultural institutions. It is meant to give a high-level overview of the overall workflow and to point to the most frequently used tools.
Step in workflow
💡 Tips
(some things to think about during this phase)
🛠 Tools
(selection of software that can be used in this phase)
Negotiations between a GLAM partner and Wikimedia community members
Both sides can get to know each other by starting with smaller activities (e.g. an edit-a-thon or internal Wikimedia course).
Agreements about the co-operation can be made explicit in a Memorandum of Understanding (MoU). (Guide on how to create an MoU)
Data and media files are made available for Wikimedia Commons and/or Wikidata.
Website scraping/ingest tools (if the data is available online but the partner can't produce data exports from its database)
Tabula - open source tool to extract tables from PDF files
PAWS - a Python (Jupyter) notebook environment hosted on Wikimedia Cloud Services that can, for example, transfer records from an institution's API (see the sketch after this step)
Media files' copyright must be compatible with Commons policy. (See Commons:Licensing for comprehensive information, and this infographic for a brief overview of how it works)
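For illustration, a minimal PAWS-style sketch of harvesting records from a partner's API and saving them for later cleaning. The endpoint, parameters and field names are hypothetical and stand in for whatever the institution's API actually documents.

```python
# Minimal sketch of pulling records from a collection API (e.g. in a PAWS notebook).
# The endpoint, parameters and field names below are hypothetical; replace them
# with the documented API of the partner institution.
import csv
import requests

API_URL = "https://api.example-museum.org/v1/objects"  # hypothetical endpoint

def fetch_records(page_size=100, max_pages=5):
    """Yield records from a paginated JSON API."""
    for page in range(1, max_pages + 1):
        response = requests.get(
            API_URL, params={"page": page, "limit": page_size}, timeout=30
        )
        response.raise_for_status()
        payload = response.json()
        yield from payload.get("records", [])
        if not payload.get("has_more"):
            break

# Write the harvested records to a CSV file for later cleaning (e.g. in OpenRefine).
with open("museum_objects.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "creator", "date"])
    writer.writeheader()
    for record in fetch_records():
        writer.writerow({key: record.get(key, "") for key in writer.fieldnames})
```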
Clean up the data to be consistent and compatible with Wikimedia Commons and/or Wikidata.
Look at similar media or data items on Wikimedia Commons or Wikidata for inspiration on how to model the data.
Wikidata's WikiProjects – the 'groups' where volunteers work together on common interests – often have recommendations on data modelling for specific subjects.
Spreadsheet software - allows non-programmers to run checks against existing Wikimedia content
Google Sheets - free spreadsheet software that can be collaborative
OpenRefine (formerly Google Refine) - popular tool for advanced data cleaning, transformation and matching against Wikidata content. Its homepage includes video tutorials and a guide on how to use version 3.0 and higher for Wikidata manipulation and uploading.
PAWS and Pywikibot - for those with some programming experience, these allow large-scale querying and advanced batch actions (see the Pywikibot sketch after this step).
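As a small illustration of the Pywikibot route, the sketch below loads a single Wikidata item and lists which properties it already uses; the item ID is only an example.

```python
# Minimal sketch of inspecting a Wikidata item with Pywikibot (pre-installed in PAWS).
# The item ID is only an example.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

item = pywikibot.ItemPage(repo, "Q42")  # example item
item.get()                              # load labels, descriptions and claims

print(item.labels.get("en"))
# List which properties are already used on the item, to see what is missing
# before planning an upload.
for prop_id, claims in item.claims.items():
    print(prop_id, len(claims))
```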
Always check which data and media items are already present on Wikidata and Wikimedia Commons (for Wikidata, the query sketch after this step shows one way to do this).
Volunteers have often already autonomously uploaded quite a few images from GLAM collections.
Wikidata will probably already contain quite a few data items about creative works, people and topics related to specific GLAM collections.
On Wikimedia Commons, it is considered good practice to upload new (higher-quality) media files as separate files; don't overwrite existing files.
On Wikidata, duplicate items must be avoided and merged when they are discovered. It is OK (and even highly recommended) to add extra sources and statements to existing items though.
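One way to check what already exists on Wikidata is to query the Wikidata Query Service. The sketch below retrieves items that carry an inventory number (P217) for one collection; the collection Q-ID is only an example and should be replaced by the partner institution's own item.

```python
# Minimal sketch of checking which inventory numbers already have a Wikidata item,
# using the Wikidata Query Service. The collection Q-ID is a placeholder.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?inventoryNumber WHERE {
  ?item wdt:P195 wd:Q190804 ;        # collection: replace with the partner's Q-ID
        wdt:P217 ?inventoryNumber .  # inventory number
}
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "GLAM-upload-check/0.1 (example)"},
    timeout=60,
)
response.raise_for_status()

existing = {
    row["inventoryNumber"]["value"]: row["item"]["value"]
    for row in response.json()["results"]["bindings"]
}
print(f"{len(existing)} items already on Wikidata")
```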
Upload the new data items and/or media files to Wikidata and/or Commons.
Start with small test batches to check for structural errors.
Upload in manageable batches. Don't make your batches too large (hundreds rather than thousands) – correcting mistakes in thousands of data items or files at once is not fun.
Occasionally check uploads during the process, to prevent errors from propagating.
Wikimedia Commons:
Upload Wizard for simple uploads of up to 50 files. Offers no options for refined metadata.
Pattypan, a user-friendly batch upload tool that works with spreadsheets and that allows for refined details in metadata.
GLAMwiki Toolset, an advanced upload tool for XML feeds of large file batches. Requires days of lead time and a request for permission to use the tool.
Wikidata:
QuickStatements, create or update Wikidata items using tab-delimited or CSV files (see the sketch after this list)
OpenRefine (version 3.0 and higher), which includes powerful functionality for uploading edits directly to Wikidata
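As an illustration of preparing a QuickStatements batch, the sketch below converts cleaned CSV records into V1 tab-separated commands. The properties, the "instance of" value and the input column names are examples and must be adapted to the agreed data model.

```python
# Minimal sketch that turns cleaned records into QuickStatements (V1) commands.
# Property/item IDs and the input column names are illustrative; adapt them to your data model.
import csv

def records_to_quickstatements(records):
    """Yield tab-separated QuickStatements commands that create one item per record."""
    for record in records:
        yield "CREATE"
        yield f'LAST\tLen\t"{record["title"]}"'               # English label
        yield "LAST\tP31\tQ3305213"                           # instance of: painting (example)
        yield f'LAST\tP217\t"{record["inventory_number"]}"'   # inventory number

with open("cleaned_records.csv", encoding="utf-8") as src, \
        open("quickstatements.tsv", "w", encoding="utf-8") as dst:
    for line in records_to_quickstatements(csv.DictReader(src)):
        dst.write(line + "\n")
```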
Monitor, evaluate and report on the uploaded media files and data.
Wikimedia Commons:
GLAMorgan shows Wikimedia page views for a specific Wikimedia Commons category for a specific month (the sketch at the end of this page shows how similar view data can be retrieved via the API).
Fae's GLAM Dashboard, a set of templates that show interesting data about a Commons category, including the most edited files and the most active volunteers who have contributed to them.
Wikidata:
SPARQL Recent Changes, shows changes to items from a Wikidata query over a given period of time.
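As a small illustration of retrieving view statistics programmatically (GLAMorgan aggregates similar data per Commons category), the sketch below queries the Wikimedia REST pageviews API for a single Commons file page; the file name and date range are only examples.

```python
# Minimal sketch of fetching monthly page views for one Commons file page via the
# Wikimedia REST API. The file name and date range are examples.
from urllib.parse import quote
import requests

file_page = "File:Example.jpg"  # hypothetical file name
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "commons.wikimedia.org/all-access/user/"
    f"{quote(file_page, safe='')}/monthly/20240101/20241231"
)

response = requests.get(url, headers={"User-Agent": "GLAM-report/0.1 (example)"}, timeout=30)
response.raise_for_status()
for month in response.json()["items"]:
    print(month["timestamp"], month["views"])
```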