Public Policy Initiative Evaluation and Research
This page is an archive. Do not edit the contents of this page. Direct any additional comments to the current talk page.
- 1 Overview
- 2 Analysis Components
- 2.1 Metric Assessment
- 2.2 Public Policy Content Improvement & Article Quality Assessment
- 2.3 Basic Descriptive Statistics for the Project
- 2.4 Categorization of Public Policy Articles
- 2.5 Focus Groups
- 2.6 Analysis of Curriculum & Development of Wikipedia as a University Teaching Tool
- 3 Related Pages
Overview
The research and analysis plan for the Public Policy Initiative (PPI) is still under development and will depend on the involvement and roles defined by the collaborating universities and the invested Wikipedians. This page will be the platform for discussions related to analysis, assessment, evaluation, and research for the project. Wikipedians, students, professors, and researchers are encouraged to participate in discussions and offer tips and tools that may facilitate project research.
The goals of the PPI are:
- To improve the content of United States Public Policy within Wikipedia
- To recruit and retain new subject matter expert contributors in the area of public policy
- To develop a curriculum of Wikipedia as a teaching tool for use in university classrooms
- To create a support system within the Wikipedia community for new subject matter expert contributors that is sustainable after the project ends
- To develop a reproducible working model for engaging experts in any subject area, recruiting new contributors, improving content in that area, and gaining support for these activities from the Wikipedia community
The key research questions for the evaluation are:
- Did Wikipedia content within the scope of U.S. public policy improve as a result of the Public Policy Initiative?
- Did the Public Policy Initiative establish a successful model of collaboration between Wikimedia Foundation, academic experts, and the Wikipedia community for the purposes of content improvement in U.S. public policy and integrating Wikipedia as a university teaching tool? Is this model transferable to other subject areas?
- What evidence exists that the deliverables of the Public Policy Initiative will continue in the future? Are the results sustainable?
Overlap with Larger Issues in Wikipedia
Even though the subject area of the project is focused on public policy, it is important that the evaluation capture the aspects which will have a lasting impact on Wikipedia and are essential to the mission of Wikimedia. Some of the issues that overlap with the goals of the PPI are:
- Use of Wikipedia as a university classroom teaching tool
- Contributor demographics
- Recruitment and retention of new contributors in Wikipedia
- Recruitment and retention of expert contributors in Wikipedia
- Consistency and analytics of article quality assessment
- Implementation of a new reader feedback tool for user assessment of articles
WikiProject: United States Public Policy & Article Quality Metric
The WikiProject United States Public Policy (WP:USPP) will be an essential tool for several of the analyses. At the beginning of the project in July 2010, the PPI team worked with the Wikipedia community to add a quantitative element to the existing article quality assessment metric. This metric will be used to measure article quality improvement over the course of the project.
The whole evaluation plan consists of several smaller analyses that, when combined, will hopefully provide a complete picture of the project's success. There are several analysis layers to the evaluation plan of the Public Policy Initiative:
Metric Assessment
How accurate is the article quality metric?
This analysis is fully mapped out. Now that classes have started, we are recruiting Wikipedians and public policy subject matter experts to help assess the articles and test-drive the metric. The experiment is designed for 10 reviewers, but it would be much more powerful and more likely to produce significant results if 20 people participated in this assessment.
Part of the purpose of the article quality metric is to provide an assessment of article quality that is quantifiable. The metric produces a numeric score, with considerable weighting on content and sourcing. Thresholds built into the metric require minimum scores in certain assessment areas, further ensuring that articles must meet certain content and neutrality requirements to earn specific assessment rankings. This particular analysis will look at variation within the article quality assessment metric:
- Do different types of evaluators, subject area experts or Wikipedia experts, produce statistically different article assessments?
- What is the point range of variation for an article assessment? (How good is our tool?)
- How closely did the average rating assigned to an article during the testing process align with the article's current score?
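The threshold mechanism described above can be sketched as follows. This is a hypothetical illustration only: the area names, point values, and cut-offs are invented placeholders, not the real metric's weights.

```python
# Sketch of a threshold-gated ranking. Area names, point scales, and
# cut-offs are invented placeholders, not the metric's actual values.
CUTOFFS = {"B": 40, "C": 25, "Start": 0}           # minimum total per rank
MINIMUMS = {"B": {"content": 12, "sourcing": 10}}  # per-area gates for "B"

def rank_article(area_scores):
    """Return the highest rank whose total cutoff and area gates are met."""
    total = sum(area_scores.values())
    for rank, cutoff in CUTOFFS.items():           # best rank checked first
        gates = MINIMUMS.get(rank, {})
        if total >= cutoff and all(
            area_scores.get(area, 0) >= minimum
            for area, minimum in gates.items()
        ):
            return rank
    return "Stub"

rank = rank_article({"content": 14, "sourcing": 11, "neutrality": 9, "style": 8})
```

A high total alone is not enough under this scheme: an article scoring 42 placeholder points with weak sourcing would still fall to "C", mirroring the idea that rankings require minimum scores in specific areas.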
Sources of variation
- time lag (assumption: initial assessments with the metric will have more variation than subsequent assessments)
- consistency of article assessment will improve with time as the metric is more widely used
- consistency of evaluators as individuals will improve over time as they get used to using the metric
- variation of quality between different articles
- variation between types of evaluators: Wikipedia experts and subject matter experts
- variation between individual evaluators: some people will grade harder/easier than others
- Primary test variables
- variation in article score between evaluators
- variation in article score within the same evaluator
- Additional variables
- type of evaluator
- number of edits between the two draw dates
- I just want to emphasize that the purpose of this evaluation is not to gauge variability in article quality, but to look at the metric itself: How consistent is this assessment tool? And is there a difference in scores between subject matter expert assessments and Wikipedian assessments?
- First, a power analysis was necessary to determine the required sample size for testing the article quality metric. (Power analyses are a bit of a catch-22, because the analysis requires the analyst to enter an approximate value for the variation in order to generate the sample size parameters.)
- To get an approximate variation in article quality score, LiAnna and I read each other's and Sage's articles from the assessment test. The result was an average standard deviation in score of 5.
- I used Russ Lenth's power analysis page, ran the two-variances (F test) applet, and entered the following parameters: Var1 > Var2, equal ns, alpha = 0.05, variance 1 = 25 (variance is the standard deviation squared), and variance 2 = 16 (I assumed that the variation would decrease slightly as reviewers gained experience assessing articles). I adjusted the sample sizes (n1 and n2) until the power equaled 0.9, the standard for social data. The required sample size was 173 per population. The two populations here are the same articles drawn on different dates: population 1 will be 25 randomly selected public policy articles from date 1, and population 2 will be the same 25 articles from date 2. Three major aspects of this analysis will reduce variation, and this standard power analysis was unable to account for them: first, the articles in the two populations are the same, which will have a large effect; second, each article will be re-reviewed by the same reviewer at least once; and third, all assessments will be performed by a limited number of reviewers. These features of the study design will hopefully reduce variation enough to produce significant results with a much smaller sample size. However, since it is not feasible to perform 346 article assessments for this analysis, the final results may fail to reach significance either because of insufficient sample size or because there is no significant difference between the scores of the two populations.
- The final sampling plan is as follows:
- The two populations consist of one set of 25 randomly selected articles drawn from two dates.
- 10 reviewers, some Wikipedia experts and some subject matter experts, will review a randomly assigned sample of articles from the first draw date. At the second date, reviewers will assess another group of articles and will also re-review 2 or 3 articles that they assessed previously.
- The 25 articles were selected from the population of 260 public policy articles that existed within the WikiProject: U.S. Public Policy on 13 July 2010. The articles were selected using an online random number generator.
- 5 reviewers will assess 8 articles from the first date, 6 new articles from the second date, and re-review 3 articles from the second date that they had previously assessed on the first date.
- 5 reviewers will assess 6 articles from the first date, 5 new articles from the second date, and re-review 2 articles from the second date that they had previously assessed on the first date.
- Each article will be assessed a total of 6 times: 5 assessments by unique reviewers, plus one re-review by one of those reviewers. This creates a combined sample size of 150 assessments.
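This sampling plan can be sketched in a few lines of Python, which also double as a sanity check on the assessment counts (the seed and integer article IDs are placeholders, not the real article list):

```python
import random

random.seed(13)  # fixed seed so this sketch is reproducible

# 25 articles drawn at random from the 260 USPP articles of 13 July 2010
# (integer IDs stand in for the real article titles)
articles = random.sample(range(1, 261), 25)

# Reviewer loads: (first-date articles, new second-date articles, re-reviews)
loads = [(8, 6, 3)] * 5 + [(6, 5, 2)] * 5   # ten reviewers in two groups

total_assessments = sum(first + new + rereview
                        for first, new, rereview in loads)
per_article = total_assessments / len(articles)   # 6 assessments per article
```

The two reviewer groups contribute 5 × 17 + 5 × 13 = 150 assessments, which works out to exactly 6 assessments per article, matching the plan above.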
An F test examines whether two populations have the same variance (or, equivalently, standard deviation). For this metric study, the variation between scores for the same article is important, but differences in the mean scores between articles are not. This statistical test will describe the degree of variability within the article assessment metric.
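Both the F test and the earlier power calculation can be sketched with `scipy.stats.f`. The score lists below are invented examples, and the variances 25 and 16 are the values assumed in the power analysis above:

```python
import numpy as np
from scipy.stats import f

def f_test_one_sided(scores1, scores2):
    """One-sided F test of H0: var1 <= var2 against H1: var1 > var2."""
    v1 = np.var(scores1, ddof=1)   # sample variance of first-date scores
    v2 = np.var(scores2, ddof=1)   # sample variance of second-date scores
    f_stat = v1 / v2
    p_value = f.sf(f_stat, len(scores1) - 1, len(scores2) - 1)
    return f_stat, p_value

# Invented example scores: the first draw is much more spread out
f_stat, p_value = f_test_one_sided([40, 50, 60, 45, 55], [48, 50, 52, 49, 51])

def power(n, var1=25.0, var2=16.0, alpha=0.05):
    """Power of the one-sided two-variance F test at per-group size n."""
    df = n - 1
    f_crit = f.ppf(1 - alpha, df, df)        # rejection threshold under H0
    # Under H1 the observed variance ratio follows (var1/var2) * F(df, df)
    return f.sf(f_crit / (var1 / var2), df, df)

n = 2
while power(n) < 0.9:   # smallest per-group n reaching 90% power
    n += 1
```

Running the search for `n` lands in the neighborhood of the 173 articles per population reported from Lenth's applet.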
Public Policy Content Improvement & Article Quality Assessment
How do we determine whether, and to what extent, the content of the public policy articles improved as a result of the Public Policy Initiative?
Class Targeted Article Quality Evaluation
How many articles were directly improved by the PPI?
Did the articles targeted by university classes improve?
This analysis is partially planned and awaits a list of all the articles worked on directly through university participation in the project. The analysis will compare all university-targeted articles at two dates, one at the beginning and one at the end of the project, most likely using a paired t test.
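A minimal sketch of such a paired t test with `scipy.stats.ttest_rel`; the scores below are invented placeholders for the same articles' metric totals at the two dates, not project data:

```python
from scipy.stats import ttest_rel

# Hypothetical metric totals for the same eight articles at the two dates
score_start = [42, 55, 38, 61, 47, 50, 33, 58]   # beginning of project
score_end   = [48, 60, 41, 66, 52, 57, 35, 63]   # end of project

# Paired t test: did the same articles score higher at the second date?
t_stat, p_value = ttest_rel(score_end, score_start)
```

Because the test pairs each article with itself, between-article variation drops out, and only the per-article score changes drive the result.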
Ripple Effect on Article Quality
Did the PPI have an impact on articles related to those targeted by the universities?
Did new users recruited through the PPI become editors of other subject areas?
- From the articles targeted by the universities we will generate a list of all articles that link to and from the original university targeted articles. A random sample of the linked articles will be assessed for quality improvement.
- If possible, it would also be nice to see if new users began to contribute in other subject areas.
- Ideas will be solicited for how to perform this analysis.
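One way to draw the random sample of linked articles, sketched with invented titles (the real pool would come from the university-targeted articles' incoming and outgoing links):

```python
import random

# Hypothetical map from each targeted article to articles linking to/from it
links = {
    "Targeted article A": ["Policy X", "Agency Y", "Law Z"],
    "Targeted article B": ["Policy X", "Program W", "Law Z"],
}

# Pool of unique linked articles, excluding the targeted articles themselves
pool = sorted({title for linked in links.values() for title in linked}
              - set(links))

random.seed(7)                    # reproducible draw for this sketch
sample = random.sample(pool, 3)   # sample size here is a placeholder
```

Deduplicating the pool first matters, since heavily linked articles would otherwise be over-represented in the sample.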
Impact on New Articles
Did the PPI have an impact on creation of new articles?
How many new articles were created as a result of the PPI?
Is this number significantly different from the average number of new articles created?
Ideas will be solicited about how to perform this analysis.
Article Feedback Tool
How did Wikipedia readers rate the public policy articles?
How do results from the reader feedback tool compare with the article quality assessment metric results?
Analysis using this tool will be determined once it is implemented.
Basic Descriptive Statistics for the Project
Several basic descriptive statistics will be compared at the beginning and end of the project. This information will be derived through WP:USPP. These statistics will comprise the initial baseline data provided to the Advisory Board in early August.
- Number of contributing universities, professors, students, ambassadors
- Number of Public Policy Articles captured within WP:USPP
- Number and a list of Public Policy Categories
- Number and percentage of assessed articles
- Breakdown of the assessments by quality class
- Number of unique contributors of Public Policy Articles
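The assessment breakdown and percentage can be tallied with a few lines of Python; the quality classes below are invented sample data, with `None` marking an unassessed article:

```python
from collections import Counter

# Hypothetical quality classes from WP:USPP talk-page banners
classes = ["Stub", "Start", "C", "Start", "B", "Stub", None, "C", "Stub", None]

assessed = [c for c in classes if c is not None]
breakdown = Counter(assessed)                       # count per quality class
pct_assessed = 100 * len(assessed) / len(classes)   # share of articles assessed
```

The same tally, re-run at the start and end of the project, would give the before/after comparison described above.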
Categorization of Public Policy Articles
What effect did the PPI have on categorization of public policy articles?
How do we show the changes that arose in public policy categorization as a result of the PPI?
Ideas will be solicited on how to perform this analysis and how to describe categorization within Wikipedia.
Focus Groups
How successful was the PPI at building an infrastructure to support use of Wikipedia as a university classroom tool?
How successful was the ambassador program?
What issues do we need to address to improve quality/mobility of the program?
In December 2010 and May 2011, the team will conduct focus groups with students, campus & online ambassadors, and professors to identify specific successes and failings of the project and to address issues that may affect the project's future sustainability.
Analysis of Curriculum & Development of Wikipedia as a University Teaching Tool
How successful was the curriculum that was developed through this project?
Was the project successful at recruiting and retaining new users?
WestED will perform an independent analysis of this portion of the project.