The role of English Wikipedia’s top content creators in perpetuating gender bias

Clarification: “Account” and “content creator” are used interchangeably, because some content creators are bots operated by human editors. Thus, the same human content creator may appear more than once, and the total number of human content creators is likely to be less than 5,000.

Top article creators on Wikipedia when compared to the rest of the community, 8 January 2014. A graph by Ktr101.

According to Ktr101, the top 5,000 article creators on English Wikipedia have created 60% of all articles on the project. The top 1,000 article creators alone account for 42% of all Wikipedia articles.

Wikipedia has a well-known gender gap when it comes to articles about women.

Ktr101 made the connection between the two issues in their piece for the Signpost, saying:

“With the already low numbers of females on the site, this means that there will be more coverage of male-oriented topics. If an article is not covered immediately, there is a good chance that it will be created in the coming years. Unfortunately, this means that whatever female-oriented topics are out there will probably get further neglected, as there is less of a chance that someone will even know that the subject exists, never mind it being notable enough for an article (when in doubt, go for it). The amount of these super page creators only exacerbates the problem, as it means that the users who are mass-creating pages are probably not doing neglected topics, and this tilts our coverage disproportionately towards male-oriented topics.”

This does bring up the question: How bad is the gender gap in terms of article creation by Wikipedia’s top content creators? Are “super users” exacerbating the problem by overwhelmingly creating new articles about men and not creating large numbers of articles about women?

The easy way to answer that question is to get the percentage breakdown by gender of the articles created by all of English Wikipedia’s top 5,000 editors. This is easier said than done for a number of reasons. The first is the difficulty of labeling articles as male, female or neutral. Some of this will be inherently subjective. Some of it might actually require content analysis, because an article about, say, “Netball in Jamaica” could have been primarily written by someone interested in the men’s game despite the sport being historically female. In that case, the expected coding is turned on its head, and simply coding the article as female could be wrong. Coding from a list of titles alone requires a lot of knowledge about names and a way of verifying gender. Lindsay is one of those unisex names that can be male or female. If a person is writing mostly about Australians, the name is probably going to be male. If a person is writing about Americans, then it will probably be female. Again, cultural knowledge or authentication by viewing the article is needed. Then there is the purely subjective stuff: Should “Sex and the City” be female, should “Futurama” be male or should “West Wing” be gender neutral? Should the Abbott Ministry be male because Tony Abbott is male and most of the ministry is male (and some policies are seen as anti-female), or should it be gender neutral because women are in it and a ministry is not inherently a sexed concept? Such coding is inherently problematic and makes potential replication very difficult, especially since we are not looking at a few articles but at thousands of unique articles. Any research realistically may not be replicable.

Despite that, the question is still worth answering and worth considering. I wanted to do this, but given the time constraints imposed by some of the coding issues mentioned above, I was only able to examine the contributions of 20 of the top 5,000 contributors. This sample represents only 0.4% of the people on that list. To give an idea of the sampled article creators: the mode of the number of articles created was 101, the median was 108 and the average was 4,009. This is not quite a match for all 5,000 article creators, for whom the mode was 107, the median was 205 and the average was 536. The quartiles for the sample are 101, 108, 1,315.5 and 40,016. For all 5,000 contributors, they are 135, 205, 400.5 and 94,756. In all, 80,196 articles were included in this sample. It is not an exactly representative sample, but for my purposes of trying to begin to understand patterns and hoping to encourage others to continue this research, it is good enough.

For my purposes, women’s articles are defined as biographies of women, articles about groups of women, articles about things heavily featuring women, articles about fictional women, or articles that almost entirely discuss only women. Examples: Hillary Clinton, Canberra Capitals, The Good Wife, Lisa Simpson, African American women in politics. The same applied to articles about men. Gender-neutral articles were articles that did not fit into these categories.

Using these criteria, 1,412 articles were identified as female, 4,595 were identified as male, and 74,189 were identified as gender neutral. On the face of it, woot woot. Ignoring the gender-neutral articles, 23.5% of all articles were about women. This certainly beats the estimated contributor gender gap. Except the data suggest this is not true of individual “super users” creating articles about women. Of the 20 people in the sample, 5 did not write an article that was gendered either way. Four wrote zero articles about women but did write articles about men. That puts it at 45% of the sampled contributors not writing about women (5 of them not writing about men either), and, of the 15 people who wrote a gendered article, about 27% not writing about women. This is where a bigger sample size would probably come in handy, but it is still a bit depressing.
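The percentages in the paragraph above can be reproduced from the counts it reports. The following is a minimal sketch in Python; the variable names are my own, and the only inputs are the article and contributor counts given in this post.

```python
# Article counts reported for the 20 sampled contributors.
female, male, neutral = 1412, 4595, 74189

total = female + male + neutral      # all articles in the sample
gendered = female + male             # articles coded as female or male

share_women_gendered = female / gendered   # ignoring gender-neutral articles
share_women_all = female / total           # including gender-neutral articles

# Contributor counts reported for the sample.
sampled = 20
no_gendered_articles = 5   # wrote nothing coded as female or male
men_only = 4               # wrote about men but never about women

not_writing_women = (no_gendered_articles + men_only) / sampled
gendered_writers = sampled - no_gendered_articles
men_only_among_gendered = men_only / gendered_writers

print(f"Total articles: {total}")
print(f"Women's share of gendered articles: {share_women_gendered:.1%}")
print(f"Women's share of all articles: {share_women_all:.1%}")
print(f"Contributors with no articles about women: {not_writing_women:.0%}")
print(f"Gendered writers with no articles about women: {men_only_among_gendered:.1%}")
```

The last figure is 4 out of 15, or roughly 26.7%. The share of all sampled articles that are about women, counting gender-neutral ones, is under 2%.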

When looking only at the writers of gendered articles, and only at their gendered content, just one contributor had 50% of their articles about women. The next closest created 38% of their articles about women. The third was at 25%, the fourth at 19% and the fifth at 13%. That rounds out the top 25% of creators of content about women. The remaining 75% (including our non-gendered writers) average 2.6% of their content about women. Among that remainder, those writing about gendered topics write 4.7% of their content about women.

English Wikipedia’s “super users” are not contributing much female content. This is problematic on multiple levels. The first is ROI. A lot of money is currently being spent on encouraging new contributors to come to the project and write articles about women. There are editathons and training sessions and wikistormings. All of these cost volunteer hours and money. Research shows that edit-a-thons are not actually very cost-effective in terms of generating new content and developing a new cohort of users. A lot of times, articles developed at these events get deleted or nominated for deletion within seconds of going live. The investment required to create a cohort of new users to fix the gap is very high, and the return on it is low.

That isn’t to say that women should not be recruited and should not be encouraged to add articles about women to Wikipedia.  They absolutely should. On some level, the more this editing is normalized, the better.

It just is not a cost- and time-effective solution to fixing the representation gap for women on Wikipedia. The best option is to encourage the top 5,000 editors to create articles about women and to incentivize this group. The sheer volume of articles they have created indicates they have a good understanding of what makes a person or topic notable enough to be eligible for an article. They do not need to learn the interface because they probably mastered it on their way to creating these articles. They have accumulated a reputation that, for a number of them, makes their articles much less likely to be deleted. The group is clearly passionate about Wikipedia, enough to create a large number of articles. The cost of getting them to switch over to creating content about women is probably much lower.

The second problem, once ROI is out of the way, is one Ktr101 alludes to: If top content creators continue with their current contribution patterns, the underrepresentation of women is likely to get worse, not better. If one assumes that only 0.1% (including non-gendered) or 8.9% (excluding non-gendered) of the new articles they create are about women, it means that the remaining non-“super users”, who have created only 40% of the existing articles, need to fill the gap. And existing research on Wikipedia editor recruitment and retention suggests this is just not a feasible solution. Despite all the efforts to recruit and retain editors, it just isn’t happening. More and more articles are being created by “super users”, and there is no growth pattern that suggests this option of relying on new users is feasible.
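To put a rough number on how much the remaining contributors would have to compensate, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are mine and not established by the data: that the 60%/40% split of article creation between “super users” and everyone else also holds for gendered articles, that “super users” keep 8.9% of their gendered articles about women, and that the target is parity (50%) among gendered articles. The function name `required_rest_share` is hypothetical.

```python
def required_rest_share(target, super_share, super_fraction=0.6):
    """Share of women's articles the non-"super users" would need to write
    for the overall share to reach `target`.

    Solves: target = super_fraction * super_share + rest_fraction * rest_share
    """
    rest_fraction = 1.0 - super_fraction
    return (target - super_fraction * super_share) / rest_fraction


# Assumed inputs: parity target, 8.9% share for "super users",
# and "super users" creating 60% of gendered articles.
needed = required_rest_share(target=0.5, super_share=0.089)

print(f"Share needed from everyone else: {needed:.1%}")
print("Feasible" if needed <= 1.0 else "Not feasible: exceeds 100%")
```

Under these assumptions the required share comes out above 100%, which is one way of seeing why relying on everyone else to close the gap on their own does not add up.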

The third issue is that relying almost exclusively on new contributors to create new content about women as a way of offsetting the gender imbalance does nothing to address perception problems related to Wikipedia being male and cliquey. Using business jargon, the Wikimedia Foundation provides a service: free knowledge for public consumption. The service has stakeholders, a key group of which are the elite content creators. The “super users” in this elite content-creating group provide 60% of Wikipedia’s content. They provide most of the material for public consumption by another of Wikimedia’s key stakeholder groups, colloquially known as readers. In this area, the two groups of key Wikimedia stakeholders are actually acting counter to the goal of the Foundation, because one group is not providing information that the other wants. Worse yet, the behaviors of one group (or at least the perception of those behaviors) hurt the ability of the Wikimedia Foundation to grow its readership and to grow another stakeholder group, regular and new contributors. One of the ways to offset the gender imbalance that creates this perception problem and lack of information is to change not reader desires but the behavior of the super users who are perceived as “being” Wikipedia. And after these super users create the articles about women, highlight them and talk up their work.

English Wikipedia’s top content creators play a role in perpetuating gender bias on the project, and steps should be taken to do more research on the project and to understand the implications of what this means in a broader gender gap perspective.



  1. Interesting trivia to munch, but little nutritional value IMHO. Everyone in every field always complains that there are too many articles about pokémon and French municipalities, too few about X very important topic/aspect. All such considerations are worthless because they don’t consider the actual relative *impact* of those articles, e.g. how many other people edited or discussed them after creation and how many page views they had.

    One could even argue that the “perception problems related to Wikipedia being male and cliquey” is made worse by posts like this, 😉 but that would be trolling. I don’t see how the mere number of articles on one topic or another is going to make Wikipedia feel “too male” or “anti-female” to a normal user who certainly doesn’t notice such things. If you have to dig for it, it doesn’t contribute to public perception; at most it can be a possible symptom of something else that may be contributing to public perception.

    Even disregarding the impact, to assess the bias of the contributors themselves, more precise research, comparative in nature, would be needed. For instance, if one writes articles on parliament members in country X, and 70 % of articles are about males, that’s only biased if the actual percentage of male MPs is less than 70 %. The same should be done with all the sources for each topic.

    And it’s nothing compared to the systemic bias towards the western and anglo-saxon point of view which writing in English and using (mostly online) English sources encourages, let alone languages less global in nature.

  2. This is actually a fair point. I discussed it via e-mail with another person. Just because 10% of articles about politicians are about women does not mean there is a gender gap in coverage of politicians. (In fact, it could be the inverse to a degree. 95% of the politicians could be male, people could have made special efforts to create articles about female politicians and they are now overrepresented.) There is just not a good way to assess the shortfall on a large scale and so produce a useful large-scale analysis.

    That said, there is a lot of growth for women’s related content and the problems of the lack of women are visible and highly known. Many women I know personally cite this, the lack of editing of these articles, attempts to create articles about women and having them deleted as reasons not to edit. There is no simple solution, but by encouraging the most active contributors to write more, it could possibly change the perception by changing the focus and putting it on a narrow group instead of a huge group of casual editors where the ROI on engagement and perception of change would be slower and lower.
