Wikipedia talk:Modelling Wikipedia's growth


Missing word

In section Relationship of Usenet cites to article growth:

the official article count for the en: Wikipedia appears to

Should that be the following?

the official article count for the English Wikipedia appears to

--Mortense (talk) 00:24, 3 January 2011 (UTC)

Growth trend change in 2006

Is there an explanation for the change in growth trend that happened in 2006, and for the seasonal modulation that started that year? Was there an event that led to it? (talk) 14:00, 20 February 2011 (UTC)

Looks more like there was a low-growth condition between 2003 and 2006 when Wikipedia was relatively unknown and there weren't enough contributors. Since 2006 it looks like the number of remaining articles that people know how to write has been shrinking: we're mainly running out of low-hanging fruit (i.e. new articles). (talk) 18:23, 20 February 2011 (UTC)

Sue Gardner's appeal at MetaWiki

Hi. I suggest including the appeal, which contained many recent statistics about Wikipedia in addition to facts about declines in new editor retention. Thanks. ~AH1(TCU) 00:28, 31 March 2011 (UTC)

Looks like the formula may need to be adjusted soon

Since the natural growth is probably closer to the fit than the line (the 2002 bump was just a bot automaking placeholder articles, most of which would've been created anyway) the course the line now wants to take is even more deviant than it seems. — Preceding unsigned comment added by (talk) 06:22, 2 October 2011 (UTC)

Wikipedia growth: Growth in total article text in English Wikipedia, measured in gigabytes (compressed). (Data from en:Wikipedia:Database download)

Wikipedia's growth involves more than the "article count"...

I don't know if this has been talked about here, but why the almost singular focus on the number of Wikipedia articles? Maybe that focus appeals to some people because it is easy to measure and think about, and they think it means more than it does.

When an individual article expands, that is also Wikipedia growth. Adding more articles, each less relevant or important than the last, is not so important.

Again, when articles get better, more accurate, more referenced, that is also Wikipedia growth. Mitchitara (talk) 14:42, 16 November 2011 (UTC)

I'd like to see Page View statistics (accessed from the History Tab) officially supported, and a history of total Page View statistics made available. This would show how Wikipedia traffic is growing year on year. It would also be useful to be able to download a list showing at least the most popular and least popular pages. Improving the most popular pages improves the Wikipedia experience for the largest number of people. Conversely, knowing which pages are currently not relevant or useful to a majority of people is also useful information -- they could be flagged as needing improvement, and maybe deleted if they are not improved. LittleBen (talk) 00:55, 5 June 2012 (UTC)

More recent information

Hi, on this thread on VP a related question was asked. Suggestions will be appreciated. History2007 (talk) 20:28, 4 June 2012 (UTC)

Adjusting the data set to remove the effects of Rambot and the great server slowdown?

There were two major artifacts that affected the initial growth of Wikipedia. Firstly, there was the great server slowdown, when Wikipedia's initial server effectively ground to a halt for several months. Shortly after that, Rambot dumped a very large number of U.S. place articles into the encyclopedia.

I propose to do the following:

  • removing the samples in the time period involved from the data set for the purposes of fitting
  • bringing sample times forward for the samples before that window by the length of the server slowdown period, to make up for the dead-time where natural processes were stopped
  • subtracting the number of Rambot articles from the article count for samples after that window, effectively treating them as a separate static population of articles
  • re-fitting the dataset after these two alterations have been made


Estimates made by eye, from the graphs above:

  • I estimate that the great server slowdown lasted from 2002.25 to 2002.7
  • I estimate that there were about 50,000 Rambot articles, mostly created in the period 2002.7 to 2003.

This gives the following parameters:

  • Time window to be removed: 2002.25 to 2003
  • Time adjustment before window: move later by 0.45 years
  • Article adjustment after window: decrease by 50,000 articles
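
The adjustment described above could be sketched roughly as follows. This is only a minimal illustration, assuming the samples are available as (decimal year, article count) pairs; the function and constant names are hypothetical, and the numbers are the by-eye estimates listed above, not measured values.

```python
# Hypothetical sketch of the proposed data-set adjustment.
# All constants are the by-eye estimates from the discussion above.
WINDOW_START = 2002.25     # start of the server slowdown
WINDOW_END = 2003.0        # end of the main Rambot creation period
TIME_SHIFT = 0.45          # years; estimated length of the slowdown (2002.25 to 2002.7)
RAMBOT_ARTICLES = 50_000   # estimated number of Rambot articles


def adjust_samples(samples):
    """Return adjusted (year, article_count) pairs for fitting.

    - samples inside the window are dropped
    - samples before the window are moved later by TIME_SHIFT
    - samples after the window have RAMBOT_ARTICLES subtracted
    """
    adjusted = []
    for year, count in samples:
        if WINDOW_START <= year <= WINDOW_END:
            continue  # inside the removed window
        if year < WINDOW_START:
            adjusted.append((year + TIME_SHIFT, count))
        else:
            adjusted.append((year, count - RAMBOT_ARTICLES))
    return adjusted
```

For example, with invented sample values, `adjust_samples([(2002.0, 20000), (2002.5, 40000), (2004.0, 200000)])` drops the middle sample, moves the first to about 2002.45, and reduces the last count to 150000. The model would then be re-fitted to the returned list.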

Fitting to come...

  • Result #1: a pure Gompertz curve without the extra "d" term is now a staggeringly good fit for everything up to mid-2011, but the recent divergence re-emerges
  • Result #2: a tweaked Gompertz curve with the extra "d" term is now a staggeringly good fit for everything from 2004 up to the present, and the mis-fitting at the start, while present, is now much less than before

-- The Anome (talk) 14:47, 12 June 2012 (UTC)
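
For readers following along, the pure Gompertz curve mentioned in Result #1 has the standard form N(t) = a·exp(−b·exp(−c·t)). The sketch below evaluates that curve and its growth rate. The parameter names a, b and c are assumptions, since the exact parameterisation used on the project page is not quoted in this thread, and the extra "d" term is deliberately left out because its form is not given here.

```python
import math


def gompertz(t, a, b, c):
    """Standard Gompertz curve: a * exp(-b * exp(-c * t)).

    a is the upper asymptote, b shifts the curve along t,
    and c sets how quickly growth saturates.
    Parameter names are illustrative assumptions.
    """
    return a * math.exp(-b * math.exp(-c * t))


def gompertz_rate(t, a, b, c):
    """Analytic growth rate dN/dt = b * c * exp(-c * t) * N(t)."""
    return b * c * math.exp(-c * t) * gompertz(t, a, b, c)
```

Fitting a, b and c to the adjusted samples would need a nonlinear least-squares routine, which is not shown here.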

I just checked your tweak of the Gompertz model with the exponential term. I think this is really good. With one extra parameter the recent deviation is compensated really well. The result is a more or less linear growth of about 20,000 articles a month. We see linear growth in the German Wikipedia for a much longer time (since 2006 or so), so it might be possible to describe the German Wikipedia with this model as well. HenkvD (talk) 12:31, 17 June 2012 (UTC)
The modified formula implies that the rate of growth will increase. There is no real evidence of this happening so far. Instead I would try the formula , which takes account of the suggestion above that there is a steady increase in the number of articles that would be considered "notable".
I would not worry too much about treating the Rambot articles specially. I watch the current year's growth rate number at WP:Size_of_Wikipedia, and whenever it increases, there is someone creating a block of stub articles. They might be about species of snail, minute settlements in Poland, rocks in Antarctica, or minor astronomical objects (about 5000 were created last autumn before a new guideline, WP:NASTRO, was created to allow them to be deleted/redirected/merged). Today, user User talk:Dr. Blofeld is creating a few thousand stubs on villages in Turkey to reach 4 million articles. So mass article creation is just part of the process that you are trying to model. JonH (talk) 10:55, 13 July 2012 (UTC)
In general, yes, cascades of mass article creation are part of the phenomenon being modelled, and I have no problem with that at all. However, the Rambot event is special. Firstly, it was a huge perturbation relative to the size of enwiki at the time. Secondly, it took place immediately after a severe server slowdown, which almost completely brought Wikipedia editing to a halt for an extended period. Combined, as two separate large-scale effects not related to any other part of the system, each on a scale seen nowhere else in the entire dataset across all wikis, one affecting time and the other affecting article count, they confound fitting just about any model. I've managed to get some good fits by making ad-hoc patches to the data to "reconstruct" it by making plausible models of what might have happened if neither of these effects were present, but to do so in a real modelling attempt would be unscientific, as it introduces more tweak parameters than it explains. -- The Anome (talk) 00:12, 14 July 2012 (UTC)

Usenet section should go

The Usenet section is now a whole decade out of date. The era of Usenet is over, anyway; we're long past the point where it was helpful to use in any kind of analysis. I propose that we remove this section. If nobody's objected after a while, I'll do it. — Scott talk 16:49, 27 October 2013 (UTC)

As there's been no objection in the last two months, this is now done. — Scott talk 11:31, 30 December 2013 (UTC)

Request for Updated Graphs/Models

If anyone has the correct qualifications (I know I don't), these should all be reevaluated, or at least the graphs should be redone. There are lots of really interesting graphs (e.g. the one with edits/page) that haven't been updated in 2-6 years. This shouldn't be too difficult for someone with access to the data and the ability to graph it. It would also allow a little more investigation into whether the latest models (Gompertz/Refined Gompertz) continue to predict Wikipedia's growth. Jacob Cutts (talk) 06:50, 8 September 2014 (UTC)

It is 2018 (almost 2019), and I think this article should be updated again. At the moment, it is sitting at 2015. WinterSpw (talk) 19:29, 29 September 2018 (UTC)
January 2018 update (with rough figures)
In January 2018 I made an update (with rough figures), but I stopped updating my graphs, as the graphs seem quite predictable by now. Maybe you can find updated stats on HenkvD (talk) 18:30, 30 September 2018 (UTC)