Thursday, May 29, 2014

Google Scholar hit counts apparently imprecise above 25k

In an effort to position my forthcoming PhD work appropriately, I have (once more) taken the liberty of looking at the popularity of related keywords as measured by Google Scholar hits. Done systematically, however, the findings are not too encouraging for the more popular fields and terms. Judging from the results I obtained, hit counts beyond the 25k mark seem to range from "rough imprecision" to "total guesswork". But let's look at my (admittedly simplistic) method:
  1. Type in a search term (e.g. "knowledge work" - excluding patents and citations)
  2. Limit results to "since 2014", enter the reported hit count into the "raw count" spreadsheet cell
  3. Repeat 2. for "since 2013" and "since 2010"
  4. Compute "2014 (extrapolated)" via "since 2014 (raw)" * 12 / 5 (five of twelve months have elapsed, it being the end of May)
  5. Compute "2013" via "since 2013 (raw)" - "since 2014 (raw)" (straightforward & robust to 2014 indexing lag)
  6. Compute "2012/11/10 (avg.)" via  ( "since 2010 (raw)" - "since 2013 (raw)" ) / 3 (also straightforward & robust)
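For anyone who prefers a script to a spreadsheet, here is a minimal Python sketch of the arithmetic. It assumes the 2014 figure is scaled from the five elapsed months up to a full year (factor 12/5) and that "2013" is the "since 2013" count minus the "since 2014" count. The function name, the dictionary labels and the input numbers are made up for illustration; they are not actual Google Scholar results.

```python
def yearly_estimates(since_2014, since_2013, since_2010, months_elapsed=5):
    """Turn cumulative 'since <year>' hit counts into per-year estimates.

    months_elapsed is the number of months of 2014 that have passed
    (5 at the end of May); it is used to scale 2014 up to a full year.
    """
    return {
        # Scale the partial 2014 count up to twelve months.
        "2014 (extrapolated)": since_2014 * 12 / months_elapsed,
        # 'since 2013' covers 2013 and 2014, so subtracting 2014 leaves 2013.
        "2013": since_2013 - since_2014,
        # 'since 2010' minus 'since 2013' covers 2010, 2011 and 2012.
        "2012/11/10 (avg.)": (since_2010 - since_2013) / 3,
    }


# Hypothetical raw counts, chosen only to make the arithmetic easy to follow.
estimates = yearly_estimates(since_2014=500, since_2013=1700, since_2010=5300)
for label, value in estimates.items():
    print(f"{label}: {value:.0f}")
```

With these made-up inputs all three estimates come out at 1200 per year, which is what a perfectly stable term would look like under this method.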
Should work fine, shouldn't it? Unfortunately it doesn't - at least not once the reported hit counts exceed ~25,000 (see below).

Assuming no seasonality and no indexing lag, "macroergonomics" and "service engineering" are evidently stable, "knowledge work" trends slightly down and "work productivity" up. And yes, objections regarding assumptions, goodness-of-fit and the like are noted; this is merely a rough yardstick. What these objections cannot explain, however, are the curious results for my big-number benchmark terms "ergonomics", "computer" and "algorithm", whose hit count deltas fluctuate rather wildly in individual years prior to 2014.

How very strange... Had this affected only 2014, I would have blamed an (expected) indexing lag, but for fluctuations in earlier years I'm stumped for a causal explanation. The only ones I can offer are "Google guesswork" (read: very rough approximation) when it comes to hit counts, or inadequate date range (or content) filters. So far, googling the phenomenon has not yielded any usable information, but I'll definitely dig around a little more and update this post as appropriate. Now out to the world with you!