
Saturday, 27 September 2014

Productivity management of case-based knowledge work - a possible home in the emerging "Information Ergonomics"

So last week I went to the i-Know 2014 in Graz (Austria) for the first time ever. It featured a workshop on "Information Ergonomics", whose description suggested it to be
  1. very interesting
  2. fairly novel
  3. an adequate forum to present my forthcoming PhD dissertation on knowledge work productivity (which still appears to be fairly unique in the present research environment)
And indeed, I was not disappointed - in an incredibly productive half-day session, we presented our various takes on what "Information Ergonomics" could be, what makes it unique and where it could be placed on a future research agenda. See our draft flip-chart notes below (click to enlarge).



Unfortunately, due to my "workshop only" contribution to the conference, my final paper didn't make it into the organisers' central repository ahead of the event - which is why the workshop web page is sparser on the details of my paper than it is on the others.

To remedy this unfortunate situation, I hereby present my full paper below:

A predictive analytics approach to derive standard times for productivity management of case-based knowledge work: a proof-of-concept study in claims examination of intellectual property rights (IPRs)

There'll be more about it in my dissertation before the end of the year; for now, enjoy what you've got and stay tuned. There should also be a joint "Information Ergonomics" paper before the next i-Know. Interesting times...

P.S.: Just for the record - as the "Design Thinking" keynote @ i-Know suggested (and Wikipedia confirms), our 2005 creation of the "collaborative Advanced Design Project (cADP)" (documented in our 2007 ICED paper) is a very early adoption of the 2004 "d.school" approach to teaching - and it's still going strong! Kudos to my co-authors!

Thursday, 29 May 2014

Google Scholar hit counts apparently imprecise above 25k

In an effort to position my forthcoming PhD work appropriately, I have (once more) taken the liberty of looking at related keyword popularities as measured by Google Scholar hits. Done systematically, however, the exercise is actually not too encouraging for the more popular fields and terms. Judging from the results I obtained, it would seem that hit counts beyond the 25k mark range from "rough imprecision" to "total guesswork". But let's look at my (admittedly simplistic) method:
  1. Type in a search term (e.g. "knowledge work" - excluding patents and citations)
  2. Limit results to "since 2014" and enter the count into a "raw count" spreadsheet cell
  3. Repeat 2. for "since 2013" and "since 2010"
  4. Compute "2014 (extrapolated)" via "since 2014 (raw)" * 12 / 5 (scaling the five months up to end of May to a full year)
  5. Compute "2013" via "since 2013 (raw)" - "since 2014 (raw)" (straightforward & robust to 2014 indexing lag)
  6. Compute "2012/11/10 (avg.)" via ( "since 2010 (raw)" - "since 2013 (raw)" ) / 3 (also straightforward & robust)
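The bookkeeping above can be sketched in a few lines. The raw counts passed in are placeholder arguments, not actual Google Scholar numbers:

```python
def yearly_counts(since_2014, since_2013, since_2010, months_elapsed=5):
    """Derive per-year hit counts from cumulative 'since YYYY' totals.

    Each 'since YYYY' total covers that year up to the present, so
    subtracting adjacent totals isolates individual years.
    """
    return {
        # scale the partial year (here: through end of May) to a full year
        "2014 (extrapolated)": since_2014 * 12 / months_elapsed,
        # 'since 2013' minus 'since 2014' leaves just 2013
        "2013": since_2013 - since_2014,
        # 'since 2010' minus 'since 2013' leaves 2010-2012; average over 3 years
        "2012/11/10 (avg.)": (since_2010 - since_2013) / 3,
    }
```

With placeholder inputs of 1,000 / 3,000 / 9,000 this yields 2,400 (extrapolated 2014), 2,000 (2013) and 2,000 (2010-2012 average).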
Should work fine, shouldn't it? Unfortunately, it doesn't - at least not once the reported hit counts exceed ~25,000 (see below / click to zoom).


Assuming no seasonality and no indexing lag, "macroergonomics" and "service engineering" are evidently stable, "knowledge work" trends slightly down, "work productivity" up - and yes, assumptions, goodness-of-fit and other objections are noted, evidently it's merely a rough yardstick. However, what these objections cannot explain are the curious results of my big-number benchmark terms "ergonomics", "computer" and "algorithm", whose hit count deltas fluctuate rather wildly in individual years prior to 2014.

How very strange... Had these been 2014 figures, I would have blamed an (expected) indexing lag, but for this effect I'm stumped for causal interpretations. The only explanation I can offer is "Google guesswork" (read: very rough approximation) when it comes to hit counts, or inadequate date range (or content) filters. So far, googling the phenomenon has not yielded usable information, but I'll definitely dig around a little more and update this post as appropriate. Now out to the world with you!

Saturday, 8 March 2014

Improving file sync & copy duration predictions

File sync and file copy duration predictions suck. At least they do everywhere I have encountered them, which is mostly on Windows systems. I've done my share of *nix and Linux, but it never involved massive copying of data. Every time I switch computers this is the "task du jour", though, and it invariably involves getting annoyed at the unstable and unreliable predictions of how much time is left to back up and/or verify my data. That's been the case for some 20+ years now, so it's about time things changed.

At least the latest version of FreeFileSync shows traces of improvement - it essentially offers two parallel "burndown" charts, one for the volume of data (raw MBs / GBs / TBs) and one for the raw number of files. I've posted a screenshot of my ongoing "XP netbook backup in preparation for a new OS" operation below.


Unlike the official screenshot of the feature, which is unrepresentatively small and linear, a typical sync copy operation involves many gigabytes' worth of data, stored in thousands of files of very different sizes. As a result, there will be stretches where the typical "bytes per second" volume metric drops because the folder being processed is dominated by tons of tiny files, and the hard disk's seek time becomes the bottleneck. At the other extreme are stretches with gigabyte-sized files, where the transfer just streams through (somewhat depending on fragmentation), and the bottlenecks are read, write and transfer bandwidths.
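A toy model makes the tiny-file effect concrete. The 10 ms per-file seek overhead and 100 MB/s streaming bandwidth below are invented illustrative numbers, not measurements:

```python
def copy_time(n_files, avg_size_bytes, seek_s=0.01, bandwidth_bps=100e6):
    """Toy model: each file costs a fixed per-file overhead (seek, metadata)
    plus its streaming time at the given bandwidth."""
    return n_files * (seek_s + avg_size_bytes / bandwidth_bps)
```

With these numbers, 10,000 files of 10 KB (100 MB in total) take ~101 seconds - an effective throughput of ~1 MB/s - while a single 100 MB file takes ~1 second at close to full bandwidth.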

It's in the former areas that classical "predictors" often forecast days and months of copy delay, disregarding the fact that the "tiny file" count also needs to be burned down at some point, and that storage size / volume throughput doesn't matter too much in these areas.

What does count for the end-user is when the overall operation is finished, which is not when the unstable, throughput-biased predictor thinks it is.

So what can be done? It should be really simple, given that every sync/copy tool I know indexes the work ahead of the actual operation. It therefore knows the file size distribution in advance and can easily relate it to the actuals measured as it goes along. So instead of only looking at the current throughput (bytes per second) and extrapolating its local value until the bitter end, it could:
  1. look at the "files per second" throughput in parallel, extrapolate this as well and average the result - however, that's still locally biased and implicitly assumes that byte throughput and file throughput are equally relevant to the runtime
  2. assess current bytes/second and files/second, and compute a weighted average prediction based on how many files and how many bytes are still left to go - still locally biased, and somewhat dependent on the balance in the remaining population of files, but at least it's a first step of improvement
  3. compute weighted extrapolations based on the entire work done so far, both along the "files" and "bytes" dimension - more complex, but less local than 2.
  4. compute some simple linear regressions: time_left = f(file_size, file_count) - sounds fancy, but it's not really that much more sophisticated than dumb extrapolation! 
  5. compute optimal estimates with other fancy algorithms - most fun for the data scientist in me, and promising best predictive value, but it needs a good data basis for model optimisation. Anyone interested in shooting me detailed logs for crunching? :-)
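Option 2 from the list above can be sketched as follows - assuming the tool can sample current byte and file throughput and knows the totals from its pre-op index (all names and the weighting scheme are my hypothetical reading, not any real tool's API):

```python
def eta_weighted(bytes_left, bytes_total, files_left, files_total,
                 bytes_per_sec, files_per_sec):
    """Blend the two naive forecasts, weighting each by the share of its
    dimension still outstanding, so whichever kind of work dominates the
    remainder dominates the ETA."""
    t_bytes = bytes_left / bytes_per_sec   # naive volume-based forecast
    t_files = files_left / files_per_sec   # naive file-count-based forecast
    w_bytes = bytes_left / bytes_total     # fraction of volume remaining
    w_files = files_left / files_total     # fraction of files remaining
    return (w_bytes * t_bytes + w_files * t_files) / (w_bytes + w_files)
```

For example, with half the volume but only a tenth of the files left, the volume-based forecast gets five times the weight of the file-based one - a crude heuristic, but already less jumpy than extrapolating bytes per second alone.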
Even the best sync software (e.g. FreeFileSync) doesn't do this well right now - in a job running as I write this, it insists on needing two more days for a verify op that has so far taken 3 hours for 25 % of the volume. Don't know about you, but I'd rather expect to have 9 hours left to wait...
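For the record, the 9-hour figure is just plain proportional extrapolation over the volume done so far:

```python
def simple_eta(elapsed_hours, fraction_done):
    """Proportional extrapolation: total time = elapsed / fraction done,
    remaining = total - elapsed. E.g. 3 h for 25 % implies 9 h left."""
    total = elapsed_hours / fraction_done
    return total - elapsed_hours

simple_eta(3, 0.25)  # → 9.0, not two days
```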

Rant over, let's get this fixed!