Metrics, predictions and stuff

Saturday, 27 September 2014

Productivity management of case-based knowledge work - a possible home in the emerging "Information Ergonomics"

So last week I went to the i-Know 2014 in Graz (Austria) for the first time ever. It featured a workshop on "Information Ergonomics", whose description suggested it to be

very interesting
fairly novel
an adequate forum to present my forthcoming PhD dissertation on knowledge work productivity (which still appears to be fairly unique in the present research environment)

And indeed, I was not disappointed - in an incredibly productive half-day session, we presented our various takes on what "Information Ergonomics" could be, what makes it unique and where it could be placed on a future research agenda. See our draft flip-chart notes below.

Unfortunately, due to my "workshop only" contribution to the conference, my final paper didn't make it to the organisers' central repository ahead of the event - which is why the workshop web page is sparser on the details of my paper than on those of the others.

To remedy this unfortunate situation, I hereby present my full paper below:

A predictive analytics approach to derive standard times for productivity management of case-based knowledge work: a proof-of-concept study in claims examination of intellectual property rights (IPRs)

There'll be more about it in my dissertation before the end of the year, for now enjoy what you've got and stay tuned. There should also be a joint "Information Ergonomics" paper before the next i-Know. Interesting times...

P.S.: Just for the record - as the "Design Thinking" keynote @ i-Know suggested (and Wikipedia confirms), our 2005 creation of the "collaborative Advanced Design Project (cADP)" (documented in our 2007 ICED paper) is a very early adoption of the 2004 "d.school" approach to teaching - and it's still going strong! Kudos to my co-authors!

Thursday, 29 May 2014

Google Scholar hit counts apparently imprecise above 25k

In an effort to appropriately position my forthcoming PhD work, I have (once more) taken the liberty to look at related keyword popularities as measured by Google Scholar hits. If done systematically, however, the findings are actually not too encouraging for the more popular fields and terms. Judging from the results I obtained, it would seem that hit counts beyond the 25k mark range from "rough imprecision" to "total guesswork". But let's look at my (admittedly simplistic) method:

1. Type in a search term (e.g. "knowledge work" - excluding patents and citations)
2. Limit results to "since 2014", enter the reported hit count into the "raw count" spreadsheet cell
3. Repeat step 2 for "since 2013" and "since 2010"
4. Compute "2014 (extrapolated)" via "since 2014 (raw)" * 12 / 5 (it being the end of May, i.e. 5 of 12 months elapsed)
5. Compute "2013" via "since 2013 (raw)" - "since 2014 (raw)" (straightforward & robust to 2014 indexing lag)
6. Compute "2012/11/10 (avg.)" via ( "since 2010 (raw)" - "since 2013 (raw)" ) / 3 (also straightforward & robust)
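For those who prefer code to spreadsheets, here is a minimal Python sketch of the same arithmetic. It assumes the partial year is scaled up by 12 / months elapsed, and that each yearly figure is the difference between two cumulative "since" counts. The function name `yearly_hits` and the numbers in the example call are made up for illustration; they are not actual Google Scholar results.

```python
def yearly_hits(since_2014, since_2013, since_2010, months_elapsed=5):
    """Turn cumulative "since YEAR" hit counts into per-year figures.

    months_elapsed is how much of 2014 has passed (5 = end of May).
    """
    return {
        # Scale the partial year up to a full year.
        "2014 (extrapolated)": since_2014 * 12 / months_elapsed,
        # "since 2013" covers 2013 and 2014, so subtract the 2014 part.
        "2013": since_2013 - since_2014,
        # "since 2010" minus "since 2013" leaves 2010, 2011 and 2012.
        "2012/11/10 (avg.)": (since_2010 - since_2013) / 3,
    }


# Hypothetical raw counts, purely for illustration.
for label, value in yearly_hits(500, 1700, 5300).items():
    print(label, value)
```

If the reported hit counts were consistent, the three figures should form a reasonably smooth series for any given term.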

Should work fine, shouldn't it? Unfortunately it doesn't though - at least not if the reported hit counts exceed ~25,000 (see below).

Assuming no seasonality and no indexing lag, "macroergonomics" and "service engineering" are evidently stable, "knowledge work" trends slightly down, "work productivity" up - and yes, assumptions, goodness-of-fit and other objections are noted, evidently it's merely a rough yardstick. However, what these objections cannot explain are the curious results of my big-number benchmark terms "ergonomics", "computer" and "algorithm", whose hit count deltas fluctuate rather wildly in individual years prior to 2014.

How very strange... If it had been 2014 I would have blamed an (expected) indexing lag, but for this effect I'm stumped for causal interpretations. The only explanation I can offer is "Google guesswork" (read: very rough approximation) when it comes to hit counts, or inadequate date range (or content) filters. So far, googling the phenomenon has not yielded usable information, but I'll definitely dig around a little more and update this post as appropriate. Now out to the world with you!

Friday, 14 March 2014

Cloning your Windows installation on GPT partition tables

If you are thinking of cloning your Windows installation to a bigger / better hard drive (or even a smaller SSD), you will first of all have to consider the difference between "classical" partition tables (which have been around for ages), and the new, fancy "GPT" partition tables with EFI partitions (which come with the "new BIOS" called UEFI; there are also "OEM partitions", which I hope to figure out as well).

I had read about all this superficially at some point, but mainly as a challenge to Linux, and without paying attention to possible Windows repercussions up-front. Like pretty much everyone else, I could only afford such luxuries as a teenager... Anyway, you'd better google any details about all this if you're interested; this is more about early awareness and documenting related practicalities as I discover them.

There are many free tools allowing you to clone Windows even as it runs, but most won't work well with the shiny new GPT partition structures. And yet, even if you only need to clone classical partitions, you'd better stay away from Miray's HDClone. It does do the job well, but its free version is artificially slowed down to help sell the commercial editions. It used to be my tool of choice before I discovered how miserably it failed with GPT partitions.

Anyway, now I know better, and for cloning disks with classical partition tables I'd use AOMEI Backupper. It's really fast, easy to use, and Chinese - they're very open about the latter, so I suppose we're good on the paranoia front. Still, it won't do too well for GPT-type partitions (EFI, OEM, or even recovery), so on to more experiences.

In order to get these cloned adequately, I ended up paying the euro equivalent of roughly 20 USD for Paragon's Migrate to SSD, a horribly branded, but fairly proficient cloning tool. Downside: It clones only the system partition, the main recovery partition and the OS partition itself. Additional recovery, OEM and primary partitions I had were completely ignored. Note: It apparently allows you to do selective cloning when moving a big, well-filled Windows partition over to a small SSD by leaving non-OS data behind, so if that's your use case it may be just the tool you need. In my case, trying to fill the "missing partition" gap with diskpart and AOMEI didn't succeed, so I grumbled, went and searched again.

And finally, at long last it seems that I succeeded: the free-for-private-use edition of Macrium Reflect at least copied everything just fine, with a confirmation of successful boot-up still pending at this point. It may have some forgivable UI deficiencies, and it is undeniably slower than AOMEI (120 instead of 90 minutes for 1 TB over USB 3.0), but at the end of the day it did the job. Two thumbs up! I'll report back in case it doesn't boot... :-)

Saturday, 8 March 2014

Improving file sync & copy duration predictions

File sync and file copy duration predictions suck. At least they do everywhere I have encountered them, which is mostly on Windows systems. I've done my share of *nix and Linux, but it never involved massive copying of data. Every time I switch computers this is the "task du jour" though, and it invariably involves getting annoyed about the unstable and unreliable predictions of how much time is left to back up and/or verify my data. That's been the case for some 20+ years now, so it's about time things changed.

At least the latest version of FreeFileSync shows traces of improvement - it essentially offers two parallel "burndown" charts, one for the volume of data (raw MBs / GBs / TBs), and one for the raw number of files. I've posted a screenshot of my ongoing "XP netbook backup in preparation of new OS" operation below.

Unlike the official screenshot of the feature, which is unrepresentatively small and linear, a typical sync copy operation involves many gigabytes worth of data, stored in thousands of files of very different sizes. As a result, you will have areas where your typical "bytes per second" volume metric drops because the folder being processed is dominated by tons of tiny files, and the hard disk's seek time becomes the bottleneck. At the other extreme are areas with gigabyte-sized files, where the transfer just streams through (somewhat depending on fragmentation), and the bottlenecks are read, write and transfer bandwidths.

It's in the former areas that classical "predictors" often forecast days and months of copy delay, disregarding the fact that the "tiny file" count also needs to be burned down at some point, and that storage size / volume throughput doesn't matter too much in these areas.

What does count for the end-user is when the overall operation is finished, which is not when the unstable, throughput-biased predictor thinks it is.

So what can be done? It should be really simple, given that every sync/copy code I know indexes the work ahead of the actual op. So it can know the file size distribution in advance, and can easily relate it to the actuals measured as it goes along. So instead of only looking at the current throughput (bytes per second) and extrapolating its local value until the bitter end, it could:

1. look at the "files per second" throughput in parallel, extrapolate this as well and average the result - however, that's still locally biased and implicitly assumes that byte throughput and file throughput are equally relevant to the runtime
2. assess current bytes/second and files/second, and compute a weighted average prediction based on how many files and how many bytes are still left to go - still locally biased, and somewhat dependent on the balance in the remaining population of files, but at least it's a first step of improvement
3. compute weighted extrapolations based on the entire work done so far, both along the "files" and "bytes" dimension - more complex, but less local than option 2
4. compute some simple linear regressions: time_left = f(file_size, file_count) - sounds fancy, but it's not really that much more sophisticated than dumb extrapolation!
5. compute optimal estimates with other fancy algorithms - most fun for the data scientist in me, and promising best predictive value, but it needs a good data basis for model optimisation. Anyone interested in shooting me detailed logs for crunching? :-)
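To make two of these ideas a bit more tangible, here is a rough Python sketch: a weighted bytes-and-files extrapolation over the work done so far, and a simple linear regression of per-file copy time on file size. None of this is taken from FreeFileSync or any other tool. All names (`Progress`, `eta_weighted`, `fit_cost_model` and so on) are hypothetical, and the weighting scheme (share of bytes left vs. share of files left) is merely one plausible assumption, not a tuned choice.

```python
from typing import NamedTuple


class Progress(NamedTuple):
    """Snapshot of a running copy job (hypothetical structure)."""
    elapsed_s: float   # seconds since the job started
    bytes_done: int
    bytes_total: int
    files_done: int
    files_total: int


def eta_bytes_only(p):
    """Classical predictor: extrapolate byte throughput alone."""
    return (p.bytes_total - p.bytes_done) / (p.bytes_done / p.elapsed_s)


def eta_weighted(p):
    """Blend byte-based and file-based extrapolations over all work so far.

    Assumes some progress in both dimensions (bytes_done, files_done > 0).
    Each estimate is weighted by the share of its dimension still left,
    which is an assumed weighting, not an optimised one.
    """
    bytes_left = p.bytes_total - p.bytes_done
    files_left = p.files_total - p.files_done
    eta_b = bytes_left / (p.bytes_done / p.elapsed_s)
    eta_f = files_left / (p.files_done / p.elapsed_s)
    w_b = bytes_left / p.bytes_total
    w_f = files_left / p.files_total
    if w_b + w_f == 0:
        return 0.0  # nothing left to do
    return (w_b * eta_b + w_f * eta_f) / (w_b + w_f)


def fit_cost_model(samples):
    """Least-squares fit of seconds = per_byte * size + per_file.

    samples is a list of (size_in_bytes, seconds) pairs for files already
    copied; it needs at least two different file sizes.
    """
    n = len(samples)
    mean_x = sum(size for size, _ in samples) / n
    mean_y = sum(secs for _, secs in samples) / n
    sxx = sum((size - mean_x) ** 2 for size, _ in samples)
    sxy = sum((size - mean_x) * (secs - mean_y) for size, secs in samples)
    per_byte = sxy / sxx
    per_file = mean_y - per_byte * mean_x
    return per_byte, per_file


def eta_regression(samples, bytes_left, files_left):
    """Apply the fitted per-byte and per-file costs to the remaining work."""
    per_byte, per_file = fit_cost_model(samples)
    return per_byte * bytes_left + per_file * files_left
```

The regression variant has a nice physical reading: the intercept is roughly the fixed overhead per file (seek time and the like), the slope is the streaming cost per byte. Whether that simple model holds up on real logs is exactly what would need testing.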

Even the best sync software (e.g. FreeFileSync) doesn't do this well right now - in a job running as I write this, it insists on needing two more days for a verify op that has so far taken 3 hours for 25 % of the volume. Don't know about you, but I'd rather expect to have 9 hours left to wait...

Rant over, let's get this fixed!