Friday, March 14, 2014

Cloning your Windows installation on GPT partition tables

If you are thinking of cloning your Windows installation to a bigger / better hard drive (or even a smaller SSD), you will first of all have to consider the difference between "classical" (MBR) partition tables, which have been around for ages, and the new, fancy "GPT" partition tables with EFI partitions, which come with UEFI, the "new BIOS" (there are also "OEM partitions", which I hope to figure out as well).
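If you're not sure which kind of disk you're looking at, a minimal sketch like the one below can tell the two apart by peeking at the first sector: a GPT disk carries a "protective MBR" whose first partition entry has type 0xEE. The device path and the 512-byte sector assumption are mine, and raw disk access needs administrator rights.

```python
# Minimal sketch: tell GPT from a classic MBR disk by peeking at the first sector.
# A GPT disk ships with a "protective MBR" whose first partition entry has type 0xEE
# (the real GPT header then lives at LBA 1, starting with the signature "EFI PART").
# Assumptions: first physical disk, 512-byte sectors, admin/root rights available.

import sys

DEVICE = r"\\.\PhysicalDrive0" if sys.platform == "win32" else "/dev/sda"

with open(DEVICE, "rb") as disk:
    mbr = disk.read(512)

# The four 16-byte partition entries start at offset 0x1BE; the type byte is at +4.
types = [mbr[0x1BE + 16 * i + 4] for i in range(4)]

if 0xEE in types:
    print("Protective MBR entry found -> GPT partition table")
elif any(t != 0 for t in types):
    print("Classic MBR partition table")
else:
    print("No partition entries found - empty disk or something more exotic")
```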

I had read about all this superficially at some point, but mainly as a challenge to Linux, and without paying attention to possible Windows repercussions up-front. Like pretty much everyone else, I could only afford such luxuries as a teenager... Anyway, you'd better google any details about all this if you're interested; this is more about early awareness and documenting related practicalities as I discover them.

There are many free tools allowing you to clone Windows even as it runs, but most won't work well with the shiny new GPT partition structures. And yet, even if you only need to clone classical partitions, you'd better stay away from Miray's HDClone. It does do the job well, but its free version is artificially slowed down to help sell the commercial editions. It used to be my tool of choice before I discovered how miserably it failed with GPT partitions.

Anyway, now I know better, and for cloning disks with classical partition tables I'd use AOMEI Backupper. It's really fast, easy to use, and Chinese - they're very open about the latter, so I suppose we're good on the paranoia front. Still, it won't do too well for GPT-type partitions (EFI, OEM, or even recovery), so on to more experiences.

In order to get these cloned adequately, I ended up paying the euro equivalent of roughly 20 USD for Paragon's Migrate to SSD, a horribly branded, but fairly proficient cloning tool. Downside: it clones only the system partition, the main recovery partition and the OS partition itself. Additional recovery, OEM and primary partitions I had were completely ignored. Note: it apparently allows you to do selective cloning in case you're moving a big, well-filled Windows partition over to a small SSD by leaving non-OS data behind, so if that's your use case it may be just the tool you need. In my case, trying to fill the "missing partition" gap with diskpart and AOMEI didn't succeed, so I grumbled, went and searched again.

And finally, at long last, it seems that I succeeded: the free-for-private-use edition of Macrium Reflect at least copied everything just fine, with a confirmation of successful boot-up still pending at this point. It may have some forgivable UI deficiencies, and it is undeniably slower than AOMEI (120 instead of 90 minutes for 1 TB over USB 3.0), but at the end of the day it did the job. Two thumbs up! I'll report back in case it doesn't boot... :-)

Saturday, March 8, 2014

Improving file sync & copy duration predictions

File sync and file copy duration predictions suck. At least they do everywhere I have encountered them, which is mostly on Windows systems. I've done my share of *nix and Linux, but it never involved massive copying of data. Every time I switch computers this is the "task du jour" though, and it invariably involves getting annoyed about the unstable and unreliable predictions of how much time is left to back up and/or verify my data. That's been the case for some 20+ years now, so it's about time things changed.

At least the latest version of FreeFileSync shows traces of improvement - it essentially offers two parallel "burndown" charts, one for the volume of data (raw MBs / GBs / TBs), and one for the raw number of files. I've posted a screenshot of my ongoing "XP netbook backup in preparation of new OS" operation below.


Unlike the official screenshot of the feature, which is unrepresentatively small and linear, a typical sync or copy operation involves many gigabytes' worth of data, stored in thousands of files of very different sizes. As a result, you will have areas where your typical "bytes per second" volume metric drops because the folder being processed is dominated by tons of tiny files, and the hard disk's seek time becomes the bottleneck. At the other extreme are areas with gigabyte-sized files, where the transfer just streams through (somewhat depending on fragmentation), and the bottlenecks are read, write and transfer bandwidths.

It's in the former areas that classical "predictors" often forecast days and months of copy delay, disregarding the fact that the "tiny file" count also needs to be burned down at some point, and that storage size / volume throughput doesn't matter too much in these areas.

What does count for the end-user is when the overall operation is finished, which is not when the unstable, throughput-biased predictor thinks it is.
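For reference, this is roughly what such a throughput-biased predictor boils down to (a minimal sketch; the names and the numbers in the example are made up):

```python
# Minimal sketch of the classic, throughput-biased predictor criticised above:
# divide the remaining bytes by the *current* byte throughput, and that's the ETA.
# As soon as the copy crawls through a folder of tiny files, bytes_per_second
# collapses and the estimate explodes - even if little volume is actually left.

def naive_eta_seconds(bytes_remaining, bytes_per_second):
    """Remaining time = remaining volume / current volume throughput."""
    if bytes_per_second <= 0:
        return float("inf")
    return bytes_remaining / bytes_per_second

# Made-up example: 50 GB still to go, currently stuck at 2 MB/s in a tiny-file area.
print(naive_eta_seconds(50e9, 2e6) / 3600)   # ~6.9 hours predicted from a local dip
```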

So what can be done? It should be really simple, given that every sync/copy tool I know indexes the work ahead of the actual operation. It therefore knows the file size distribution in advance and can easily relate it to the actuals measured as it goes along. Instead of only looking at the current throughput (bytes per second) and extrapolating that local value until the bitter end, it could:
  1. look at the "files per second" throughput in parallel, extrapolate this as well and average the result - however, that's still locally biased and implicitly assumes that byte throughput and file throughput are equally relevant to the runtime
  2. assess current bytes/second and files/second, and compute a weighted average prediction based on how many files and how many bytes are still left to go - still locally biased, and somewhat dependent on the balance in the remaining population of files, but at least it's a first step of improvement (see the sketch after this list)
  3. compute weighted extrapolations based on the entire work done so far, both along the "files" and "bytes" dimension - more complex, but less local than 2.
  4. compute some simple linear regressions: time_left = f(file_size, file_count) - sounds fancy, but it's not really that much more sophisticated than dumb extrapolation (also sketched after this list)!
  5. compute optimal estimates with other fancy algorithms - most fun for the data scientist in me, and promising best predictive value, but it needs a good data basis for model optimisation. Anyone interested in shooting me detailed logs for crunching? :-)
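Below is a minimal sketch of approaches 2 and 4. The function names, the progress-sample format and the exact weighting scheme are my own illustration, not taken from FreeFileSync or any other tool; everything it needs is information a sync tool already has after indexing the work up-front.

```python
import numpy as np

def weighted_eta(bytes_left, files_left, bytes_total, files_total,
                 bytes_per_sec, files_per_sec):
    """Approach 2: one ETA per dimension, blended by how much of each
    dimension is still left. The weighting scheme is an illustrative choice."""
    eta_bytes = bytes_left / bytes_per_sec if bytes_per_sec > 0 else float("inf")
    eta_files = files_left / files_per_sec if files_per_sec > 0 else float("inf")
    w_bytes = bytes_left / bytes_total      # share of the volume still to go
    w_files = files_left / files_total      # share of the file count still to go
    return (w_bytes * eta_bytes + w_files * eta_files) / (w_bytes + w_files)


def regression_eta(samples, bytes_total, files_total):
    """Approach 4: fit elapsed_time ~ a*bytes_done + b*files_done + c on the
    progress samples recorded so far, then evaluate the fit at 100 % done.
    `samples` is a list of (bytes_done, files_done, seconds_elapsed) tuples."""
    bytes_done, files_done, elapsed = (np.array(c, dtype=float) for c in zip(*samples))
    design = np.column_stack([bytes_done, files_done, np.ones_like(elapsed)])
    coeffs, *_ = np.linalg.lstsq(design, elapsed, rcond=None)
    predicted_total = coeffs @ np.array([bytes_total, files_total, 1.0])
    return max(predicted_total - elapsed[-1], 0.0)   # time left, never negative


# Made-up numbers: a 1 TB / 200k file job, currently stuck in a tiny-file area
# (5 MB/s, 40 files/s) with 300 GB and 150k files still to go.
print(weighted_eta(300e9, 150_000, 1e12, 200_000, 5e6, 40) / 3600, "hours left")

samples = [(100e9, 10_000, 1800), (400e9, 25_000, 5400), (700e9, 50_000, 9000)]
print(regression_eta(samples, 1e12, 200_000) / 3600, "hours left")
```

The specific formulas matter less than the principle: both estimators take the remaining file count and the remaining volume into account instead of extrapolating a single local throughput figure.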
Even the best sync software (e.g. FreeFileSync) doesn't do this well right now - in a job running as I write this, it insists on needing two more days for a verify op that has so far taken 3 hours for 25 % of the volume. Don't know about you, but I'd rather expect to have about 9 hours left to wait (a quarter of the volume in 3 hours suggests roughly 12 hours in total)...

Rant over, let's get this fixed!