My Guardian/Apple Time Capsule estimate
The Guardian Data Blog today asked a guess-how-many-sweets-in-the-jar-esque question: how many Time Capsules did Apple sell in a 5-month window in 2008?
The background to this bizarre competition is a limited recall of Time Capsules - the recall providing both a range of serial numbers and a time-frame during which units bearing these serial numbers were manufactured.
Additionally, a user-generated website, the Time Capsule Memorial Register (TCMR), captured the serial numbers of around 2,500 units from users who logged their problems with the site. Analysis of the serial numbers from TCMR should provide a crib to understand the structure of the 11-character serial number and, from this, let us have a stab at the number of units manufactured in the given window.
So, with the understanding that this is just a bit of fun and I'm rather rusty on my stats (I don't want to embarrass my former lecturers in the department of Physics at the University of Nottingham)...
31,000 units! (originally 32,000)
And now for my reasoning. I first loaded all the serial numbers from TCMR into a database with a column for each of the 11 characters. A simple query gave me a count of the number of times each character was used in each field (this result available on Google docs).
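For anyone who would rather skip the database, the same per-position tally can be sketched in a few lines of Python. This is an illustration rather than the query I actually ran: the function name `position_counts` is my own invention and the three serial numbers are made up, not real TCMR entries.

```python
from collections import Counter

def position_counts(serials, length=11):
    """Return one Counter per character position, tallying the characters seen there."""
    counts = [Counter() for _ in range(length)]
    for serial in serials:
        serial = serial.strip().upper()
        if len(serial) != length:
            continue  # skip entries that are not 11 characters long
        for pos, ch in enumerate(serial):
            counts[pos][ch] += 1
    return counts

# Made-up serial numbers for illustration only; not real TCMR data.
sample = ["6F80701AXYZ", "6F80712BXYZ", "6F81403CXYZ"]
counts = position_counts(sample)

# Index 5 is the 6th character of the serial number.
for ch, n in sorted(counts[5].items()):
    print(ch, n)
```

Dividing each tally by the number of valid serial numbers gives the percentage distribution for that position.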
The serial number contains digits 0-9 and letters A-Z, yet only the 7th and 8th characters in the serial number contain relatively even distributions of characters. I've highlighted that letters 'O' and 'I' don't appear to be used, which makes sense because they can easily be confused with zero and one respectively. A handful do appear in the data set but I've put that down to human error reading or inputting the data.
My first assumption is that the 2 characters at positions 7 and 8 form a counter using base-34 positional notation (10 digits plus 24 letters, once 'O' and 'I' are excluded). The maximum value that can be represented in this way is (34 x 34) - 1 = 1155.
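A minimal sketch of how such a counter would be decoded, assuming the symbols are ordered 0-9 and then A-Z with 'I' and 'O' skipped. That ordering is my guess; nothing in the data confirms it. The names `ALPHABET` and `decode_base34` are purely illustrative.

```python
# Assumed symbol order: digits first, then letters with I and O removed (34 symbols).
ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"

def decode_base34(text):
    """Convert a base-34 string to an integer under the assumed symbol order."""
    value = 0
    for ch in text.upper():
        # str.index raises ValueError for I, O or any other unexpected character
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value

print(len(ALPHABET))        # 34
print(decode_base34("10"))  # 34
print(decode_base34("ZZ"))  # 1155
```

Under this ordering a two-character counter runs from 00 to ZZ, which is 1,156 distinct values.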
There's almost certainly a checksum contained in the serial number. Companies use checksums to ensure that customers haven't made a mistake when entering a serial number, and also as a first level of validation. The uneven character distribution in character 11 is consistent with a checksum, given that the structured nature of the first 10 characters makes certain values of the check character more likely than others, and some values impossible.
Whether characters 9 and 10 also form part of the checksum or whether they embed other nuggets of information is unknown, but I'm guessing from the character distribution that these 2 are unlikely to form part of the sequence counter.
The 1st and 2nd characters are fixed - always 6F. This is likely a product code. And we can assume from the Apple support publication that the 3rd, 4th and 5th together form a batch number of some description, since manufacturing of most consumer electronics like this tends to be done in batches rather than round the clock on a dedicated plant.
And now to my somewhat rusty maths and character 6. I've concluded that this, together with characters 7 and 8, forms part of the counter, even though no character higher than D appears. (There are 7 instances of 'S', an anomaly I've put down to human error confusing the number 5 with the letter 'S'.) The character distribution is most interesting: it approximates an exponential decay, with 0 occurring in 34% of the sample, 1 in 21%, 2 in 11% and so on, until C appears in only 0.45% of the sample and D in a mere 0.05%.
My guess is that some batches (or runs) of units were bigger than others. Again, this makes sense since only after the product hits the market will the company get an accurate idea of the demand. Demand will dictate the size of the next factory run, and so on.
From the distribution of the 6th character - the most-significant "digit" in the 3-digit base-34 counter - it's possible to calculate an "average" batch size of approximately 3,900 units.
To recap: for the challenge we're considering serial numbers 6F807NNNXXX to 6F814NNNXXX, where N represents a base-34 digit and we're probably not interested in the Xs.
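The field layout I've guessed at can be written down as a small parser. This is a sketch of my assumed structure only, not anything Apple has published: the names `SerialParts`, `split_serial` and `in_recall_range` are hypothetical, and treating the batch field as a plain decimal number is an assumption.

```python
from typing import NamedTuple

class SerialParts(NamedTuple):
    product: str  # characters 1-2, always "6F" in the sample
    batch: str    # characters 3-5, presumed batch number
    counter: str  # characters 6-8, presumed base-34 counter (the Ns)
    tail: str     # characters 9-11, purpose unknown, possibly a checksum (the Xs)

def split_serial(serial):
    """Split an 11-character serial number into the assumed fields."""
    serial = serial.strip().upper()
    if len(serial) != 11:
        raise ValueError("expected an 11-character serial number")
    return SerialParts(serial[0:2], serial[2:5], serial[5:8], serial[8:11])

def in_recall_range(serial, first=807, last=814):
    """True if the serial falls between 6F807... and 6F814... inclusive."""
    parts = split_serial(serial)
    return (parts.product == "6F"
            and parts.batch.isdigit()
            and first <= int(parts.batch) <= last)

# Made-up serial numbers for illustration only.
print(split_serial("6F80712BXYZ"))
print(in_recall_range("6F80712BXYZ"))  # True
print(in_recall_range("6F81500AXYZ"))  # False
```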
Because we appear to be dealing with only 8 batches (807 to 814 inclusive), applying the average batch size of 3,900 leaves a reasonably high degree of uncertainty: it's possible, or indeed likely, that batch sizes follow a longer-term trend. Still, eight batches at 3,900 units apiece gives roughly 31,000 units.
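The arithmetic behind the headline figure, spelled out. The variable names are mine; the batch range and the 3,900 average come from the reasoning above.

```python
first_batch, last_batch = 807, 814
average_batch_size = 3900  # my estimate from the 6th-character distribution

batch_count = last_batch - first_batch + 1   # both ends inclusive
estimate = batch_count * average_batch_size

print(batch_count)          # 8
print(estimate)             # 31200
print(round(estimate, -3))  # 31000
```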
I did contemplate going back through the data set and looking specifically at the distribution for the batches we're interested in...
But there's a rumour going round the office that I've got a day job to do as well!