Horizon scandal
I have just watched the ITV drama Mr Bates vs The Post Office. It is an excellent telling of the tragic persecution of sub-postmasters for the failings of the then-new Horizon platform, and of the doggedness with which the Post Office pursued them for perceived fraud rather than admit to failings in the Horizon system. It ends with the final triumph of the sub-postmasters, and the ongoing political fallout will hopefully result in all convictions being quashed and the affected sub-postmasters being rightly compensated.
Much has been said about the corporate failings of the Post Office and Fujitsu. My thoughts were drawn instead to how I would approach the issue from a technical point of view. As a technologist, how would I go about finding the issues within Horizon?
I have not been associated in any way with the Post Office or Fujitsu and have no insider knowledge of the Horizon platform, but I have worked fairly extensively on merchant acquiring systems at banks and other financial institutions in various locations around the world.
So, if I received customer complaints like those outlined in the TV drama, how would I break them down? What areas of the system would I investigate? Could I draw any conclusions?
The first and probably the most obvious conclusion is that these issues do not affect every transaction. Much was said about there being “no systemic” problem with the system. But even if 99.99% of the reported 6 million transactions a day were processed correctly, the remaining 0.01% would still leave 600 transactions a day affected. So, when these issues occurred, something had not gone right within the system. We would describe these scenarios as exception handling. The question now is what exceptions could have occurred to cause these issues. At this point, we need to go into a bit of detective mode.
In one scene the sub-postmistress Pam Stubbs describes how the power in her post office was not reliable. This points to possible connectivity issues between the systems in her branch and the central server. In Mrs. Stubbs's case I would be looking at how the branch system and the central system handle a transaction if the connection is lost while it is being processed. Is there a way for them to reconcile these transactions when communication is re-established?
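To make the reconciliation question concrete, here is a minimal sketch of the kind of check I would expect to run once the link comes back. It assumes each side keeps its own record of transactions keyed by a unique transaction ID; the `reconcile` function, the report keys and the sample data are all hypothetical and say nothing about how Horizon actually works.

```python
def reconcile(branch_journal: dict[str, int],
              central_ledger: dict[str, int]) -> dict[str, list[str]]:
    """Compare both sides by transaction ID after the connection is restored.

    Both arguments map a (hypothetical) transaction ID to an amount in pence.
    """
    report: dict[str, list[str]] = {
        "missing_at_centre": [],
        "missing_at_branch": [],
        "amount_mismatch": [],
    }
    for txn_id, amount in branch_journal.items():
        if txn_id not in central_ledger:
            # The branch recorded it, but it never reached the centre.
            report["missing_at_centre"].append(txn_id)
        elif central_ledger[txn_id] != amount:
            # Both sides know the transaction but disagree on the amount.
            report["amount_mismatch"].append(txn_id)
    for txn_id in central_ledger:
        if txn_id not in branch_journal:
            # The centre holds a transaction the branch has no record of.
            report["missing_at_branch"].append(txn_id)
    return report


# Illustrative data only.
branch = {"t1": 500, "t2": 1200, "t3": 300}
centre = {"t1": 500, "t3": 350, "t4": 900}
report = reconcile(branch, centre)
```

The point of the sketch is that every discrepancy is surfaced for investigation rather than silently resolved in favour of one side.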
Jo Hamilton was depicted talking to a help desk when suddenly her loss doubled. It is difficult to know without more information, but quite often a doubling of an amount is the result of a duplicate transaction not being handled properly. Duplicates can arise when a request message is retried because the client process has not received a response in time, or when the user presses the enter key twice. The property that lets a system process the same request more than once without changing the outcome is known as idempotency. The time of day may also indicate that Mrs. Hamilton's system and the Post Office central system were trying to reconcile, which could have caused the issue.
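As an illustration, one common way of guarding against duplicates is for the client to attach a unique key to each request, so that a retry can be recognised and answered from the stored result. The sketch below is only that, a sketch: the `Ledger` class, the key format and the amounts are my own hypothetical examples, not anything taken from Horizon.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical central ledger that applies each request at most once."""
    balance_pence: int = 0
    # Maps an idempotency key to the result already returned for that key.
    processed: dict[str, int] = field(default_factory=dict)

    def post(self, idempotency_key: str, amount_pence: int) -> int:
        # A retry carries the same key, so return the stored result
        # instead of applying the amount a second time.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.balance_pence += amount_pence
        self.processed[idempotency_key] = self.balance_pence
        return self.balance_pence


ledger = Ledger()
first = ledger.post("branch-001-txn-42", 2_000)
# The client timed out waiting for a response and sent the same request again.
retry = ledger.post("branch-001-txn-42", 2_000)
```

Without the key check, the retry would have been applied as a second transaction and the amount would have doubled.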
Finally, multiple characters talked about issues when a second PIN pad was added. Again, without knowing the details of the system it is very difficult to know, but my lines of investigation would be around what happens when both terminals are operating at the same time. Does one terminal lock the branch account system for the duration of its transaction, causing the second terminal's transaction to be missed? Could a response intended for one terminal be sent to the other?
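A minimal sketch of the behaviour I would be testing for is below, assuming a shared branch account protected by a lock. The second terminal waits for the lock rather than having its transaction dropped, and every response carries the ID of the terminal it belongs to. The `BranchAccount` class, terminal IDs and amounts are hypothetical.

```python
import threading


class BranchAccount:
    """Hypothetical shared branch account used by several terminals."""

    def __init__(self) -> None:
        self.balance_pence = 0
        self.applied: list[tuple[str, int]] = []
        self._lock = threading.Lock()

    def post(self, terminal_id: str, amount_pence: int) -> dict:
        # Blocking acquire: a busy account makes the other terminal wait
        # its turn; its transaction is never skipped.
        with self._lock:
            self.balance_pence += amount_pence
            self.applied.append((terminal_id, amount_pence))
        # The response names its terminal so it cannot be misrouted.
        return {"terminal_id": terminal_id, "amount_pence": amount_pence}


def run_terminal(account: BranchAccount, terminal_id: str, responses: list) -> None:
    for _ in range(100):
        responses.append(account.post(terminal_id, 150))


account = BranchAccount()
responses_a: list = []
responses_b: list = []
threads = [
    threading.Thread(target=run_terminal, args=(account, "A", responses_a)),
    threading.Thread(target=run_terminal, args=(account, "B", responses_b)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever order the two terminals interleave in, all 200 transactions should be applied and each terminal should only ever see its own responses.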
So, for me, the underlying technical issue appears to be a lack of exception handling. I was taught that 20% of development effort should go into “happy path” processing and 80% into exception handling. There are arguments that this can be reduced with modern frameworks, with the magical figure of 80% or even 100% code coverage bandied around. I am a bit wary of this, as coverage only measures the code that has been written. Code that should have been written but was not, and the tests around it, are left out of the figure. Exception handling can quite often fall into that category, and so default processing is left in place, such as automatic retries which, as discussed, may well not be appropriate for payments systems.
I have heard leaders proudly state they have 100% code coverage. I always wonder what could have been omitted to achieve this result.
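A small, entirely hypothetical example of what I mean is below. The single test executes every line of `send_payment`, so a line-coverage tool would report 100%, yet nothing in the function deals with the central system failing to respond. The function and gateway names are my own inventions for illustration.

```python
def send_payment(gateway, amount_pence: int) -> str:
    """Happy path only: nothing here handles a timeout from the gateway."""
    response = gateway(amount_pence)
    return response["status"]


def test_send_payment() -> None:
    # Executes every line of send_payment, so line coverage would be 100%.
    assert send_payment(lambda amount: {"status": "ok"}, 500) == "ok"


def flaky_gateway(amount_pence: int) -> dict:
    # Stands in for a central system that does not answer in time.
    raise TimeoutError("no response from central system")


test_send_payment()

# The exception path was never written, so it was never counted or tested.
try:
    send_payment(flaky_gateway, 500)
    outcome = "handled"
except TimeoutError:
    outcome = "unhandled timeout"
```

The coverage figure is accurate for the code that exists; it simply has nothing to say about the timeout handling that does not.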
We are all humans who hopefully take pride in our work, and we tend to defend the systems that we have developed, sometimes over years. But we must not be afraid to reconsider our past work based on the information now at hand.
The Horizon scandal now stands as the biggest miscarriage of justice in British legal history, with thousands of people's lives ruined and costs in legal fees and compensation running into billions. The sad fact is this could have been so easily avoided if:
- People with the appropriate experience and skills had designed and built the platform
- Exception handling had been ingrained in the system
- Issues raised by users had been taken seriously and corrected
- An in-house capability to technically assess outsourced systems had been maintained
Again, I do not have any knowledge of the Horizon platform outside of what was presented in the ITV drama, but I thought it would be good for us as IT professionals and leaders to reflect on the scandal from a technical point of view rather than a legal standpoint.
I would be really interested to know your thoughts on this.