'Working' Software is an ambiguous term. A prototype may well 'work'. Technology-proving software by definition has to be 'working' in order to prove the technology, but it need not be complete, and it certainly need not be reliable.
However, The Management think of 'working software' as usable software that we can give to astronomers to play with. This page therefore outlines what is involved in delivering software that end-users can operate without giving up in disgust and never touching it again.
The original article was [here] but the site is unreliable. The discussion forum is/was [here]
The best explanation I've heard so far is that the "software behaviour causes the least amount of surprise". However, this isn't all that measurable without, say, an eyebrow monitor on every user.
Where we can expect surprises to occur, particularly alarms [where the system flags a problem to the user, such as data being temporarily offline or an illegal query syntax] and errors [where the system itself fails somewhere], the messages must be as clear as possible and must help the user get back to work with the minimum effort.
Finally, for a system to be usable, particularly by new users, it must be reasonably reliable.
Use Reliability essentially means that when the user submits a task, he can expect it to behave in the same way as the last time he submitted that task. When it is a new task, he expects it either to complete or to return with an 'alarm' indicating that the task was submitted incorrectly, with something useful about where the problem lies.
Component Reliability is sometimes measured as mean time between failures (MTBF) or mean operations between failures (MOBF). However, these are difficult to measure in this case because the operations tend to be different. A better measure might be availability, particularly mean time to recovery. MTBF must be long compared to the duration of atomic tasks; ie tasks that must be restarted from the beginning if they are interrupted.
System Reliability is the same idea applied to a collection of components. As Guy pointed out many moons ago, systems that depend on long chains of operations can be very unreliable, because the chance of success at each operation multiplies up and the chance of the whole chain completing shrinks with every link.
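To see how quickly a long chain degrades, here is a small worked sketch. The 99% per-operation success rate and the chain lengths are illustrative assumptions, not measurements of any real component, and the operations are assumed to fail independently of one another.

```python
import math


def chain_success_probability(step_probabilities):
    """Probability that every operation in a chain succeeds,
    assuming the operations fail independently of one another."""
    return math.prod(step_probabilities)


# Illustrative figures only: each operation succeeds 99% of the time.
for steps in (1, 10, 50):
    p = chain_success_probability([0.99] * steps)
    print(f"{steps:2d} operations: {p:.1%} chance the whole task completes")
```

With these assumed figures, a ten-step chain completes about 90% of the time and a fifty-step chain only about 60% of the time, even though every individual step looks very reliable.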
A stable system is one that doesn't change. Stable systems are often seen as reliable, because users get used to their 'quirks' and so work around them. Once they have done so, the tasks they submit include these workarounds, and so the tasks can be expected to complete. In contrast, even extremely reliable systems that are unstable (eg those with continuous upgrades) tend to be unreliable to a user, as quirks move around the system at each upgrade, interfering with the expected outcome of a task. New features that require even slight changes to tasks will also make the system appear unreliable to the user; a task that used to complete might suddenly fail because an interface has changed. We can upgrade systems and yet keep some stability by preserving interfaces.
In summary, stable systems can be reliable (but may not be), but in practice unstable systems will always be 'unreliable'. The VO will be unstable; there will always be services appearing and disappearing. To cope with this and provide some 'use reliability', it must be robust.
A system that is robust is capable of completing a task correctly even when components fail. Thus, if a registry is not available to resolve a resource, another registry is queried. If an application is unavailable, or overloaded, or reports an error (rather than an alarm...), the system submits the task to a different one.
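The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not AstroGrid code: `ServiceUnavailable`, `call_with_failover` and the two stand-in registries are hypothetical names. Only errors trigger a retry elsewhere; any other exception (for instance one representing an alarm about a badly formed task) propagates straight to the caller, since resubmitting the same faulty task to another service would not help.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error: the service is down, overloaded, or has failed."""


def call_with_failover(services, task):
    """Submit the task to each service in turn; return the first result."""
    failures = []
    for service in services:
        try:
            return service(task)
        except ServiceUnavailable as exc:
            failures.append(str(exc))  # remember why, then try the next one
    raise ServiceUnavailable("all services failed: " + "; ".join(failures))


# Stand-in services, for illustration only.
def broken_registry(resource):
    raise ServiceUnavailable("registry A is not responding")


def backup_registry(resource):
    return "resolved " + resource


print(call_with_failover([broken_registry, backup_registry], "some-resource"))
```

When every service fails, the collected reasons are reported together, so the user has something concrete to act on.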
When the task cannot be completed, then relevant, accurate and useful errors and alarms must be reported to the user, so that the user can make appropriate robustness decisions.
Similarly, where a task does not complete, if status and results-so-far have been recorded, the task does not have to be restarted from scratch.
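One way to record status and results-so-far is a checkpoint file that is rewritten after each completed step. The sketch below is an assumption about how this could look, not a description of how JES works: the function name, the JSON file format and the idea of named steps are all hypothetical, and step results must be JSON-serialisable.

```python
import json
import os


def run_with_checkpoint(steps, checkpoint_path):
    """Run (name, callable) steps in order, skipping any already recorded."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)  # results from an earlier, interrupted run
    for name, step in steps:
        if name in results:
            continue  # completed last time; do not redo it
        results[name] = step()
        # Write to a temporary file first so a crash cannot leave a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

If a step raises, the steps before it stay recorded, and calling the function again with the same checkpoint path resumes from the failed step.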
[Operation Recovery? failbacks, methods and requirements]
Availability is usually given as the percentage of time that the component is 'up'.
Another useful measure is 'Recovery time'; ie how long it takes from knowing that a component has become unavailable to making it available again. In some cases this might be a simple operation - eg restarting the application or tomcat - but in others it may involve recovering lost data, re-indexing, etc. For interactive tasks this must be low...
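Availability and recovery time are linked by the standard steady-state relation: availability = MTBF / (MTBF + mean time to recovery), taking MTBF here to mean the average up-time between failures. A small sketch, with made-up figures chosen only to show the effect of recovery time:

```python
def availability(mtbf, mttr):
    """Fraction of time a component is 'up'.

    mtbf: mean up-time between failures; mttr: mean time to recovery.
    Both must be in the same unit (hours here).
    """
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


# Illustrative figures only: the same failure rate, two recovery times.
print(f"restart in 1 hour:   {availability(100, 1):.1%}")
print(f"re-index, 10 hours:  {availability(100, 10):.1%}")
```

The same component, failing just as often, looks noticeably less available when recovery means re-indexing rather than a quick restart.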
Response time. As these are not safety- or mission-critical systems, most of our target response times are 'soft'; that is to say, it doesn't matter if a service sometimes fails to deliver within the time limit.
Monitoring programs, however, may have certain 'hard' limit requirements; if a service does not respond with status information within a certain time limit then the service will be seen to have failed, and an alarm generated.
Like most other optimisation issues, set performance requirements should only be assigned when we know where the bottlenecks are. Usability might require certain performance requirements for operations available from UIs, but these are also soft.
Where one service brings down other services. This may happen when:
1) A service depends on the failed service. For example, a service that is configured to work with only one registry fails when that registry fails.
2) The environment is broken; eg an Out Of Memory error from one webapp under Tomcat causes all the other web applications to become unavailable. Similarly for disk space, etc.
We can now configure registry resolvers to fall back on a second registry if the first cannot be found.
[Need to add more here; a single failback is not sufficient unless we can say that registries will be as reliable as Domain Name Servers. And as immovable; otherwise if a registry has to move, all the configurations everywhere will have to change too. We probably need some mini-self-configuration by clients that ask a registry for any other registries that it knows about, and keep them locally just in case.]
Also, though I know many will disagree with me here, to make the system fully robust, we should allow identifiers that do not need to be resolved with special components, thus allowing us to bypass broken components altogether if need be.
For now, where possible, we install one web application per Tomcat per Server. This will:
1) Help catch the fault.
2) Decouple failures.
Add round robin application selection from JES; not satisfactory but better than always picking the first.
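Round robin selection itself is simple; a minimal sketch, assuming the candidate applications are already known as a list of endpoints (the class name and the endpoint strings are hypothetical, and this is not how JES is implemented):

```python
import itertools


class RoundRobinSelector:
    """Hand out endpoints in turn instead of always picking the first."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("need at least one endpoint")
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)


# Placeholder endpoint names, for illustration only.
selector = RoundRobinSelector(["app-server-1", "app-server-2", "app-server-3"])
for _ in range(4):
    print(selector.next_endpoint())
```

This spreads load evenly but takes no account of which applications are overloaded or down, which is why it is only a stopgap.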
We could do with an application (webapp or swing) that polls the services found in the registry and reports whether they are active on a single screen. Perhaps a simpler similar web service could monitor the various tomcats and report via email when they no longer return their root page.
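The simpler of the two monitors could start from something like the sketch below, which fetches each root page and reports whether it answered. The function names and the five-second default are assumptions; emailing the report and reading the service list from the registry are left out. Note that the `timeout` passed to `urlopen` limits each blocking network operation rather than the total request time.

```python
import urllib.request


def fetch_status(url, timeout_seconds):
    """Return the HTTP status of `url`; raises OSError if unreachable or too slow."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.status


def poll_services(urls, timeout_seconds=5.0, fetch=fetch_status):
    """Map each URL to 'up' or to a 'down: reason' string."""
    report = {}
    for url in urls:
        try:
            status = fetch(url, timeout_seconds)
            report[url] = "up" if status == 200 else f"down: HTTP {status}"
        except OSError as exc:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            report[url] = f"down: {exc}"
    return report
```

The `fetch` parameter is there so that the polling logic can be exercised without a network, by passing in a stand-in function.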
Errors from JES need to be clearer. Can we make the messages more useful (many of these are context dependent), and present them without a stack trace (but with the stack trace available)? Finally, can we make use of standard HTTP error codes as well?
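As a rough illustration of separating the message from the stack trace, the sketch below keeps a short user-facing message and a standard HTTP status on the exception, and produces the trace as a separate field that a UI can hide until asked. `ServiceError` and `describe_failure` are hypothetical names, and the choice of 400 for a user mistake and 500 for an internal failure is an assumption about how alarms and errors might map onto HTTP codes.

```python
import traceback


class ServiceError(Exception):
    """Hypothetical error carrying a user-facing message and an HTTP status."""

    def __init__(self, user_message, http_status=500):
        super().__init__(user_message)
        self.user_message = user_message
        self.http_status = http_status


def describe_failure(exc):
    """Split a failure into a short summary plus an on-request stack trace."""
    if isinstance(exc, ServiceError):
        status, message = exc.http_status, exc.user_message
    else:
        status, message = 500, "Internal error while running the job"
    details = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return {"status": status, "message": message, "details": details}


try:
    raise ServiceError("Query syntax is invalid near 'SELCT'", http_status=400)
except ServiceError as exc:
    failure = describe_failure(exc)
    print(failure["status"], failure["message"])
```

The `details` field still holds the full trace for whoever has to debug the problem.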
We must ensure that once reliable components are installed, they are not disturbed. We can and should include late-test components in the VO, but we should ensure that these do not disturb existing access to existing components.
If reliable components are to be left in situ, then revised versions of those components must appear at different endpoints. This idea was accepted in principle at the cycle-1 planning-meetings. This means that for one registration of a science resource (e.g. a DB or a CEA-served application) we may have more than one registered service. We shall need a way for JES to distinguish these in the registry and to choose among them. In particular, we need a way to distinguish different versions of the interface contract for a resource.
The evolving contracts project now mandates a URN (in the urn:astrogrid:contract: space) for each contract version. I suggest that we find a way to include this in the ceaService schema.
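If each registered service carried its contract URN, choosing among several registrations of the same resource could be as simple as the filter sketched here. The record layout, the endpoints and the URN values after the `urn:astrogrid:contract:` prefix are invented for illustration; the real registry schema is not shown in this page.

```python
def select_endpoints(services, required_contract):
    """Return the endpoints registered against the required contract URN."""
    return [
        service["endpoint"]
        for service in services
        if service["contract"] == required_contract
    ]


# Hypothetical registrations: one resource served at two endpoints,
# each implementing a different version of the interface contract.
registered = [
    {"endpoint": "http://host.example/app-old",
     "contract": "urn:astrogrid:contract:example:1"},
    {"endpoint": "http://host.example/app-new",
     "contract": "urn:astrogrid:contract:example:2"},
]

print(select_endpoints(registered, "urn:astrogrid:contract:example:1"))
```

A client written against the older contract keeps finding the undisturbed endpoint, while newer clients can ask for the revised one.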
Ultimately, IVOA also needs to deal with this issue. The resource schemata for, e.g., SIAP need to support versioning of the resource.
-- MartinHill - 02 Mar 2005 -- GuyRixon - 03 Mar 2005