LCG-2_6_0 Post Mortem
The tests by the ROCs have been very valuable as the first tests by
users. They have to be part of the release process from now on.
The 3 ROCs needed 5 working days to deploy and give feedback. We need
at least one additional week to implement all fixes and
clean up the release. The final packaging to get the software into a
shape in which it can be given to the ROCs for pre-release tests takes
about a week. As a consequence we have to stop integration and
initial testing 3 weeks before the release date.
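The schedule arithmetic above can be sketched as follows. This is only
an illustration: it assumes each of the three phases is rounded to one
calendar week (the fixing phase is "at least" one week, so the result
is the latest possible freeze date). The function name and the example
release date are hypothetical.

```python
from datetime import date, timedelta

# Phase lengths in weeks, taken from the estimates above.
PACKAGING_WEEKS = 1     # final packaging for the ROC pre-release tests
ROC_TESTING_WEEKS = 1   # 5 working days to deploy and give feedback
FIXING_WEEKS = 1        # at least one week for fixes and cleanup


def integration_freeze(release_date: date) -> date:
    """Return the latest date to stop integration and initial testing."""
    total_weeks = PACKAGING_WEEKS + ROC_TESTING_WEEKS + FIXING_WEEKS
    return release_date - timedelta(weeks=total_weeks)


# Hypothetical release date, for illustration only.
print(integration_freeze(date(2005, 10, 3)))  # 2005-09-12
```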
Problems that have been identified with the 2.6 release:
1) The local deployment tests had not been finished before the
release candidate was sent to the "3 ROCs" for testing.
The result was duplication of work and a potential loss of
confidence caused by trivial problems surfacing.
2) The current test suites (Gilbert's stress tests and Piotr's SFT)
are not following the evolution of new functions closely enough.
a) Gilbert can't know about changes, and in addition he is
only working on the maintenance of the tests on a "voluntary" basis.
b) Many of the tests still assume that the RLS is being used.
3) We have been bitten again by junk (partial) data in the
information system.
4) It is very hard to close the door to changes, new components, and
fixes in time to get the release integrated and tested.
The reason for this is the long interval
between releases and the lack of an easy way to provide updates.
5) In addition to an overhaul of the stress tests, we need to add
performance tests.
6) We released while we had a known bug on the RB which could block
the RBs.
Proposed solutions:
Well sort of.....
3) There is not much that we can do here: you can't cover the complete
junk space. Obvious errors are filtered by the
BDII.
6) We should define a set of core services on the RB,
BDII, CE .....
Before we release, all open bugs related to core services have
to be reviewed by another team member and the severity level has to
be adjusted.
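A release gate along these lines could be sketched as below. This is a
minimal sketch, not an existing tool: the Bug record, its field names,
and the blocking_bugs function are hypothetical, and the core set only
contains the three services named above, although that list is
open-ended.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed core set; the list in the text is explicitly incomplete.
CORE_SERVICES = {"RB", "BDII", "CE"}


@dataclass
class Bug:
    bug_id: int
    service: str
    assignee: str
    is_open: bool = True
    reviewer: Optional[str] = None  # team member who re-checked the severity


def blocking_bugs(bugs: List[Bug]) -> List[Bug]:
    """Return open core-service bugs not yet reviewed by another member."""
    return [
        b for b in bugs
        if b.is_open
        and b.service in CORE_SERVICES
        and (b.reviewer is None or b.reviewer == b.assignee)
    ]


# Hypothetical bug list, for illustration only.
bugs = [
    Bug(1, "RB", assignee="alice"),                    # not reviewed
    Bug(2, "BDII", assignee="bob", reviewer="carol"),  # reviewed by a peer
    Bug(3, "UI", assignee="dave"),                     # not a core service
]
print([b.bug_id for b in blocking_bugs(bugs)])  # [1]
```

The release would only go ahead once the returned list is empty.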
About all the rest
We need a test person who coordinates the test evolution and
maintains the tests (or individuals for certain areas).
The role of the SFT-2 in certification has to be understood.
It is clear that we wait until the tests have finished before we (pre)release.
The information about changes should be collected in the patch
submission step via Savannah.
We think that information like this should be provided whenever a
patch introduces changes:
cron-jobs
requirement for GSI infrastructure
host/service cert needs to be registered with
ports
log files
configuration parameters and their meaning
location of conf. files
Where state is kept (db, files)
uses information system for?
depends on other services
changes in usage
suggested tests
............ and ????
The list is certainly incomplete.
For patches that don't change the above list the developer can just
check a box, declaring that there are no changes.
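One way to picture such a submission record is sketched below. It is
purely illustrative: the class, its field names, and the consistency
rule are assumptions derived from the list above, not an existing
Savannah feature.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PatchChanges:
    # One free-text list per category from the list above.
    cron_jobs: List[str] = field(default_factory=list)
    gsi_requirements: List[str] = field(default_factory=list)
    cert_registration: List[str] = field(default_factory=list)
    ports: List[str] = field(default_factory=list)
    log_files: List[str] = field(default_factory=list)
    config_parameters: List[str] = field(default_factory=list)
    config_file_locations: List[str] = field(default_factory=list)
    state_locations: List[str] = field(default_factory=list)
    information_system_usage: List[str] = field(default_factory=list)
    service_dependencies: List[str] = field(default_factory=list)
    usage_changes: List[str] = field(default_factory=list)
    suggested_tests: List[str] = field(default_factory=list)
    no_changes: bool = False  # the "no changes" check box

    def described(self) -> List[str]:
        """Names of the categories for which changes were listed."""
        return [f.name for f in fields(self)
                if f.name != "no_changes" and getattr(self, f.name)]

    def is_consistent(self) -> bool:
        """True if either the box is checked or changes are listed, not both."""
        return self.no_changes != bool(self.described())


# Hypothetical submissions, for illustration only.
trivial = PatchChanges(no_changes=True)
with_port = PatchChanges(ports=["new listening port"])
print(trivial.is_consistent(), with_port.described())
```

An empty record with the box unchecked is rejected by is_consistent(),
so the developer has to make an explicit statement either way.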
Performance Tests
We'll start with the RB. Together with the EIS people (Andrea + ?) we
have to create a workload that reproduces the behavior that has been
observed during the DCs. We can then use this as a standardized
benchmark.
Other tests are probably needed for the data management.
We realized that the long release intervals are at the root of many
problems.
Plan: Release every 3 months as before, add releases for special
activities and provide updates for the current release.
Current definition for the term "middleware update":
A middleware update is a change of the software that can be applied
to the existing systems without changing the configuration.
This can be translated to: whenever you can get the job done with APT
alone, it is an update!
YAIM is not seen as being part of the middleware, and its bug fixes are
released as soon as available, independent of the need for a change.
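The definition above can be expressed as a small decision rule. This is
a sketch under assumptions: the Change record and the category labels
are hypothetical, and treating every change that needs reconfiguration
as material for the next release is our reading of the plan, not a
stated rule.

```python
from dataclasses import dataclass


@dataclass
class Change:
    component: str               # e.g. "RB" or "YAIM"
    needs_reconfiguration: bool  # True if existing systems must be reconfigured


def classify(change: Change) -> str:
    """Classify a change according to the update definition above."""
    if change.component == "YAIM":
        # YAIM is not part of the middleware; fixes go out when available.
        return "yaim-fix"
    if not change.needs_reconfiguration:
        # APT alone gets the job done, so this is an update.
        return "middleware-update"
    # Assumption: anything else waits for a regular or special release.
    return "next-release"


print(classify(Change("RB", needs_reconfiguration=False)))  # middleware-update
```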
The big question is who will be the test maintainer?