Main questions we might need to address during the discussion between the SAM team, the SUM developers and the people in charge of VO-specific Nagios tests:
- Machines: how many sets of machines we need, the usage/purpose of each set, who is in charge of each set, who can do what and how, and how we deploy them
- Test suites for SAM tests: what the SUM team can provide in order to partially automate validation
- The validation process itself: how we agree when the two weeks suggested by the SAM team start (people can be busy with other things and cannot start validation immediately)
- How the upgrade to new releases is agreed with the experiments
Comments from Pablo:
First point in 'Probes development':
We are going in the direction of having five sets of machines (production, pre-production, nightly build,
SAM development and experiment development). If we have to install the machines by hand, it will be a lot of work.
So the suggestion is to use Quattor templates, and they have to be kept up-to-date.
Staged rollout
The deployments on pre-production should also be announced to the experiments.
The start of the validation process should be agreed with the sam-hep-vo-contacts.
It is not clear whether the five days of 'availability and topology comparison' are part of the validation process or not. It might be good to divide the validation process in two parts: a first part in which the SAM team does its own testing and, after that has been successful, a second part in which the SUM and VO representatives carry out the rest of the validation.
It would be nice to set a maximum delay between the SNOW request and the installation on the nodes (something like 'less than 8 working hours').
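To make the proposed limit concrete, here is a minimal Python sketch of how the delay could be measured in working hours. The 09:00-17:00, Monday-to-Friday working day and the example timestamps are assumptions; only the 'less than 8 working hours' figure comes from the comment above.
<verbatim>
from datetime import datetime, timedelta

WORKDAY_START, WORKDAY_END = 9, 17  # assumed 09:00-17:00 working day

def working_hours_between(start: datetime, end: datetime) -> float:
    """Count working hours (Mon-Fri, 09:00-17:00) between two timestamps."""
    total = 0.0
    cursor = start
    while cursor < end:
        if cursor.weekday() < 5:  # Monday..Friday
            day_open = cursor.replace(hour=WORKDAY_START, minute=0, second=0, microsecond=0)
            day_close = cursor.replace(hour=WORKDAY_END, minute=0, second=0, microsecond=0)
            lo = max(cursor, day_open)
            hi = min(end, day_close)
            if hi > lo:
                total += (hi - lo).total_seconds() / 3600.0
        # jump to the start of the next calendar day
        cursor = (cursor + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return total

# Example: SNOW request on Friday 15:00, installation on Monday 11:00 -> 4 working hours
requested = datetime(2012, 10, 26, 15, 0)
installed = datetime(2012, 10, 29, 11, 0)
delay = working_hours_between(requested, installed)
print(f"{delay:.1f} working hours", "OK" if delay <= 8 else "SLA exceeded")
</verbatim>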
Questions
o) The staged rollout does not include anything about validation by the
experiments. Should it be there?
o) Do all of the SAM components have to be deployed at the same time? If, instead of deploying MRS, POEM and APT at the same time, they could be deployed one after the other, it would make all the comparisons much easier.
o) In the scenario described in the document, preproduction is not that
important. Is it really needed?
Comments from Jarka
Jarka agreed to attend the SAM standup meetings, at most once a week, when issues which can have an impact on SUM could be discussed.
Jarka has scripts which she uses to compare topologies, availabilities, etc. She will provide them to the SAM team, who can integrate them into their tests.
Topology comparison
Jarka strongly disagrees with the proposal that a 10% difference in the topology should not be considered a showstopper, taking into account the fact that site availability has a strong impact on site funding.
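As a rough illustration of the kind of topology check being discussed (not the actual SAM or SUM tooling), a minimal Python sketch follows. The flat-file format with one 'SITE SERVICE FLAVOUR' entry per line and the file names are assumptions; the 10% figure is the tolerance quoted in the document.
<verbatim>
def load_topology(path):
    """Read one topology entry per line, e.g. 'SITE SERVICE FLAVOUR'."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def topology_difference(prod_file, preprod_file):
    """Return entries that differ between production and pre-production,
    together with the relative size of the difference."""
    prod = load_topology(prod_file)
    preprod = load_topology(preprod_file)
    only_prod = prod - preprod
    only_preprod = preprod - prod
    rel_diff = (len(only_prod) + len(only_preprod)) / max(len(prod), 1)
    return only_prod, only_preprod, rel_diff

if __name__ == "__main__":
    only_ps, only_pps, rel = topology_difference("topology_ps.txt", "topology_pps.txt")
    print(f"entries only in PS:  {len(only_ps)}")
    print(f"entries only in PPS: {len(only_pps)}")
    print(f"relative difference: {rel:.1%}")
    # 10% is the tolerance quoted in the document; Jarka argues it is too high.
    if rel > 0.10:
        print("difference above the 10% tolerance -> showstopper candidate")
</verbatim>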
Availability comparison
The specific profiles with top-priority consistency checks should be ALICE_CRITICAL, ATLAS_CRITICAL, CMS_CRITICAL_FULL and LHCb_CRITICAL. However, the remaining profiles should not be forgotten and should be checked too, on a regular basis, at least once per month for every profile. As with other checks in the past, this could help to discover bugs in the SAM framework (not to mention that the experiments are not very happy if a bug is addressed on a timescale of months).
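A minimal sketch of how such a per-profile availability comparison could look in Python. The CSV format (profile, site, availability), the file names and the 0.5 percentage-point tolerance are assumptions; only the profile names come from the text above.
<verbatim>
import csv

PROFILES = ["ALICE_CRITICAL", "ATLAS_CRITICAL", "CMS_CRITICAL_FULL", "LHCb_CRITICAL"]

def load_availabilities(path):
    """Read rows like 'profile,site,availability' (0-100, no header assumed)."""
    values = {}
    with open(path) as f:
        for row in csv.reader(f):
            profile, site, availability = row[0], row[1], float(row[2])
            values[(profile, site)] = availability
    return values

def compare(prod_file, preprod_file, tolerance=0.5):
    """Report (profile, site) pairs whose availability differs by more than
    `tolerance` percentage points between the two instances."""
    prod = load_availabilities(prod_file)
    preprod = load_availabilities(preprod_file)
    for profile, site in sorted(set(prod) & set(preprod)):
        if profile not in PROFILES:
            continue  # top-priority profiles first; the others at least monthly
        delta = abs(prod[(profile, site)] - preprod[(profile, site)])
        if delta > tolerance:
            print(f"{profile:20s} {site:20s} "
                  f"PS={prod[(profile, site)]:.1f} PPS={preprod[(profile, site)]:.1f} diff={delta:.1f}")

if __name__ == "__main__":
    compare("availability_ps.csv", "availability_pps.csv")
</verbatim>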
Deployment
Two days in advance should be fine, but it depends on the SAM downtime duration for the experiment. The deployment should not only be announced, it should be agreed in advance with the experiments, with each of them separately, in the same way as the EOS/CASTOR interventions are done.
Jarka strongly suggests agreeing with the experiments on the date, time and duration of the intervention before the WLCG Daily Ops announcement is made.
Comments from Andrea
Probes development
Having to install an instance for probe development seems a complete waste of time. Testing probes in the PPS has worked very well so far, it has never created any problem and it makes the PPS useful for something. The only situation in which this can be a problem is when the PPS needs to be compared to the PS during the validation of a SAM update. But this can easily be solved by making sure that, before the validation starts, the probes are synchronized between PPS and PS.
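A minimal sketch of how such a PPS/PS probe synchronization check could be done, assuming passwordless ssh to both hosts. The host names and the rpm name pattern are placeholders, not the actual SAM package names.
<verbatim>
import subprocess

def installed_probes(host, pattern="grid-monitoring-probes-*"):
    """List probe rpm NVRs installed on `host` (needs passwordless ssh)."""
    out = subprocess.run(
        ["ssh", host, f"rpm -qa '{pattern}'"],
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())

def report_probe_differences(ps_host, pps_host):
    ps = installed_probes(ps_host)
    pps = installed_probes(pps_host)
    for rpm in sorted(ps - pps):
        print(f"only on PS : {rpm}")
    for rpm in sorted(pps - ps):
        print(f"only on PPS: {rpm}")
    if ps == pps:
        print("probe rpms are synchronized")

if __name__ == "__main__":
    # host names are placeholders
    report_probe_differences("sam-ps.example.cern.ch", "sam-pps.example.cern.ch")
</verbatim>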
I am against prescriptions that sound good on paper but create problems without solving any in practice.
I do not understand why I should use a tool named after a Go Nagai character to build the CMS probe rpm. I have a script that works nicely and I see no need to throw it away. I need to be convinced that there is an advantage in using this Koji.
Restricting access to PPS and PS is going to severely limit my ability to troubleshoot problems. For example, I sometimes use that access to analyse test outputs from the job output sandboxes. Given that the SAM team cannot provide the effort needed to replace me, and that in all the history of the SAM deployment my having access to the services has never created any problem, I strongly advise against this decision; otherwise, take it and accept that a worse quality of service will be provided.
I would rather suggest a more reasonable compromise: restrict root access on the PS service but allow access via a non-privileged account that would let me at least read the Nagios configuration and temporary files, and leave things as they are today for the PPS.
Staged rollout
I agree with Jarka that topology differences are very serious and a 10% tolerance is too high. A topology difference between PPS and PS means that either a bug was introduced or a bug was fixed. In the latter case, the existence of a difference is not a bad thing, but in the former case 10% is really too much. I would rather propose to sample a smaller set of sites (let's say, 20 sites chosen randomly per VO) to reduce the effort, but for these sites there should be no difference in the topology.
- Maarten: those sites should not be random, but rather the 20 most important sites per experiment.
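A minimal sketch of this stricter per-site check. The topology file format is the same assumed flat format as above, and the site names stand in for the 20 most important sites per experiment.
<verbatim>
def load_topology_by_site(path):
    """Read 'SITE SERVICE FLAVOUR' entries, grouped by site."""
    by_site = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            site = line.split()[0]
            by_site.setdefault(site, set()).add(line.strip())
    return by_site

def check_top_sites(ps_file, pps_file, top_sites):
    """For the selected sites, any topology difference is a showstopper."""
    ps = load_topology_by_site(ps_file)
    pps = load_topology_by_site(pps_file)
    ok = True
    for site in top_sites:
        if ps.get(site, set()) != pps.get(site, set()):
            ok = False
            print(f"topology mismatch for {site}")
    print("all selected sites match" if ok else "showstopper: differences found")

if __name__ == "__main__":
    # site names are placeholders for the 20 most important sites per experiment
    check_top_sites("topology_ps.txt", "topology_pps.txt",
                    ["CERN-PROD", "FZK-LCG2", "IN2P3-CC"])
</verbatim>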
For CMS the most significant profile is CMS_CRITICAL_FULL.
About the priority level of the tickets, it is not OK if the SAM team defines it without agreement from the LHC VO experts and the SAM developers. Otherwise, the SAM team will always have the power to force an update onto the PS without the approval of the LHC VOs. Given that SAM is part of their computing operations, this might have serious repercussions. Besides, the sentence at the end regarding possible objections from the experiments seems to contradict the description of the SR process in this respect.
A summary of the discussion can be found here.
--
JuliaAndreeva - 25-Oct-2012