In preparation for the next meeting, where we would like to discuss SAM operations, development and deployment aspects, and collaboration between SUM developers and developers of the VO tests, David prepared a document.

Here we collect feedback on this document from SUM developers and developers of the VO tests.

The main questions we might need to address during the discussion between the SAM team, SUM developers and the people in charge of VO-specific Nagios tests are:

  • Machines: how many sets of machines we need, the usage/purpose of every set, who is in charge of every set, who can do what and how, and how we deploy them
  • Test suites for SAM tests: what the SUM team can provide in order to partially automate validation
  • The validation process itself: how we agree on when the two weeks suggested by the SAM team start (people can be busy with other things and cannot start validation immediately)
  • How the upgrade to new releases is agreed with the experiments

Comments from Pablo:

First point in the 'Probes development'

We are going in the direction of having five sets of machines (production, pre-production, nightly build, SAM development and experiment development). If we have to install the machines by hand, it will be a lot of work. So the suggestion is to use Quattor templates, and they have to be kept up to date.

Staged rollout

The deployments on pre-production should also be announced to the experiments. The start of the validation process should be agreed with the sam-hep-vo-contacts.

It is not clear whether the five days of 'availability and topology comparison' are part of the validation process or not. It might be good to divide the validation process into two parts: a first part in which the SAM team does its own testing and, once that has been successful, a second part in which SUM and VO representatives carry out the rest of the validation.

It would be nice to set a maximum delay between the SNOW request and the installation on the nodes (something like 'less than 8 working hours').

Questions

o) The staged rollout does not include anything about validation by the experiments. Should it be there?

o) Do all of the SAM components have to be deployed at the same time? If, instead of deploying the MRS, POEM and APT at the same time, they could be deployed one after the other, it would make all the comparisons much easier.

o) In the scenario described in the document, preproduction is not that important. Is it really needed?

Comments from Jarka

Jarka agreed to attend the SAM standup meetings, at most once a week, when issues that can have an impact on SUM are to be discussed.

Jarka has scripts which she uses to compare topologies, availabilities etc. She will provide them to the SAM team, who can integrate them in their tests.

Topology comparison

Jarka strongly disagrees that a 10% difference in the topology should not be considered a showstopper, given that site availability has a strong impact on site funding.
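To make the discussion of a percentage tolerance concrete, here is a minimal sketch of one possible way to quantify a topology difference between PS and PPS. The document under review does not say how the 10% is computed, so the definition used here (entries present in only one topology, divided by all entries seen in either) is an assumption, as are the function name and the (site, service) representation. It is not taken from Jarka's scripts.

```python
def topology_difference(ps_entries, pps_entries):
    """Return the fraction of topology entries that differ between PS and PPS.

    Assumption: a topology is a collection of hashable entries, for example
    (site, service) tuples, and "difference" means entries found in only one
    of the two topologies, relative to all entries found in either.
    """
    ps_set = set(ps_entries)
    pps_set = set(pps_entries)
    union = ps_set | pps_set
    if not union:
        # Two empty topologies are treated as identical.
        return 0.0
    return len(ps_set ^ pps_set) / len(union)


# Hypothetical example data, for illustration only.
ps = {("SITE_A", "CE"), ("SITE_A", "SE"), ("SITE_B", "CE")}
pps = {("SITE_A", "CE"), ("SITE_B", "CE"), ("SITE_B", "SE")}
fraction = topology_difference(ps, pps)
```

With a measure like this, the threshold under discussion becomes a single number to agree on, whether that is 10% or zero.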

Availability comparison

The profiles with top priority for the consistency check should be ALICE_CRITICAL, ATLAS_CRITICAL, CMS_CRITICAL_FULL and LHCb_CRITICAL. However, the remaining profiles should not be forgotten: they should also be checked on a regular basis, at least once per month for every profile. Like any other check in the past, this could help to uncover bugs in the SAM framework (not to mention that the experiments are not very happy when a bug is addressed on a timescale of months).

Deployment

Two days in advance should be fine, but it depends on the duration of the SAM downtime for the experiment. The deployment should not only be announced, it should be agreed in advance with the experiments, with each of them separately, in the same way as the EOS/CASTOR interventions are done. Jarka would strongly suggest agreeing with the experiments on the date, time and duration of the intervention before the WLCG Daily Ops announcement is made.

Comments from Andrea

Probes development

Having to install an instance for probe development seems a complete waste of time. Testing probes in the PPS has worked very well so far: it never created any problem, and it makes the PPS useful for something. The only situation when this can be a problem is when the PPS needs to be compared to the PS during the validation of a SAM update. But this can easily be solved by making sure that the probes are synchronized between PPS and PS before the validation starts. I am against prescriptions that sound good on paper but create problems without solving any in practice.

I do not understand why I should use a tool named like a Go Nagai character to build the CMS probe rpm. I have a script that works nicely and I see no need to throw it away. I need to be convinced that there is an advantage in using Koji.

Restricting access to PPS and PS is going to severely limit my ability to troubleshoot problems. For example, I sometimes use it to analyse test outputs from the job output sandboxes. Given that the SAM team cannot provide the effort needed to replace me, and that in all the history of the SAM deployment my having access to the services never created any problem, I strongly advise against this decision; otherwise, take it and accept that it will result in a worse quality of service.

Staged rollout

I agree with Jarka that topology differences are very serious and a 10% tolerance is too high. A topology difference between PPS and PS means that either a bug was introduced, or a bug was fixed. In the latter case, the existence of a difference is not a bad thing, but in the former 10% is really too much. I would rather propose to sample a smaller set of sites (let's say, 20 sites chosen randomly per VO) to reduce the effort but for these sites there should be no difference in the topology.
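The sampling proposal can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout (VO to list of sites, site to set of services) and the use of a seed for reproducibility are all hypothetical and not part of any existing SAM or SUM tool.

```python
import random


def sample_sites(sites_by_vo, per_vo=20, seed=None):
    """Pick up to `per_vo` sites at random for each VO.

    `sites_by_vo` maps a VO name to an iterable of site names (assumed layout).
    A seed can be given so that the same sample can be reproduced later.
    """
    rng = random.Random(seed)
    sample = {}
    for vo, sites in sites_by_vo.items():
        ordered = sorted(sites)
        count = min(per_vo, len(ordered))
        sample[vo] = sorted(rng.sample(ordered, count))
    return sample


def differing_sites(ps_topology, pps_topology, sampled_sites):
    """Return the sampled sites whose service sets differ between PS and PPS.

    Both topologies map a site name to a set of services (assumed layout).
    A site missing from one topology is compared as an empty set.
    """
    return [
        site
        for site in sampled_sites
        if ps_topology.get(site, set()) != pps_topology.get(site, set())
    ]
```

Under this proposal the acceptance criterion would be that `differing_sites` returns an empty list for every VO's sample, rather than that an overall percentage stays below a tolerance.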

For CMS the most significant profile is CMS_CRITICAL_FULL.

About the priority level of the tickets, it is not OK if the SAM team defines it not in agreement with the LHC VO experts and the SAM developers. Otherwise, the SAM team will always have the power to force an update on the PS without the approval of the LHC VOs. Given that SAM is part of their computing operations, this might have serious repercussions. Besides, the sentence at the end regarding possible objections from the experiments seems to contradict the description of the SR process in this respect.

-- JuliaAndreeva - 19-Oct-2012

Topic revision: r4 - 2012-10-22 - AndreaSciaba
 