In preparation for the next meeting, where we would like to discuss SAM operations, development and deployment aspects, and the collaboration between the SUM developers and the developers of the VO tests, David prepared a document.

Here we collect feedback on this document from the SUM developers and the developers of the VO tests.

Summary of comments after the discussion between the SUM developers and the developers of the VO probes

1. Development

● Better integration of SUM in the SAM software cycle
  ○ Regular attendance of the SUM developer at the SAM Developers Standup meetings (twice a week, 30 mins max.)
  ○ Improve understanding of the content of future SAM releases and start integration of SUM at an earlier stage. SUM status update during meetings.

Agreed, with a small modification:

The SUM developer will attend the SAM Developers Standup at most once per week.

  ○ A test suite for SUM validation is welcome for integration in the SAM nightly validation.

The SUM team (Jarka) has developed sanity-check SAM API scripts which will be provided to the SAM team so that they can integrate them in the SAM nightly validation.
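For illustration only, a minimal sketch of the kind of sanity check such scripts could perform, assuming a hypothetical JSON endpoint; the URL and the Nagios-style exit codes are placeholders, not the actual SAM API or Jarka's scripts:

<verbatim>
#!/usr/bin/env python3
# Sketch of a sanity check for a SAM API endpoint: verify that the service
# answers with HTTP 200 and returns non-empty, parseable JSON.
# The URL below is a placeholder, not a real SAM API path.

import json
import sys
import urllib.request

URL = "https://sam-instance.example.org/api/latestresults?vo=cms"  # placeholder

def check(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            if response.status != 200:
                print("CRITICAL: unexpected HTTP status %d" % response.status)
                return 2
            body = response.read()
    except Exception as exc:
        print("CRITICAL: cannot reach %s: %s" % (url, exc))
        return 2
    try:
        data = json.loads(body)
    except ValueError:
        print("CRITICAL: response is not valid JSON")
        return 2
    if not data:
        print("WARNING: empty result set")
        return 1
    print("OK: %d records returned" % len(data))
    return 0

if __name__ == "__main__":
    sys.exit(check(URL))
</verbatim>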

● Probes development

○ Must happen in a local development instance. For development, the SAM team uses CVI VMs with the following configuration: SL5 x86_64 CVIVM image, 2GB RAM, 1 CPU, 40GB disk. On top of that, the latest released SAM Update is deployed following the Installation Guide. Probe developers must configure a VO Nagios instance.

The developers of the experiment-specific probes have a different opinion on this point:

The developers of the experiment-specific probes require development Nagios instances (one per experiment) to which they have root access. They expect these instances to be provided by the SAM team, with the latest SAM release deployed and upgraded by the SAM team as new versions are released.

○ Probes packages must contain:
  ■ the probes' implementation
  ■ the definition of package dependencies
  ■ the metrics configuration files:
    1. /etc/ncg-metric-config.d/.conf (configuration file for the metrics in the package)
    2. /etc/ncg/ncg-localdb.d/_param_override.conf (config file overriding the configuration of generic metrics like org.sam*)
○ Probes teams build packages using Koji. These can be taken and deployed in the PPS and Prod VO SAM-Nagios by the SAM team at any point in time (when requested by the developers through SNOW).

No objections, apart from the tool to be used for building the RPMs:

The developers of the experiment-specific probes are responsible for providing RPMs, but it is up to them to decide which tool to use for building them.
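As an illustration of the package contents listed above, a minimal sketch (independent of whichever tool builds the RPM) that checks a built probes RPM ships metric configuration under the expected directories; the RPM path is passed on the command line, and everything beyond the directory names above is an assumption:

<verbatim>
#!/usr/bin/env python3
# Sketch: verify that a built probes RPM contains metric configuration files
# under the expected directories before handing it over for deployment.

import subprocess
import sys

REQUIRED_PREFIXES = [
    "/etc/ncg-metric-config.d/",   # metric configuration for the package
    "/etc/ncg/ncg-localdb.d/",     # overrides of generic metrics (org.sam*)
]

def packaged_files(rpm_path):
    """Return the list of files packaged in the RPM (uses 'rpm -qpl')."""
    out = subprocess.check_output(["rpm", "-qpl", rpm_path])
    return out.decode().splitlines()

def main(rpm_path):
    files = packaged_files(rpm_path)
    missing = [p for p in REQUIRED_PREFIXES
               if not any(f.startswith(p) for f in files)]
    if missing:
        print("Missing metric configuration under:", ", ".join(missing))
        return 1
    print("Package layout looks fine (%d files)" % len(files))
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
</verbatim>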

○ Probes packages are expected to be validated by the probes’ developers in their development VO Nagios and in the PPS instance, once deployed.

Fine

○ Access to SAM VO Nagios PPS and Prod instances is restricted to SAM service managers only.

For various kinds of troubleshooting, in particular for debugging problems with the publishing of test results, non-privileged access to the SAM VO Nagios PPS and Prod instances is required. Troubleshooting of the VO SAM tests is currently performed by a dedicated team of experts in the ES group and is a high-priority task. If the SAM team insists that access to the SAM VO Nagios PPS and Prod instances is restricted to SAM service managers only, the SAM team should be prepared to take over responsibility for such troubleshooting and to address all issues in a timely manner, handling them as high-priority tasks.

2. Staged Rollout

● The SAM team deploys a pre-release in the pre-production SAM-Gridmon and VO SAM-Nagios.
● The SAM team informs EGI Staged Rollout and any other interested group (in particular ‘sam-hep-vo-contacts’) to start the official validation process.
● Check list during the SAM Staged Rollout:
  ○ Schema validation: test that the SAM APIs do not break. We expect that this has already been tested by the SUM team during the SAM nightly validation cycle (development phase). This is a second test, to be performed by the SUM team.
  ○ Functionality validation: check that the SUM functionality works (service and site statuses and availabilities are returned). To be performed by the SUM team.
  ○ Topology comparison: if a VO feed is the same in both Prod and PPS, topologies will be compared by the SAM team, which will provide justifications in case of differences. Tier0 and Tier1 sites should be identical, while for Tier2s differences should be less than 10% for release acceptance (obviously differences in topology will be investigated and bugs fixed, but those few do not block a release). To be performed by the SAM team.

The SUM developers and the ES experts in charge of the VO-specific SAM tests consider a 10% topology difference not acceptable. Since the topology is provided by sources external to SAM (VO feeds and GOCDB/OIM), in principle it should not depend on SAM releases. Any inconsistency might be caused by a bug (introduced or fixed), and therefore should be investigated, understood and fixed if required.
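For reference, a minimal sketch of the kind of topology comparison discussed here, assuming the PPS and Prod topologies have been dumped to flat text files of tier/site pairs; the file format, tier labels and threshold are assumptions, not the SAM team's actual procedure:

<verbatim>
#!/usr/bin/env python3
# Sketch: compare the site topology seen by the Prod and PPS instances.
# Input: two text files with one "tier site" pair per line, e.g. "T2 T2_IT_Bari".
# Tier0/Tier1 must be identical; the Tier2 relative difference is reported
# against a configurable threshold.

import sys

def load(path):
    sites = {}
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 2:
                continue
            tier, site = parts
            sites.setdefault(tier, set()).add(site)
    return sites

def compare(prod_file, pps_file, t2_threshold=0.10):
    prod, pps = load(prod_file), load(pps_file)
    ok = True
    for tier in ("T0", "T1"):
        diff = prod.get(tier, set()) ^ pps.get(tier, set())
        if diff:
            ok = False
            print("%s mismatch (must be identical): %s" % (tier, sorted(diff)))
    prod_t2, pps_t2 = prod.get("T2", set()), pps.get("T2", set())
    diff_t2 = prod_t2 ^ pps_t2
    frac = len(diff_t2) / (len(prod_t2) or 1)
    print("T2 difference: %d sites (%.1f%%)" % (len(diff_t2), 100 * frac))
    if frac > t2_threshold:
        ok = False
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(compare(sys.argv[1], sys.argv[2]))
</verbatim>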

○ Availability comparison: over five working days, the SAM team will compare the daily availabilities for one specific profile per VO (to be decided; VO_CRITICAL_FULL?) and will investigate and provide justifications for sites with differences above 10%. To be performed by the SAM team.

In order to get statistically significant results, the availability consistency check between two SAM releases should be performed over at least a one-week interval, and the difference should not be above 5%.
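A minimal sketch of such a day-by-day availability comparison, assuming availabilities per site and per day have been exported as CSV from both instances; the CSV layout is an assumption, and the 5% threshold follows the comment above:

<verbatim>
#!/usr/bin/env python3
# Sketch: compare daily site availabilities between the Prod and PPS instances
# for one profile over a validation window, flagging site-days whose difference
# exceeds a threshold. The CSV layout (site,date,availability) is an assumption.

import csv
import sys

def load(path):
    data = {}
    with open(path) as fh:
        for row in csv.reader(fh):
            if len(row) != 3:
                continue
            site, day, value = row
            data[(site, day)] = float(value)
    return data

def compare(prod_file, pps_file, threshold=0.05):
    prod, pps = load(prod_file), load(pps_file)
    flagged = []
    for key in sorted(set(prod) & set(pps)):
        delta = abs(prod[key] - pps[key])
        if delta > threshold:
            flagged.append((key, delta))
    for (site, day), delta in flagged:
        print("%s %s: difference %.1f%%" % (site, day, 100 * delta))
    print("%d site-days above the %.0f%% threshold" % (len(flagged), 100 * threshold))
    return 1 if flagged else 0

if __name__ == "__main__":
    sys.exit(compare(sys.argv[1], sys.argv[2]))
</verbatim>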

Validation of the new release performed by the SAM team, including the schema, topology and availability cross-checks, is not included in the two-week SR process. The SR process starts under the condition that the schema, topology and availability comparisons gave satisfactory results.

● SAM Staged Rollout (SR) process:
  ○ The SR process will last 2 weeks. During this time, tickets can be opened against the release candidate deployed in the pre-prod instances. For the teams at CERN, this should be through SNOW.
  ○ For the opened tickets, the SAM team will define their priority level:
    ■ Blocker: Blocks development and/or testing work, pre-production cannot run.
    ■ Critical: Crashes, loss of data, severe memory leak.
    ■ Major: Major loss of function.
    ■ Minor: Minor loss of function, or other problem where an easy workaround is present.
    ■ Trivial: Cosmetic problem like misspelt words or misaligned text.
  ○ Blocker, Critical and Major tickets will be fixed by the SAM team before deploying a release in production.

The intended schedule of the SR process needs to be communicated to the SUM & probe developers approximately one month in advance so that they can adjust their plans accordingly. The SUM & probe developers should define the severity of the problem in the SNOW ticket, considering the impact on the experiments' operations. In cases where major problems are detected, under mutual agreement between the SAM team and the SUM & probe developers, the SR process can be extended beyond two weeks for careful validation of the fixed version.

3. Deployment

● Once a release has been validated by the SR process, the SAM team will announce its future deployment in production at least two working days in advance.
● Downtime will be scheduled in GOCDB and the experiments will be informed by the SAM team through the WLCG Daily OPS meeting.
● If no objections are received from the experiments, the SAM team will deploy the release in the production instances: SAM-Gridmon and VO SAM-Nagios.

We suggest following a scenario similar to the CASTOR, EOS and other service deployments:

The intended schedule for the deployment of the new SAM release is communicated via ‘sam-hep-vo-contacts’ or some other list created for this purpose which includes all interested parties. This schedule includes the estimated downtime of the service in case the intervention is not transparent. The schedule has to be agreed with the experiments within 3 working days, and the approval of the experiments is communicated via the same list. Two days before the upgrade, the announcement is sent to the list and made at the WLCG Daily OPS meeting. Naturally, the start and the end of the upgrade are also communicated through the same list.

The main questions which we might need to address during the discussion between the SAM team, the SUM developers and the people in charge of the VO-specific Nagios tests are:

  • Machines: how many sets of machines we need, the usage/purpose of every set, who is in charge of every set, who can do what and how, and how we deploy them
  • Test suites for the SAM tests: what the SUM team can provide in order to partially automate validation
  • The validation process itself: how we agree when the two weeks suggested by the SAM team start (people can be busy with other things and cannot start validation immediately)
  • How the upgrade to new releases is agreed with the experiments

Comments from Pablo:

First point in 'Probes development':

We are going in the direction of having 5 sets of machines (production, pre-production, nightly build, SAM development and experiment development). If we have to install the machines by hand, it will be a lot of work. The suggestion is therefore to use Quattor templates, which have to be kept up to date.

Staged rollout

The deployments on pre-production should also be announced to the experiments. The start of the validation process should be agreed with sam-hep-vo-contacts.

It is not clear whether the five days of 'availability and topology comparison' are part of the validation process or not. It might be good to divide the validation process into two parts: a first part in which the SAM team does its own testing, and, after that has been successful, a second part in which the SUM and VO representatives perform the rest of the validation.

It would be nice to define a maximum delay between the SNOW request and the installation on the nodes (something like 'less than 8 working hours').

Questions

o) The staged rollout does not include anything about validation by the experiments. Should it be there?

o) Do all of the SAM components have to be deployed at the same time? I mean, if, instead of deploying MRS, POEM and APT at the same time, they were deployed one after the other, it would make all the comparisons much easier.

o) In the scenario described in the document, preproduction is not that important. Is it really needed?

Comments from Jarka

Jarka agreed to attend the SAM standup meetings at most once a week, when issues which can have an impact on SUM can be discussed.

Jarka has scripts which she uses to compare topologies, availabilities, etc. She will provide them to the SAM team, who can integrate them into their tests.

Topology comparison

Jarka strongly disagrees with a 10% difference in the topology not being considered a showstopper, taking into account the fact that site availability has a strong impact on site funding.

Availability comparison

The specific profiles with top-priority consistency checks should be ALICE_CRITICAL, ATLAS_CRITICAL, CMS_CRITICAL_FULL and LHCb_CRITICAL. However, the remaining profiles should not be forgotten, and their check should also be performed on a regular basis, at least once per month for every profile. As with other checks in the past, this could help to discover bugs in the SAM framework (not to mention that the experiments are not very happy if a bug is addressed on a timescale of months).

Deployment

Two days in advance should be fine, but it depends on the SAM downtime duration for the experiment. The deployment should not only be announced, it should be agreed in advance with the experiments, with each of them separately, in the same way as the EOS/CASTOR interventions are done. Jarka would strongly suggest agreeing with the experiments on the date, time and duration of the intervention before the WLCG Daily Ops announcement is made.

Comments from Andrea

Probes development

Having to install an instance for probe development seems a complete waste of time. Testing probes in the PPS has worked very well so far; it has never created any problem and makes the PPS useful for something. The only situation in which this can be a problem is when the PPS needs to be compared to the PS during the validation of a SAM update. But this can easily be solved by making sure that, before the validation starts, the probes are synchronized between PPS and PS. I am against prescriptions that sound good on paper but create problems without solving any in practice.

I do not understand why I should use a tool named after a Go Nagai character to build the CMS probe RPM. I have a script that works nicely and I see no need to throw it away. I need to be convinced that there is an advantage in using Koji.

Restricting access to PPS and PS is going to severely limit my ability to troubleshoot problems. For example, I sometimes use it to analyse test outputs from the job output sandboxes. Given that the SAM team cannot provide the effort needed to replace me, and that in all the history of the SAM deployment my having access to the services has never created any problem, I strongly advise against this decision; otherwise, take it and accept providing a worse quality of service.

I rather suggest a more reasonable compromise consisting of restricting root access to the PS service while allowing access via a non-privileged account that would let me at least read the configuration and temporary files of Nagios, and leaving things as they are today for the PPS.

Staged rollout

I agree with Jarka that topology differences are very serious and a 10% tolerance is too high. A topology difference between PPS and PS means that either a bug was introduced or a bug was fixed. In the latter case, the existence of a difference is not a bad thing, but in the former case 10% is really too much. I would rather propose to sample a smaller set of sites (let's say, 20 sites chosen randomly per VO) to reduce the effort, but for these sites there should be no difference in the topology.

  • Maarten: those sites should not be random, but rather the 20 most important sites per experiment.
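A possible variant of the topology check along the lines of Andrea's and Maarten's suggestion: compare only a fixed list of key sites per experiment and require strict agreement. The site list and data structures below are placeholders:

<verbatim>
#!/usr/bin/env python3
# Sketch: strict topology check on a fixed list of key sites per VO, following
# the suggestion above (no tolerance for these sites). Site names are examples.

KEY_SITES = {
    "cms": ["T0_CH_CERN", "T1_IT_CNAF", "T1_US_FNAL"],   # placeholder list
}

def check_key_sites(vo, prod_sites, pps_sites):
    """Return the key sites of `vo` present in one topology but not the other."""
    return [s for s in KEY_SITES.get(vo, [])
            if (s in prod_sites) != (s in pps_sites)]

# Example usage with toy topologies:
if __name__ == "__main__":
    prod = {"T0_CH_CERN", "T1_IT_CNAF", "T1_US_FNAL"}
    pps = {"T0_CH_CERN", "T1_IT_CNAF"}
    print(check_key_sites("cms", prod, pps))   # -> ['T1_US_FNAL']
</verbatim>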

For CMS the most significant profile is CMS_CRITICAL_FULL.

Regarding the priority level of the tickets, it is not acceptable for the SAM team to define it without the agreement of the LHC VO experts and the SAM developers. Otherwise, the SAM team would always have the power to force an update on the PS without the approval of the LHC VOs. Given that SAM is part of their computing operations, this might have serious repercussions. Besides, the sentence at the end regarding possible objections from the experiments seems to contradict the description of the SR process in this respect.

-- JuliaAndreeva - 19-Oct-2012
