In preparation for the next meeting, where we would like to discuss SAM operations, development and deployment aspects, and collaboration between SUM developers and developers of the VO tests, David prepared a document.

Here we collect feedback on this document from SUM developers and developers of the VO tests.

Summary of comments following discussion between the SUM developers and the developers of VO probes

1. Development

● Better integration of SUM in the SAM software cycle
○ Regular attendance of the SUM developer at the SAM Developers Standup meetings (twice a week, 30 mins max.)
○ Improve understanding of the content of future SAM releases and start integration of SUM at an earlier stage. SUM status update during meetings.

Agreed with a small modification:

The SUM developer will attend the SAM Developers Standup at most once per week.

○ Test suite for SUM validation is welcome for integration in SAM nightly validation.

The SUM team (Jarka) has developed sanity-check scripts for the SAM API. These will be provided to the SAM team so that they can be integrated into the SAM nightly validation.

● Probes development

○ Must happen in a local development instance. For development, the SAM team is using CVI VMs with the following configuration: SL5 x86_64 CVIVM image, 2GB RAM, 1 CPU, 40GB disk. On top of that we deploy the latest released SAM Update following the Installation Guide. Probes’ developers must configure a VO Nagios instance.

The developers of the experiment-specific probes have a different opinion on this point:

The developers of the experiment-specific probes require development Nagios instances (one per experiment) to which they have root access. They expect these instances to be provided by the SAM team with the latest SAM release deployed, and to be upgraded by the SAM team as new versions are released.

○ Probes packages must contain:
  ■ the probes’ implementation
  ■ the definition of package dependencies
  ■ the metrics configuration files:
    1. /etc/ncg-metric-config.d/.conf (configuration file for the metrics in the package)
    2. /etc/ncg/ncg-localdb.d/_param_override.conf (config file overriding config of generic metrics like org.sam*)
○ Probes teams build packages using Koji. These can be taken and deployed in the PPS and Prod VO SAM-Nagios by the SAM team at any point in time (when requested by the developers through SNOW)

No objections, apart from the mention of which tool should be used for building RPMs:

The developers of the experiment-specific probes are responsible for providing RPMs, but it is up to them to decide what to use for building them.

○ Probes packages are expected to be validated by the probes’ developers in their development VO Nagios and in the PPS instance, once deployed.

Fine

○ Access to SAM VO Nagios PPS and Prod instances is restricted to SAM service managers only.

Various kinds of troubleshooting, in particular debugging problems with the publishing of test results, require non-privileged access to the SAM VO Nagios PPS and Prod instances. Troubleshooting of VO SAM tests is currently performed by a dedicated team of experts in the ES group and is a high priority task. If the SAM team strongly insists that access to the SAM VO Nagios PPS and Prod instances be restricted to SAM service managers only, the SAM team should take responsibility for such troubleshooting and address all issues in a timely manner, handling it as a high priority task.

2. Staged Rollout

● The SAM team deploys a pre-release in the pre-production SAM-Gridmon and VO SAM-Nagios.
● The SAM team informs EGI Staged Rollout and any other interested group (in particular ‘sam-hep-vo-contacts’) to start the official validation process.
● Check list during the SAM Staged Rollout:
○ Schema validation: test that SAM APIs do not break. We expect that this has already been tested by the SUM team during the SAM nightly validation cycle (development phase). This is a second test to be performed by the SUM team.
○ Functionality validation: check that SUM functionality works (services and sites statuses and availabilities are returned). To be performed by the SUM team.
○ Topology comparison: if a VO feed is the same in both prod and PPS, topologies will be compared by the SAM team that will provide justifications in case of differences. Tier0 & Tier1 sites should be identical while for Tier2s, differences should be less than 10% for release acceptance (obviously differences in topology will be investigated and bugs fixed, but those few do not block a release.) To be performed by the SAM team.

SUM developers and ES experts who are in charge of the VO-specific SAM tests consider that a 10% topology difference is not acceptable. Since topology is provided by sources external to SAM (VO feeds and GOCDB/OIM), in principle it should not depend on SAM releases. Any inconsistency might be caused by a bug (introduced or fixed), so all inconsistencies should be investigated, understood and fixed if required.
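The topology check discussed above could be sketched as follows. This is only an illustration under stated assumptions: the input layout (a mapping from tier name to a set of site names), the function names, and the way the difference is measured (sites present on only one side, divided by all sites seen on either side) are hypothetical and not taken from SAM. The Tier2 limit is a parameter because its value is exactly what is under discussion here.

```python
def topology_diff(prod_sites, pps_sites):
    """Return (sites present on only one side, fraction of differing sites).

    Assumption: the difference is measured as symmetric difference / union.
    """
    differing = prod_sites ^ pps_sites
    union = prod_sites | pps_sites
    fraction = len(differing) / len(union) if union else 0.0
    return sorted(differing), fraction


def topology_acceptable(prod, pps, tier2_limit=0.10):
    """Tier0/Tier1 must match exactly; Tier2 differences must stay below tier2_limit.

    prod and pps map a tier name (hypothetical keys "Tier0", "Tier1", "Tier2")
    to a set of site names. With tier2_limit=0.0 any Tier2 difference fails.
    """
    for tier in set(prod) | set(pps):
        _, fraction = topology_diff(prod.get(tier, set()), pps.get(tier, set()))
        if tier == "Tier2":
            if fraction > 0 and fraction >= tier2_limit:
                return False
        elif fraction > 0:
            return False
    return True


if __name__ == "__main__":
    # Made-up site names, for illustration only.
    prod = {"Tier1": {"T1_A", "T1_B"}, "Tier2": {f"T2_{i}" for i in range(20)}}
    pps = {"Tier1": {"T1_A", "T1_B"}, "Tier2": {f"T2_{i}" for i in range(19)}}
    print(topology_diff(prod["Tier2"], pps["Tier2"]))
    print(topology_acceptable(prod, pps))                   # proposed 10% limit
    print(topology_acceptable(prod, pps, tier2_limit=0.0))  # strict reading
```

Whatever limit is chosen, the list of differing sites returned by topology_diff is what would need to be investigated and justified.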

○ Availability comparison: during five working days the SAM team will compare the daily availabilities for one specific profile per VO (to be decided which, VO_CRITICAL_FULL?) and will investigate and provide justifications for sites with differences above 10%. To be performed by the SAM team.

In order to get statistically significant results, the availability comparison between two SAM releases should cover an interval of at least one week, and the difference should not be above 5%.

Validation of the new release performed by the SAM team, including schema, topology and availability crosschecks, is not included in the two-week SR process. The SR process starts on condition that the schema, topology and availability comparisons have given satisfactory results.
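A minimal sketch of the availability comparison, assuming daily availabilities are available as percentages per site for the same days in both instances. The data layout and the function name are hypothetical. The threshold is a parameter, so the same check can be run with the proposed 10% or the 5% requested here.

```python
def flag_availability_differences(prod, pps, threshold=10.0):
    """Return {site: largest daily difference} for sites exceeding the threshold.

    prod and pps map a site name to a list of daily availabilities in percent,
    one value per day, in the same day order (an assumed layout).
    Only sites present in both instances are compared.
    """
    flagged = {}
    for site in sorted(set(prod) & set(pps)):
        diffs = [abs(a - b) for a, b in zip(prod[site], pps[site])]
        worst = max(diffs, default=0.0)
        if worst > threshold:
            flagged[site] = worst
    return flagged


if __name__ == "__main__":
    # Made-up values for five working days, for illustration only.
    prod = {"SITE_A": [100, 100, 90, 100, 100], "SITE_B": [100, 80, 100, 100, 100]}
    pps = {"SITE_A": [100, 94, 90, 100, 100], "SITE_B": [100, 95, 100, 100, 100]}
    print(flag_availability_differences(prod, pps, threshold=10.0))
    print(flag_availability_differences(prod, pps, threshold=5.0))
```

Sites returned by the function are the ones for which a justification would be expected before the release is accepted.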

● SAM Staged Rollout (SR) process:
○ The SR process will last 2 weeks. During this time, tickets can be opened against the release candidate deployed in the pre-prod instances. For the teams at CERN, this should be through SNOW.
○ For the opened tickets, the SAM team will define their priority level:
  ■ Blocker: Blocks development and/or testing work, pre-production cannot run.
  ■ Critical: Crashes, loss of data, severe memory leak.
  ■ Major: Major loss of function.
  ■ Minor: Minor loss of function, or other problem where easy workaround is present.
  ■ Trivial: Cosmetic problem like misspelt words or misaligned text.
○ Blocker, Critical and Major tickets will be fixed by the SAM Team before deploying a release in production.

The intended schedule of the SR process needs to be communicated to SUM & probe developers approximately one month in advance so that they can adjust their plans accordingly. SUM & probe developers should define the severity of the problem in the SNOW ticket, considering the impact on the experiments' operations. In some cases when major problems are detected, and by mutual agreement between the SAM team and SUM & probe developers, the SR process can be extended beyond two weeks for careful validation of the fixed version.
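The rule that Blocker, Critical and Major tickets must be fixed before a release goes to production can be expressed as a small gate. The sketch below is illustrative only: the Priority enum, the function names and the ticket identifiers are hypothetical and do not correspond to SNOW fields.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Priority levels from the SR process, ordered from least to most severe."""
    TRIVIAL = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4
    BLOCKER = 5


def blocks_production(priority):
    """Blocker, Critical and Major tickets must be fixed before production."""
    return priority >= Priority.MAJOR


def release_ready(open_tickets):
    """open_tickets maps a ticket id (hypothetical) to its Priority."""
    return not any(blocks_production(p) for p in open_tickets.values())


if __name__ == "__main__":
    tickets = {"TICKET-1": Priority.MINOR, "TICKET-2": Priority.MAJOR}
    print(release_ready(tickets))                         # False: a Major ticket is open
    print(release_ready({"TICKET-1": Priority.TRIVIAL}))  # True
```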

3. Deployment

● Once a release has been validated by the SR process, the SAM team will announce its future deployment in production at least two working days in advance.
● Downtime will be scheduled in GOCDB and the experiments will be informed by the SAM team through the WLCG Daily OPS meeting.
● If no objections are received from the experiments, the SAM team will deploy the release in the production instances: SAM-Gridmon and VO SAM-Nagios.

We suggest following a scenario similar to that used for CASTOR, EOS and other service deployments:

The intended schedule for deployment of the new SAM release is communicated via ‘sam-hep-vo-contacts’ or some other list created for this purpose which includes all interested parties. This schedule includes the estimated downtime of the service in case the intervention is not transparent. The schedule has to be agreed with the experiments within 3 working days, and the approval of the experiments is communicated via the same list. Two days before the upgrade, the announcement is sent to the list and is made at the WLCG Daily OPS meeting. Naturally, the start and the end of the upgrade are also communicated through the same list.
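The lead times mentioned in this section are counted in working days, so a small helper can make the resulting dates explicit. This is a sketch under simple assumptions: working days are Monday to Friday, public holidays and CERN closures are ignored, and the function name is hypothetical.

```python
from datetime import date, timedelta


def subtract_working_days(day, n):
    """Return the date n working days (Mon-Fri) before the given day."""
    current = day
    while n > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            n -= 1
    return current


if __name__ == "__main__":
    deployment = date(2012, 10, 22)  # a Monday, chosen for illustration
    # Latest date for an announcement made two working days in advance.
    print(subtract_working_days(deployment, 2))
```

For a deployment on a Monday, the weekend is skipped and the latest announcement date falls on the preceding Thursday.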

-- JuliaAndreeva - 19-Oct-2012

Topic revision: r8 - 2012-10-25 - JuliaAndreeva
 