Meeting 2011-02-16

Participants: Alessandro, Amol, Andrea, Julia, Lola, Maarten, Nicolò, Pablo, Stefan, Wojciech

Minutes (Andrea)

Julia raises the need to be able to use the new Programmatic Interface (PI) as soon as possible. There is already a Dashboard application prototype, but the PI is still incomplete. The slides presented to the MB mention multiple availability calculations but no timescale for having them; it would be very good to be able to use them.

Wojciech said that there are two separate steps to pursue in parallel: 1) to stop the SAM submission and use the old SAM database, and 2) to work on the new PI.

Julia pointed out that they are waiting for things to be done by IT-GT (not the other way around). For example, the test results are not yet available via the PI in JSON. What is the timeline for having the availability? It would be useful to start experimenting with multiple profiles per VO.
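
For illustration only, a minimal sketch of what fetching test results in JSON from the PI might look like; the base URL, method and all parameter/field names are assumptions, since the real API was still being defined at this point:

    # Minimal sketch, not the real PI: the endpoint, the "latest_results" method
    # and all parameter/field names below are assumptions for illustration.
    import json
    import urllib.request

    BASE_URL = "https://sam-pi.example.org/api"   # hypothetical host

    def latest_test_results(vo, profile):
        """Fetch the latest test results for a VO/profile as JSON (illustrative)."""
        url = "%s/latest_results?vo=%s&profile=%s&output=json" % (BASE_URL, vo, profile)
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))

    if __name__ == "__main__":
        for result in latest_test_results("cms", "CMS_CRITICAL").get("results", []):
            print(result.get("host"), result.get("metric"), result.get("status"))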

Maarten: for example the WLCG MB will want to see the effect on the availability of requiring a functioning CREAM at each site.

Pablo: one needs an algorithm also for the site availability, while right now the availability is calculated only for the service.

Julia: part of the algorithm is to define in the profile how to group different services.

Wojciech: people are working on implementing the OR of the LCG-CE and the CREAM-CE.

Julia: this is very important for CMS because they will have this feature in the Dashboard very soon and would not like to take a step back in functionality. Moreover, the agreement was that the Dashboard does only the visualisation. Finally, we could not do the OR in the Dashboard using Nagios information because the PI lacks some necessary information.
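
As an illustration of the kind of grouping being discussed (not the actual ACE algorithm), a sketch in which equivalent CE flavours are OR-ed inside a service group and the site status is the AND of its groups; the profile structure and flavour names are assumptions:

    # Illustrative sketch only: compute a site status from per-flavour service
    # statuses, OR-ing equivalent CE flavours and AND-ing the resulting groups.
    # The grouping below is an assumption, not the real profile schema.

    PROFILE_GROUPS = {
        "CE": ["LCG-CE", "CREAM-CE"],   # either flavour counts
        "SRM": ["SRMv2"],
    }

    def group_ok(statuses, flavours):
        """A group is OK if at least one instance of any of its flavours is OK."""
        return any(s == "OK" for flavour in flavours for s in statuses.get(flavour, []))

    def site_ok(statuses, groups=PROFILE_GROUPS):
        """The site is OK only if every service group in the profile is OK."""
        return all(group_ok(statuses, flavours) for flavours in groups.values())

    # Example: LCG-CE down but CREAM-CE up, SRM up -> the site is still OK.
    print(site_ok({"LCG-CE": ["CRITICAL"], "CREAM-CE": ["OK"], "SRMv2": ["OK"]}))  # True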

Wojciech explains that the reason to stop the SAM test submission is to get rid of two machines used by VOs to submit the SAM tests. Andrea says that this is not the case for CMS and the SAM tests can be switched off only after successful validation in the Dashboard.

Julia stressed again the importance of knowing exactly which features will be made available and when, and that the final validation with the new PI will take at least one month, possibly more. Wojciech agrees and invites everyone to discuss specific problems offline after the meeting.

Pablo asks if there are methods in the API for both the reliability and the availability. Amol mentions some known issues in the XML.

Julia proposes to circulate an email with the specific points they are most concerned about, to be completed with solutions and timescales.

Pablo raises some issues with the ATP, for example the fact that the CMS feed has implemented some required changes but in doing so has become invalid according to the XML definition. The changes (related to grouping sites) will be submitted to all VOs for approval.

Given the delay in having an ATLAS feed from AGIS, Alessandro would like to find a backup solution and, for this, to try to reuse the experience of the other experiments (also in terms of code).

Pablo also mentions that they do not necessarily know that there is a problem with an ATP VO feed (on one occasion it was compliant but still rejected). It would be useful to be automatically notified in case of problems with the VO feeds.

Alessandro suggests starting to write FAQs for the error messages we receive via email from Nagios, because it should be clear whether they are important and whether they require an action. Very often all VOs get them, which is useful to know in order to understand if a problem is VO-specific or not.

Alessandro also mentions that there is evidence that the downtimes are not always correctly reported in Nagios. Wojciech clarifies that they are used only for disabling error notifications for services undergoing an intervention, not to calculate the reliability.

Finally, Alessandro mentions the bug affecting the publication to the message bus of the test results for OSG. It turns out that ATLAS is the only experiment for which a workaround is in place, while the others (in particular CMS) are suffering from it. It is decided to apply the workaround also for CMS.

Meeting 2010-11-30

Participants

Alessandro, Andrea, Julia, Pablo, Patricia, Roberto.

Minutes (Andrea)

Database API

Malik's API will be dropped as many methods do not make sense. The final API will be described in a document available very soon. It will be provided via myEGI and released internally before Christmas, to be tested by the Dashboard team. The new API will require some changes in the Dashboard code. After the meeting David, Julia and Pablo went through the API documentation together.

Profiles

It was clarified that the algorithm (i.e. the formulas) will be the same for all VOs. One can also define a profile without an availability, for example to define a set of services and the metrics to be executed on them. Profiles are currently created directly in the database via SQL insertion, but an interface to create and edit profiles is being worked on: a version limited to displaying a profile definition is to be released by the end of February, while editing profiles should possibly be available by the end of April.
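
Purely as an illustration of what a profile (with or without an availability) might contain, a hypothetical definition; the actual MDDB schema is not reproduced here and the metric names below are made up:

    # Hypothetical profile structure for illustration only; at the time of this
    # meeting profiles were inserted directly into the database via SQL.
    profile = {
        "name": "VO_CRITICAL_EXAMPLE",          # made-up profile name
        "vo": "cms",
        "metrics": {                            # service flavours and the metrics run on them
            "CREAM-CE": ["org.vo.CE-JobSubmit", "org.vo.WN-Basic"],
            "SRMv2": ["org.vo.SRM-Put", "org.vo.SRM-Get"],
        },
        "availability": None,                   # no availability attached to this profile
    }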

ATP

It is already there, fed with the experiment XML topology feeds (ATLAS is still missing). It also contains downtime information. Downtimes can refer to whole sites or to single instances, like today in SAM, and the fact that a site is down for a given VO must be established based also on the profile information. The API for getting downtimes will also change and will be provided by myEGI. Julia reminded that downtimes are important not just for the reliability but also to display the downtime information itself.

Concerning ATLAS, at least for some time the XML will have to be generated using also BDII information, while the final goal is to have it generated by AGIS (but not before three months at least). CMS is already using only information from the SiteDB. The LHCb XML will get some updates.

[ACTION]: implement an alarm if the XML is too old, maybe distinguishing the case when the XML is old and when the information in the database is old.
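
A possible sketch of such an alarm; the file path, the threshold and the way to distinguish the two cases are assumptions:

    # Sketch of the proposed staleness alarm; the path, the threshold and the
    # notification mechanism are placeholders.
    import os
    import time

    MAX_AGE_HOURS = 24   # arbitrary threshold

    def feed_age_warning(path, max_age_hours=MAX_AGE_HOURS):
        """Return a warning string if the topology XML has not been refreshed recently."""
        age_hours = (time.time() - os.path.getmtime(path)) / 3600.0
        if age_hours > max_age_hours:
            return "Topology feed %s is %.1f hours old" % (path, age_hours)
        return None

    # Distinguishing "the XML file is old" from "the information in the ATP database
    # is old" would require a second check against the timestamp stored in the database.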

MDDB

See above for creating and editing profiles.

ACE

Refer to the WLCG MB slides.

MRS

No news.

myEGI

myEGI on the experiment Nagios instances will be upgraded as soon as the new Nagios is released (today or tomorrow); myEGI itself will follow this week or next week. Four virtual machines are being sought for the Nagios validation instances.

Issues

The first issue concerns reports coming from the Nagios boxes, which are not a problem: Alessandro asked if we have to worry, and Wojciech answered that they were due to the installation.

The second issue was for the OSG CEs: on the OSG WNs the WN test cannot publish the information to the message bus due to a missing environment variable used only for discovering the message broker (LCG_GFAL_INFOSYS). Konstantin changed the code so that it does not perform the discovery. The ATLAS machine has a more recent version; this could be the explanation if the problem was not seen for CMS. The problem is not fixed yet.
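
For illustration, a hedged sketch of the kind of change described (the actual fix lives in the SAM WN framework and is not reproduced here): skip the BDII-based discovery when LCG_GFAL_INFOSYS is not set and fall back to a configured broker.

    # Illustrative only: fall back to a preconfigured broker when the BDII-based
    # discovery cannot run because LCG_GFAL_INFOSYS is not set (as on OSG WNs).
    # The default broker and the discovery function are placeholders.
    import os

    DEFAULT_BROKERS = ["msg-broker.example.org:6163"]   # hypothetical fallback

    def get_message_brokers():
        bdii = os.environ.get("LCG_GFAL_INFOSYS")
        if not bdii:
            # No information system available (e.g. OSG worker nodes):
            # skip the discovery step entirely and use the fallback list.
            return DEFAULT_BROKERS
        return discover_brokers_from_bdii(bdii)

    def discover_brokers_from_bdii(bdii):
        # The real code queries the BDII; the lookup is not shown in this sketch.
        raise NotImplementedError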

David said that the Nagios nodes will be given to ES when they are production-ready. As this was not what the people from ES had previously understood, this point will need to be clarified. Alessandro said that he had understood that ES would just have to provide the RPMs with the experiment probes. Julia wondered who should provide the MSG system used by the experiments: if it is a central service, it should be run by PES. David says that for him it is clear that it should not be GT's responsibility.

Meeting 2010-10-12

Participants

Alessandro, Andrea, David, Nicolò, Pablo, Patricia, Roberto.

Minutes (Andrea)

  • David: The WLCG Management wants Gridview to compute availability of VOs for three months and compare Nagios with SAM.
  • Ale: for ATLAS we are testing only a few sites and we need to be able to exclude the service instances we do not want to test (for example DPM at CERN). Moreover, AGIS has only SRM instances in the XML for now.
  • David: I will check with Emir how it can be done. (Note: information on how to target specific services was sent to Alessandro the day after this meeting.)
  • Roberto: LHCb has the same problem. For the moment Nagios is running only at Tier-1 sites; in fact, SRM tests are not relevant at Tier-2 sites, while CE tests are relevant but currently not run at Tier-2s.

  • Ale: in Gridview, shall we have the same critical services?
  • David: in the future it will be decided in the algorithm definition that can be different for each VO, and VOs will have full control on it. There may even be a different algorithm for different sites (for example Tier-1 and Tier-2).

  • Ale & Andrea: ATLAS and CMS would like LCG CEs and CREAM CEs to be treated as equivalent in the availability calculation.
  • David: it will be possible, it just needs to be put in the algorithm.
  • Patricia: it is not necessary for ALICE as the LCG CE is not used any longer.
  • Ale: we would also like to use the direct submission to CREAM.
  • Andrea: there is already a probe for that, which uses the CREAM CLI.

  • David: we should have the same three tests in SAM and Nagios for ALICE (now one is missing from Nagios).
  • Patricia: I will do it.
  • David: same for LHCb, and it is a critical test.
  • Roberto: it should not be critical and in any case it will be stopped as it overlaps with the SRM tests.
  • David: there is also for LHCb a file access test that is in the SRM sensor in SAM and in the WN for Nagios.
  • Roberto: it can also be removed.
  • David: for CMS, is the CREAM test the only one missing?
  • Andrea: also the CE and the WN tests with the production role. They need a new feature which should have a ticket. I will find the ticket number.

  • David: for tomorrow's GDB I will not show plots but wait one month and then recalculate the October numbers.

  • Pablo: it is not yet clear how the availability will be calculated and made available for the Dashboard.
  • David: it is still to be decided.
  • Pablo: do we need to continue supporting the Dashboard SAM portal and calculate the availability in the Dashboard? We need to know if more effort is needed on it.
  • David: about the availability you will be able to take it from the Nagios database.
  • Pablo: there are some cases where the current availability algorithm in the Dashboard is inadequate, for example at SARA for LHCb. This is why we would like to know if we have to make some development on our side.
  • Andrea: anyway we can adopt the interim solution consisting of just taking the Nagios test results from the SAM database without changing the algorithms.
  • David: there was a person working for ES who wrote a PI to extract data from the Nagios database. I saw his work on the day he was leaving, when he asked me to put his code into our SVN repository, which I did (http://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/metric-store/mrs-web/). We have opened a JIRA task (https://tomtools.cern.ch/jira/browse/TTASK-25) to validate and release this code, but I don't know if it will work. ACE availability data will be retrieved through a programmatic interface, but this is not implemented yet.
  • Andrea: probably for some time there will be no other option than to have two different availabilities calculated by the Dashboard: one for Nagios and one for SAM. This will make it possible for the experiments to migrate from SAM to Nagios without a discontinuity.
  • Ale: for the next meeting it would be good to know what the plans are for the historical information and in particular whether it is possible to correct the availability a posteriori in case there was a bug in a test. This is important to prevent the MB from using as input plots that are known to be wrong.

Meeting 2010-09-01

Participants: David, Roberto, Alessandro, Nicolò, Patricia, Andrea, Pablo

Minutes (Andrea and Pablo)

Status of migration:

ALICE

Ready to move to the production instance of Nagios and to switch off SAM. The Nagios tests are still not visible in MonALISA, but that can be fixed later on. There are plans to test the CREAM CE directly, instead of going through the WMS. At the moment there are 90 VOboxes and two services being tested.

Once they move to production, David can make the Nagios test results visible both in the old system and in the new system. This will make comparing the results much easier.

ATLAS

SRM tests have been improved (in collaboration with CMS/LHCb). There are 6 tests that can be run by each VO. At the moment they don't test the LFC. The missing things at the moment are the correct formatting of the output and the verification. There is also a new CE test (local-file-access). In total there are 4-5 CE tests.

ATLAS can also publish the results of external tests, thanks to NEET (Nagios External Experiment Tests) (presented also here). This is already working, documented, and running for two tests. It can be used also by other VOs. The next step is to put it in the pilot job framework, although there are still discussions about the number of entries, the frequency and the number of connections.

CMS

The CE tests are running, although only with one role. Emir is working on a feature that would allow several proxies. It should be ready in one of the next releases (David checked after the meeting, and this is supposed to be already there (sam ticket)). ATLAS would also profit from this feature. The SRM tests are OK. All the tests are running on all sites. Andrea gets the list of sites from SiteDB and then runs the tests on all the services running on those sites.

LHCb

All the tests have been migrated to Nagios. LHCb is running two different profiles in parallel (glexec and wn). They already have more tests in Nagios than they used to have in SAM. Of the CE tests, 20% are still missing, though: those are the tests that go through DIRAC. Using NEET would help to publish these results, and they are looking into that. The SRM tests are also working fine. They get the same results in Nagios and in the old SAM DB.

Topology

If there is an XML that satisfies the XSD template, the information is already in the ATP development database. However, this is only happening right now for CMS (XML provided by the Dashboard team). For ATLAS there is a minor problem with some of the tags in the XML, and the LHCb link was not there (it was put there shortly after the meeting, but the XML doesn't satisfy the XSD). At some point this will move to the validation and production instances, but not in today's release.

Joshi will set up an ATP web interface to the development database, and we will be able to see this information.

The missing part now is to make LCG-utils use the ATP. Emir already did something similar for Ops (although that one doesn't have the complexity of experiment-defined site names). Alessandro asked if it would be possible to get the topology from two different sources (the BDII and the XML feed, for instance). David said that this wouldn't be a good idea (Andrea and Pablo agreed with David).

Alessandro also asked how the tests would get the ATP information (and in particular the space tokens). David suggested two ways:

  • From the test, contact the ATP interface
  • Defining one test per service, and passing the information either as an argument, an environment variable or a file. Still to be developed (see the sketch below).
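
A minimal illustration of the second option, with made-up argument and variable names:

    # Illustrative only: a per-service test reading its space token either from a
    # command-line argument or from an environment variable. All names are made up.
    import argparse
    import os

    def main():
        parser = argparse.ArgumentParser(description="Example per-service SRM probe wrapper")
        parser.add_argument("--endpoint", required=True, help="SRM endpoint to test")
        parser.add_argument("--space-token", default=os.environ.get("VO_SPACE_TOKEN"),
                            help="space token to test (falls back to $VO_SPACE_TOKEN)")
        args = parser.parse_args()
        print("Would test %s with space token %s" % (args.endpoint, args.space_token))

    if __name__ == "__main__":
        main()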

Andrea and Nicolò asked the SiteDB developers to provide the XML with the topology definition. Until that's ready, the Dashboard provides this feed (although without the space token information for the time being).

Availabilities

Once the experiments move to the production database of Nagios, the results can be published both in the old database and in the new one. In both of these databases, the current Gridview algorithm can be used to calculate the availabilities. This way it will be easy to compare both systems. David will change the criticality of the tests directly in the database. For the time being, if the experiments want to change the criticality of the tests in Nagios, they will have to ask David. At some point there will be an FCR-like interface that the experiments can use for that.

The next step would be to use the new ACE (developed by the Gridview team), which will implement the new availability calculation. This is under development.

Downtimes

ATP provides information about the downtimes. It also displays downtime information using the ATP site names. At the moment there are some issues that still have to be solved. In particular:

  • American downtimes are not there. According to David, the ATP web interface is pointing to a local MySQL instance running on that box, instead of the ATP Oracle production central DB, as it should. Once they move to the Oracle instance, everything will be there.
  • The interface doesn't display the services that are affected by a downtime; it always displays the whole site as being in downtime (sam ticket).
  • It is not possible to search for downtimes that have been modified since a specific point in time (sam ticket: https://tomtools.cern.ch/jira/browse/SAM-775).
  • The current XML doesn't include the last modification time for each downtime (sam ticket and ticket).

Pablo asked if the XML with the list of downtimes could be modified to specify whether all the instances of a particular service are in downtime or whether there are other working services. As an example, the current XML format for a downtime is:

    <resource>
      <atp_site>CSCS-LCG2</atp_site>
      <vo_sitename>T2_CH_CSCS</vo_sitename>
      <severity>OUTAGE</severity>
      <service/>
      <starttime>01-12-2009 07-00-00</starttime>
      <endtime>02-12-2009 16-00-00</endtime>
    </resource>

For the Dashboard applications, it would be nicer if it included:

    <resource>
      <atp_site>CSCS-LCG2</atp_site>
      <vo_sitename>T2_CH_CSCS</vo_sitename>
      <severity>OUTAGE</severity>
      <service>CE</service>
      <host>grid01.cscs.ch</host>
      <totalNumberServices>2</totalNumberServices>
      <starttime>01-12-2009 07-00-00</starttime>
      <endtime>02-12-2009 16-00-00</endtime>
    </resource>

This would let us know that there is one CE in downtime in that site, but that there is another CE which is not in downtime. David said that this was outside the scope of ATP, and it should be implemented elsewhere. Andrea and Alessandro agreed with David.
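
If the extra fields were ever provided, a Dashboard-side check could look roughly like this; the tag names follow the proposal above (which was not agreed), and the feed is assumed to wrap the <resource> entries in a single root element:

    # Sketch of a Dashboard-side check on the proposed (not agreed) downtime format:
    # a service at a site counts as fully down only if the number of distinct hosts
    # in downtime reaches the declared total number of instances.
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    def fully_down_services(feed_xml):
        affected = defaultdict(set)   # (site, service) -> hosts in downtime
        totals = {}                   # (site, service) -> declared total instances
        for res in ET.fromstring(feed_xml).findall(".//resource"):
            key = (res.findtext("atp_site"), res.findtext("service"))
            affected[key].add(res.findtext("host"))
            totals[key] = int(res.findtext("totalNumberServices", "0"))
        return [key for key, hosts in affected.items()
                if totals.get(key, 0) > 0 and len(hosts) >= totals[key]]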

Meeting 2010-07-27

Took part:

David, Konstantin, Alessandro, Nicolò, Patricia, Andrea, Roberto, Julia, Jamie

Current status

Described in detail in the slides from Andrea (see attached file).

To summarise the current status:

Most of the tests are ported to Nagios. Some of them are already running as a constant flow (ALICE tests, CMS worker node tests, though not to all sites). In general (for all VOs having SRM tests) the SRM tests are not yet running periodically, but a first basic version of the SRM test is available.

There is also a set of tests which are currently submitted by the VOs without using the SAM framework (ATLAS, LHCb), which only publish their results to SAM. This category will stay in the future (these tests won't be submitted through Nagios). In this case the test results should be reported to a specific topic on a dedicated MSG instance, and the Nagios server should be able to consume these results from there. It is clear what has to be done from the technical point of view, but the implementation has not started yet. The part covering the instrumentation of the experiment workflows submitting the tests and the publication to MSG could be a good project for a student or a short-term visitor. Julia will think about such a possibility.
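
As an illustration of the publication step, a rough sketch assuming the stomp.py STOMP client and made-up broker, topic and field names (the actual message format and destination are defined by the SAM/Nagios team):

    # Illustrative only: publish one externally produced test result to a message
    # broker topic so that the Nagios server can consume it. The broker address,
    # topic, field names and the use of stomp.py are assumptions.
    import time
    import stomp

    BROKER = [("msg-broker.example.org", 6163)]           # hypothetical broker
    TOPIC = "/topic/grid.probe.metricOutput.vo-example"   # hypothetical topic

    def publish_result(host, metric, status, detail):
        body = "\n".join([
            "hostName: %s" % host,
            "metricName: %s" % metric,
            "metricStatus: %s" % status,
            "timestamp: %s" % time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "detailsData: %s" % detail,
        ])
        conn = stomp.Connection(BROKER)
        conn.connect(wait=True)
        conn.send(destination=TOPIC, body=body)
        conn.disconnect()

    # publish_result("ce01.example.org", "org.vo.ExternalTest", "OK",
    #                "result produced by the VO's own test framework")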

There are a couple of issues mentioned by Andrea:

  • Output is sometimes truncated. This needs to be debugged (in fact it was debugged and looked to be fixed the day after the meeting).
  • Test results for OSG sites are not published, since the environment variable containing the CE name is not the same at the OSG sites as at the gLite sites. It was decided to submit a ticket to the interoperability group in order to have a common environment variable containing the CE name on both infrastructures. As a workaround, Konstantin will provide a fix at the framework level checking for the OSG environment variable containing the CE name.
  • VO tests should be submitted with two different roles depending on the test. Currently this is not possible, since the same VO cannot have two different proxies, but Emir is implementing this feature. It should be available either with the coming release or with the next one.

As a general comment from Konstantin: If people enabling VO tests in Nagios notice some lack of functionality, they should not try to fix this with some workaround, but should communicate their request to the SAM/Nagios team.

RPMs are currently created partially manually, partially with Koji.

ALICE would like to move to Nagios-based tests as soon as possible, but August is not a good time (many people are on vacation). The test results have to be imported into MonALISA. ALICE is planning to do it in September. When ALICE drops the SAM-based tests, the results of the new ones will be published both to the new and to the old SAM DB, so the Dashboard availability calculation can be preserved with minor changes until the other VOs have migrated completely and validated the results in the new system.

Integration of Nagios with VO topology

We had several meetings where we discussed the common XML structure for the topology description. The agreed structure is published on the TWiki.

CMS
Currently an XML with the CMS topology is provided by the Dashboard (implemented by Pablo), but in the future it should come from the primary information source, which is SiteDB. The corresponding bug was submitted by Andrea. It is not clear whether all the information required for the SRM test is there; this has to be checked with Nicolò.

ATLAS
The ATLAS AGIS system provides an XML in the agreed format. Alessandro found a few problems there, which he is following up with the AGIS developers. The problems concern the OSG sites, so at least the EGEE sites can be tested using the AGIS XML.

LHCb
A cron job querying the DIRAC DB publishes an XML in the agreed format.

ALICE
For the moment ALICE uses pure BDII information for the CE tests, which are run by OPS, and there is a file published by Patricia containing the list of ALICE VOboxes, which is fed into Nagios (implemented by David).

The correct way to go would be to use a single information source per experiment (the topology API), insert this data into the ATP, and have Nagios, as well as the component which calculates the availability, consume the experiment topology from the ATP. Currently it is not happening like this: different information sources are used, with the BDII probably being the main one. Since the information in the BDII is not always complete and up to date, some filtering has to be done to define where tests should be submitted. So we should really try to get rid of this workaround and rather use the experiment topology.

We have to work on this issue. One thing is to make sure that the experiment topology APIs are correct and contain all the necessary information, and to coordinate with the experiments in case this is not yet true (ES group). The second thing is to make Nagios consume the experiment topology from the ATP and use it for test submission. Since this work requires the participation of different experts, it has to be checked whether August is a good time for it. Konstantin will check with Emir and let us know.

Meanwhile the workaround described above will be used to define in Nagios where tests should be submitted.

Validation

In order to help with the validation of the test results, the Dashboard team will try to provide the people running tests with a simple interface to compare the test results in the old and new systems.

Since the availability calculation is not yet in place, the usual Dashboard availability plots cannot be used for validation now. As soon as the availability calculation is in place, the Dashboard availability UI can be used to compare the results and then to present the results of the comparison to the MB.

Topic attachments

SAM-Nagios.ppt (PowerPoint, 121.0 K, 2010-07-28, JuliaAndreeva)