LCGSCM Monitoring, Logging & Reporting Status

9 July 2008

  • Tue 2nd July - CERN network problem prevented the SAM BDII from reading site BDIIs (1 hour)
  • Wed 3rd July - All services in an 'ERROR' status due to host-cert tests failing - a package was missing on the SAM UI
  • Some results corrected for June in Gridview (22-23, 6th) for outages. New general procedure being put in place to 'mask' results

25 June 2008

  • SAM - CERN hit limit of 100 nodes over weekend which stopped gstat tests working.
  • DB issues in GV to be covered in meeting with DB Devs.
  • Deployed new messaging based gridftp producers on all CERN disk servers. Testing message based L&B reporting system. Will be sent to certification in the next few days. When deployed outside of CERN we'll turn off R-GMA and WS based publication at the same time.
    • monb001 - R-GMA box to be shut off.

30 Apr 2008

  • SAM - DB intervention yesterday (Tue) to fix some tables in the GV schema which had problems last time. All went ok - SAM turned off for 2 test submission cycles (2 hours)
  • elog - moved from VM to a 'real' machine for duration of CCRC'08 phase 2

19 Mar 2008

  • SAM - downtime Friday lunchtime - Monday
    • Due to bad config + human error (didn't check)

  • Both:
    • DB team wants some space back. Short term - delete some CLOBs (~150 GB?)
    • Mid-term - produce policy on data expiration and approval by MB
    • Finally - move to solution where we purge daily/monthly the data from the schemas

27 Feb 2008

  • SAM UI upgrade still ongoing - SAM tests running at 50% frequency
  • Problem with SAM SRM Tests - weren't run for 3 days (they had been scheduled only on the SAM UI which was out of service)

13 Feb 2008

  • Final gridview services moved to new hardware. Old machines will be returned next week.
  • Gridview/SAM will need a downtime to clean up the old entries in the table. The advantage of using the downtime is that we can partition for the future.

23 Jan 2008

  • Gridview frontend will move to new hardware next Wednesday. Fully Quattor managed. Other parts of the service will move over the next 2 weeks.

16 Jan 2008

  • Nothing to report

21 Nov 2007

  • Hardware acquired for new Gridview service - these are mid-range server nodes already

14 Nov 2007

  • OSG 0.8.0 released Nov 1st. With the release we will get full SAM tests for OSG sites appearing in the production SAM instance.

10 Oct 2007

  • SAM Unavailability (Tue 02 Oct 2007 16:30 - Wed 03 Oct 2007 12:00) - DB Problems - understood and resolved
  • A new version of Gridview (new summarization algorithm) slipped into production last week
    • Work starting on turning Gridview service into production - based on work done for SAM service
  • lxdpm101 (dpm for SAM testing infrastructure) now seems to be important - should move into a more "production" state
  • Successful demo of nagios prototype & gridmaps prototype at EGEE conference (http://gridmap.cern.ch/gm)
  • First data published directly from OSG into SAM

22 Aug 2007

  • Nothing to report

15 Aug 2007

  • GridView new service availability calculation published and waiting approval by MB/GDB.
  • WG telecon. discussed deployment issues/testing of OSG's own and Nagios-based probe runners for OSG custom probes.
  • EDS consultant prototyping heatmap display for high-level view of site availability/reliability.
  • Testing SLC4/glite-3.1 UI/Nagios monitoring combination - some minor issues to be resolved.

11 July 2007

  • Work on-going on new service availability calculation for GridView. Will be presented to MB/GDB for approval/notification

30 May 2007

  • R-GMA based grid publication removed from CERN disk servers. Working now on rolling out new publication mechanism (SAM ws-client) and removing R-GMA based publication from other sites.

24 Apr 2007

  • R-GMA team have said that they think they've solved the TIME_WAIT socket problem which required frequent reboots of mon boxes.

18 Apr 2007

  • We have a consultant on-site from EDS, who is 60% on monitoring for 1 year. He may help out on the architecture of the new monitoring system.
  • Memory leak found in Gridview producer for gridftp logs on castor - fixed and fix deployed.
  • R-GMA not scaling for JobWrapper tests. Turned off R-GMA publication - Piotr provided instructions to site-admins in the operations meeting.

14 March 2007

  • At the OSG All-Hands meeting, discussions were had about OSG deploying SAM for testing OSG sites using their own custom probes. They've a target of June for this.

28 Feb 2007

Issues

  • Our monbox monb001 is completely overloaded and cannot cope with the number of requests it gets. We see more and more timeouts and we have to reboot the machine quite often. Lemon shows the number of processes is constantly between 1000 and 2000+.

How can we make this service more robust? Is it possible to load balance this service and if so, how?

21 Feb 2007

Gridview WS client deployed on Castor seems to be running well - R-GMA is still losing data.

14 Feb 2007

Nothing to Report

31 Jan 2007

WLCG Workshop
  • Monitoring BOF and monitoring session on reliable services held with large participation (80 people in BOF)

17 Jan 2007

Work in Progress
  • Site survey done for monitoring tools currently used. Presentation to be made next week at Monitoring BOF
Topic revision: r27 - 2008-07-09 - JamesCasey