GD Group Report for C5-30-Nov-2007
----------------------------------

LCG deployment:
---------------

- Total number of Sites (*): 245

- Software -> Num. Sites (*):
    gLite-3_1_0 ->  88
    gLite-3_0_2 -> 132
    gLite-3_0_0 ->   3
    LCG-2_7_0   ->   6
    unknown     ->  16

- Status -> Num. Sites (*):
    ok       -> 162
    degraded ->  21
    down     ->  62

- Average of concurrently running jobs during this week (+): ~17k

(*) Sites that are Certified _and_ Production _and_ Monitored by SAM:
    https://lcg-sam.cern.ch:8443/sam/sam.py
    To see this page one needs a grid certificate loaded in the browser.
    The calculation of the Site availability (Status) is described at:
    http://goc.grid.sinica.edu.tw/gocwiki/SAM_Metrics_calculation
    The software version comes from the 'CE-sft-softver' CE test.
    Sites that do not support the SAM 'CE' service, or that have not sent
    results for this particular test during the last week, are counted as
    'unknown'.

(+) Job statistics taken from GStat:
    http://goc.grid.sinica.edu.tw/gstat/
    http://goc.grid.sinica.edu.tw/gstat/total/GIISQuery_Usage_cpu_.html
    For the time being we do not report CPU numbers:
    1. Not all the reported CPUs are actually available for grid jobs.
    2. Sites with multiple CEs may have their CPUs double-counted.
    3. GStat includes sites that are not considered by the SFTs.

CERN Grid Operations managed by GD:
-----------------------------------
* gLite WMS and LB nodes are now fully quattorized (the apt package has been
  removed and is no longer used for middleware installation).
* We still have the same problems with the gLite WMS nodes (bug in the
  workload manager service), so we need to put these machines in drain mode
  from time to time. We are still waiting for the patches, or for the new
  SLC4 version of the middleware.
* rpmverify installed and configured on all LCG RB, gLite WMS and LB nodes.
* Installation and configuration of a new gLite DPM (lxdpm104) node for SAM.
* One LCG RB node dedicated to CMS (rb128) has been down since last week
  because of hardware problems (the motherboard has been changed but the
  RAID controller has not been replaced yet).
* Current status of the gLite WMS, gLite LB and LCG RB nodes can be found at
  https://twiki.cern.ch/twiki/bin/view/LCG/CurrentStatusWMSLBNodesCERN

CERN ROC:
---------
* GGUS6 tests completed (including cross-tests with other ROCs) and documented
* GGUS6 - PRMS tests (checking the correct interaction between GGUS6 and
  CERN Remedy) completed and documented
* Prepared for the GGUS upgrade on Nov. 29th

CERN GRID Pre-Production Site (CERN_PPS):
-----------------------------------------
* Nothing to Report.

EGEE Pre-Production Service Coordination:
-----------------------------------------
* gLite3.1.0-PPS-UPDATE10 was released to PPS and is currently in the
  pre-deployment testing phase. This update introduces a number of new
  services to gLite 3.1 for SL4 (32 bit):
   - glite-AMGA_postgres
   - glite-LFC_mysql
   - glite-LFC_oracle
   - glite-PX
   - glite-SE_dpm_disk
   - glite-SE_dpm_mysql
   - glite-VOMS_mysql
   - glite-VOMS_oracle
  Pre-deployment testing has finished for all services, with these exceptions:
   - LFC-Oracle (not tested)
   - dpm_oracle (not tested)
   - VOMS-oracle (not tested)
   - dpm_mysql (still in test)
  We are studying a way to move the services already tested forward to PPS.
* Release of gLite3.0 Update37 to production is in preparation (to be
  announced today). The release contains:
   - MySQL server/client update
   - Updated Torque and Maui
   - FTS Cancellation

WLCG Transfer Service:
---------------------
* Transfer rates ranging from 100 to 470 MB/s, with a daily average of
  around 220 MB/s.
* Involving all major T1 sites.
* Mostly traffic from CMS
* 0 open tickets in total
* Throughput plots: http://gridview.cern.ch/GRIDVIEW/

Service Availability Monitoring (SAM):
---------------------------------------
* Unavailabilities:
  https://twiki.cern.ch/twiki/bin/view/LCG/SAMProdServUnavail
   - From Tue 27-11-2007 23:00h to Wed 28-11-2007 13:00h
     Reason: SAM DB overload
   - From Thu 29-11-2007 00:00h to 09:00h
     Unreliable availability metrics due to an overloaded RB.
* Successfully tested publishing OSG test results into the SAM validation DB
  using the new messaging system.

Grid Data Management:
---------------------
Nothing to report.

Gridview:
---------
* Added a feature to the Gridview site availability reporting programs to
  support classifying sites according to EGEE ROCs. Previously these programs
  could only generate reports with the WLCG-specific classification
  (T0/T1/T2).

gLite 3.x Build & Integration:
------------------------------
* Certification repository
   - gLite 3.0 (patches)
       1 in preparation, 1 in configuration, 15 in certification
   - gLite 3.1
       3 in preparation, 6 in configuration, 27 in certification
* PPS repository
   - gLite 3.0
      .. Next set of patches scheduled for release to PPS
           #1369  SLC3/i386/R3.0 DPM/LFC 1.6.7-2                   7 - High
   - gLite 3.1
      .. 3.1.0 PPS Update 10 in preparation
           #1349  glite-LFC_mysql metapackage for gLite 3.1/SLC4
           #1350  glite-SE_dpm_disk metapackage for gLite 3.1/SLC4
           #1352  glite-SE_dpm_mysql metapackage for gLite 3.1/SLC4
           #1541  glite-LFC_oracle metapackage for gLite 3.1/SLC4
           #1420  glite-PX metapackage for gLite 3.1/SLC4
           #1472  glite-AMGA_postgres metapackage for gLite 3.1/SLC4
           #1501  glite-VOMS_oracle metapackage for gLite 3.1/SLC4
           #1540  glite-VOMS_mysql metapackage for gLite 3.1/SLC4
      ..
         Next set of patches scheduled for release to PPS
           #1512  3.1 VOBOX                                        5 - Normal
           #1516  glite-yaim-core 4.0.3 for the 3.1 repository     5 - Normal
           #1521  Updated glite-info-templates                     5 - Normal
           #1531  Updated glite-info-generic                       5 - Normal
           #1544  patch for bug 29600                              5 - Normal
           #1545  glite-yaim-lcg-ce 4.0.2-1 for gLite 3.1          5 - Normal
           #1546  glite-yaim-torque-utils 4.0.2-1 for gLite 3.1    5 - Normal
           #1552  lcg-info-dynamic-software                        5 - Normal
           #1389  R3.1/SLC4/i386: GFAL and lcg_util update         7 - High
* Production repository
   - gLite 3.0
      .. 3.0.2 Update 37 in preparation:
           #1368  R3.0/SLC3: FTS cancellation (3.0.2 PPS Update 41)
           #1433  Updated Torque (2.1.9-4) and Maui (3.2.6p19-4) (3.0.2 PPS Update 41)
           #1498  MySQL server/client update (3.0.2 PPS Update 42)
           #1499  MySQL server update (3.0.2 PPS Update 42)
      .. Next set of patches scheduled for release to production: None.
   - gLite 3.1
      .. Next set of patches to be released (no release date planned yet)
           #1255  JobWrapper tests - new version with no R-GMA dependencies (3.1.0 PPS Update 08)

gLite 3.x Certification & Testing:
----------------------------------
* Certification: patches certified:
    #1389: R3.1/SLC4/i386: GFAL and lcg_util update
    #1369: SLC3/i386/R3.0 DPM/LFC 1.6.7-2
    #1370: R3.1/SLC4/i386 DPM/LFC 1.6.7-1
    #1501: glite-VOMS_oracle metapackage for gLite 3.1 and SL(C)4
    #1512: 3.1 VOBOX
    #1516: glite-yaim-core 4.0.3 for the 3.1 repository
    #1521: Updated glite-info-templates
    #1531: Updated glite-info-generic
    #1540: glite-VOMS_mysql metapackage for gLite 3.1 and SL(C)4
    #1544: patch for bug 29600
    #1545: glite-yaim-lcg-ce 4.0.2-1 for gLite 3.1
    #1546: glite-yaim-torque-utils 4.0.2-1 for gLite 3.1
    #1552: lcg-info-dynamic-software
  Note that the DPM/LFC on SL3 needs the successful certification of patch
  #1555 (voms 1.7.24 + gSOAP 2.7) before it can be released.
* gLite 3.1 / SL4
   - 32bit
      .. glite-VOBOX certified and scheduled for next PPS release.
      .. dcache for glite 3.1 is now in certification (on 32bit).
   - 64bit
      ..
         Writing tests to check 32 bit compliance of 64 bit WN
* Configuration
  In response to continual requests to expose new Glue variables in yaim, a
  new approach to the yaim/infosys interface is being explored.

Grid User Support:
------------------
* The new release of the GGUS portal was deployed on 2007-11-29.
  Release notes in: https://gus.fzk.de/pages/owl.php
* A talk on LHC VO User Support was given on 2007-11-28 in the framework of
  the WLCG Service Reliability Workshop. Slides in:
  http://indico.cern.ch/materialDisplay.py?contribId=43&sessionId=2&materialId=slides&confId=20080
* Individual meetings take place with each experiment, starting with Alice
  on 2007-11-29. Agenda in:
  http://indico.cern.ch/conferenceDisplay.py?confId=10379

ETICS:
------
The new version of the ETICS system (2.0) was deployed yesterday, Wed 28
November. This version introduces redesigned web tools and new features
requested by users over the past several months. The upgrade of the services
did not present any problem. However, this release introduces a number of
changes in the behavior of the system (agreed with the users), which will
require some changes in existing configurations and build scripts. In
addition, some of the new features are quite complex and, although they have
been tested for almost two months, we may expect some issues to arise. We
foresee a transition period of a few days to sort out any outstanding
problems.

On the administrative side, we are now preparing the final deliverables for
the project and have started organizing the final review, which should take
place on February 15th, 2008.

OMII-Europe
----------------
Nothing to report

Grid Operational Security:
--------------------------
A new SAM security test revealed that different VOs are using incorrect file
permissions at several sites, including CERN. Several causes have been
identified, and they are all being followed up.
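
For sites that want to look at their own areas while the follow-up is in
progress, a check of this general kind can be sketched as below. This is only
an illustration and not the SAM test itself: the report does not say which
permissions were found to be incorrect, so the criterion used here (regular
files writable by group or others) is an assumption, and the function name
find_loose_permissions is hypothetical.

```python
import os
import stat


def find_loose_permissions(root):
    """Return sorted (path, mode string) pairs for regular files under root
    that are writable by group or others.

    The "group/other writable" criterion is an assumption made for this
    sketch; adapt it to whatever the local security policy requires.
    """
    loose = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if not stat.S_ISREG(mode):
                continue  # skip symlinks, sockets, device files, etc.
            if mode & (stat.S_IWGRP | stat.S_IWOTH):
                loose.append((path, stat.filemode(mode)))
    return sorted(loose)
```

Calling find_loose_permissions() on a directory of interest returns the
offending paths together with their mode strings (in "ls -l" style), which
can then be printed or passed on to the responsible VO.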

Grid Authentication & Authorization Services:
--------------------------------------------
Nothing to report about authentication and authorization services.

The full GD report can be consulted on:
---------------------------------------
https://twiki.cern.ch/twiki/pub/LCG/GDC5Reports

---Zdenek