LCG Management Board


LCG Management Board at BNL
Tuesday 5 September from 9:00 to 11:00 (15:00 to 17:00 CERN time)




(Version 2 - 15.9.2006)


A.Aimar (notes), D.Barberis, L.Bauerdick, L.Betev, K.Bos, N.Brook, T.Cass, Ph.Charpentier, D.Foster, B.Gibbard, J.Gordon, F.Hernandez, J.Knobloch, R.Jones, M.Lamanna, H.Marten, P.Mato, M.Mazzucato, G.Merino, B.Panzer, H.Renshall, L.Robertson (chair), M.Schulz, R.Tafirout, J.Templon

Action List

Next Meeting:

Tuesday 12 September at 16:00 as usual

1.      Minutes and Matters arising (minutes)


1.1         Minutes of Previous Meeting

MB Minutes Page

Comment from F.Hernandez: the minutes have been modified accordingly (changes in blue).


2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

  • 30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting in agreement with the security policy working group, independently of the tools and technology used to implement it.

J.Gordon will present some use cases at the GDB in September.

  • 31 Jul 06 - Sites and experiments should check and update the data in the "Contact Page" page below.

From J.Shiers: The contacts page on the SC Wiki has been used for this purpose for >1 year and is regularly updated.

Sites and Experiments should check it before end of the week (10 September).


  • 31 Jul 06 - Sites should exchange more information about monitoring, alarming and 24x7 support in the framework of HEPIX.

Done. Will be associated with the HEPIX meeting.

I.Bird proposed to HEPIX the organization of a workshop during next HEPIX meeting in October. For the moment it is not clear whether it will be possible. More news from I.Bird in the coming weeks.

J. Shiers notes that the SC Tech Day, to be held at CERN on September 15th 2006, has a session devoted to this issue.

  • 31 Jul 06 - Experiments should express what they really need in terms of interoperability between EGEE and OSG. Experiments agreed to send information to J.Shiers.

Not done. Is the current situation satisfactory, so that this action is no longer needed? One more week for the experiments to answer.

  • 15 Aug 06 - Sites should check the values of the available and required resources and confirm them to H.Renshall. Provide explanations when there are fewer resources available than required.

Not complete. One more week then we consider the values agreed by sites and experiments.

  • 31 Aug 06 - Gonzalo Merino - to see if one of the Spanish Tier-2 federations will give a report on their status and difficulties at the Comprehensive Review.
  • 1 Sep 06 - Bernd Panzer - to organise a meeting with the experiment coordinators to review the effects of the revised estimates on the costs of the CERN facility.


1.      SC4 Status and Sites Reliability (more information, transparencies) - H.Renshall


H.Renshall provided a summary of the status of SC4 for each of the experiments.


Most information is from ATLAS, which has already made a post-mortem of the July-August run. All experiments run more or less continuous Monte Carlo production, typically with 50% coming from Tier 2 sites. All collect data at CERN for re-use in export and reconstruction tests.

1.1         ALICE

ALICE are using FTD (their File Transfer Daemon) over FTS to drive Tier 0 to Tier 1 exports at the full nominal rates expected during heavy ion running (i.e. HI rates spread out over the expected 4 months of the machine shutdown – 300MB/s out of CERN).


-          Started end July but still not ramped up to steady state above 50 MB/s

-          Tape end points were established

-          Had early problems to harmonise Alien versions at participating sites (fixed during early August)

-          Optimisations in the interaction between FTD and FTS were required as initially the failure of a single site could block transfers to all sites (solved mid-August)


-          Only 5 Tier 1 sites, which include the currently problematic CNAF, FZK and RAL (the other sites are IN2P3 and SARA).

-          ALICE also worried about apparent limited tape space quota at RAL (no automatic tape recycling)


-          Maximum rate achieved so far was 150 MB/s for a short period

-          Intend to carry on till 300 MB/s rate reached for a sufficiently long period.


-          Resolving daily problems as they arise but see erratic response to GGUS tickets.


L.Robertson asked whether only RAL is not recycling the tapes for ALICE. L.Betev confirmed that the other sites have a schema for tape recycling. J.Gordon noted that RAL does not know which tapes can be recycled or not. L.Betev explained that on other sites there is one SRM endpoint for data to store and a different endpoint for data that can be erased by the sites (and the tapes recycled). RAL and ALICE should clarify how to do tape recycling.


1.2         ATLAS SC Status

Planned from 19 June till 7 July to send a total of 772 MB/s from Tier 0 to the ATLAS Tier 1 sites: "Raw" (at 320 MB/s), ESD (at 252 MB/s) and AOD (at 200 MB/s), a total of 90K files per day.
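As a quick consistency check on these figures, the three stream rates can be summed and combined with the quoted 90K files per day. The following is a minimal Python sketch; the daily volume and the average file size are derived values (assuming decimal units and a rate sustained uniformly over 24 hours), not numbers given in the presentation.

```python
# Planned ATLAS Tier 0 -> Tier 1 export rates in MB/s, as quoted in the minutes.
rates_mb_s = {"RAW": 320, "ESD": 252, "AOD": 200}

total_mb_s = sum(rates_mb_s.values())

# Assumption: the total rate is sustained uniformly over a full day.
SECONDS_PER_DAY = 24 * 60 * 60
FILES_PER_DAY = 90_000

# Derived quantities (decimal units: 1 TB = 1,000,000 MB).
daily_volume_tb = total_mb_s * SECONDS_PER_DAY / 1_000_000
avg_file_size_mb = total_mb_s * SECONDS_PER_DAY / FILES_PER_DAY

print(f"total rate: {total_mb_s} MB/s")
print(f"daily volume: {daily_volume_tb:.1f} TB")
print(f"implied average file size: {avg_file_size_mb:.0f} MB")
```

Under these assumptions the three streams add up to the quoted 772 MB/s, which corresponds to roughly 67 TB per day.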


Organised as two separate activities by Atlas with separate but overlapping teams and overlapping in time for a few weeks. Aiming for full experiment data rates.


The two separate activities are:

-          Tier 0 part to stream in simulated raw data into a large disk pool (t0perm) and ‘process’ this data in a dedicated LSF cluster to create ESD and AOD back into the disk pool for migration to tape and export to Tier 1. Wanted one full week of stable running. Real first-pass software not used.

-          Distributed Data management part to store data to tape at Tier 0, distribute raw, esd and aod to Tier 1 (raw to tape storage) and selected Tier 1s to distribute aod to Tier 2. Use Atlas DQ2 and DDM software. Integrate this with the Tier 0 part then continue data export part (no CERN LSF processing) permanently.


Complete operation to be repeated from 18 September for 2-3 weeks (one week ramp-up plus two week running).


Slide 4 shows a snapshot of the Tier-0 dataflow with the values from the Lemon monitoring:

-          Achieved Event Farm input to Castor (333 MB/s)

-          Achieved Castor to Tape (420 MB/s)

-          Achieved from Castor to CPU Farm (315 MB/s)

-          Achieved CPU Farm to Castor buffer (136 MB/s)

-          Reached half the throughput between Castor and the Tier-1 sites (350, instead of 720 MB/s)


1.3         ATLAS Tier-0 Post Mortem


-          The DDM/LFC operations were decoupled leaving dependencies only on CASTOR, LSF and AFS.

-          AFS unexpectedly remained the bottleneck to reach the nominal rates


-          CASTOR exceeded the goal of 1 week of stable operation but with a pool 2-times over-dimensioned and ATLAS wasted time trying to understand its performance.

-          Communication with the CASTOR team did not work well, neither via Remedy nor personal emails.
ATLAS expected special attention while CASTOR committed to ‘standard’ support.


-          Suffered from building 513 cooling problems and slow recovery from CERN-wide power cut.


Overall LSF performed worse than in previous test – think a dedicated instance for first pass processing might be needed.


AFS is now a primary bottleneck and cause of job failures. Being followed up (e.g. looking at volume replication).


Ph.Charpentier also expressed worries about the AFS performance and about the fact that AFS is still not accessible via the grid. H.Renshall answered that work is going on in order to add replication and grid access to AFS at CERN.


ATLAS would prefer a single communications channel into IT for Tier 0 operations


T.Cass noted that many issues on the slides are mostly solved and more information will be distributed to the MB.

1.4         ATLAS DDM Post Mortem

Achieved many objectives: Full scale exercise with realistic data sizes and complete flow; included all T1 from first day; included 15 T2 sites by the end of the second week; maximum export rate reached 700 MB/s (nominal was 780 MB/s including NDGF)


Problems encountered:

-          Intermittent site SE problems found by ATLAS rather than by the sites – no constant monitoring of SE health by sites

-          Only had all 9 sites active at once for a few hours over the 4 weeks

-          Same people at sites seem to be doing development, operations and coordination and no constant follow up to problems

-          Weak LFC service by the sites, often unclear LFC service status

-          Memory leaks and other overflow conditions found in ATLAS software (fixed)

-          Some interference by non-SC4 activities (e.g. failing FTS transfers)

-          Throughput per stream per site varies a lot

-          Fast changing CASTOR configuration (for the Tier 0 tests) caused export downtimes and non-stable data flow

1.5         ATLAS DDM Summary per site (from M.Branco)


ASGC: after VO BOX upgrade, went very well. 100 MB/s when ATLAS runs; 40~50 MB/s when CMS runs (should be 60 MB/s); communication problems during start-up of exercise


BNL: not using realistic tape area; suffering from read/write contention when using ‘production’ areas (as opposed to SC4 /dev/null area); very good support for ATLAS


CNAF: unstable Castor-1; now fighting Castor-2 installations. Needs re-evaluation during next phase


LYON: very good service T0->T1 and T1->T2!


FZK: after VO BOX upgrade, went better. Still very unstable service (in/out of the exercise all the time)


PIC: stable service; dCache disk area and Castor tape area occasionally suffering some timeouts/overload issues


RAL: not stable; difficult to understand status; could not sustain rate for a few hours


SARA: very stable service overall


TRIUMF: remains stable; network distance leads to occasional LFC connection glitches



ALL SITES: must monitor their service proactively, as no site except LYON was constantly part of the exercise (excluding scheduled downtimes)


J.Templon noted that it is very difficult to monitor the sites with the tools currently available.

B.Gibbard said that the issue was not known to BNL as such and more information is needed.

L.Robertson recommended that Post Mortem reports must include the “status at the end” in order to see what has been fixed and what is still an issue.

1.6         ATLAS DDM Conclusions and Next Objectives

Goal not achieved but limitations understood.


The major issues preventing success are persistent, intermittent instabilities in the sites’ storage. ATLAS would be in good shape for a re-run provided all parties pay due attention.


ATLAS want to clear some doubts on the data management architecture for data taking

-          e.g. usage of FTS VO agents, role of LFC, role of LCG Information System

-          Agree on “final” Tier-0 CASTOR architecture for export


They are concerned with (serious) lack of manpower for ATLAS distributed operations


Next run at same rates ramp-up from 18 September then 2 weeks with:

-          More stable running

-          Improved LFC service

-          Lighter DQ2 site services (ATLAS Software)


1.7         CMS SC Status

The key activity this year will be CSA06 in October and the first half of November (with other earlier activities of preparing/validating).


-          July: debugging data flow in and out of CERN using Phedex over FTS – target of 150 MB/s out of CERN reached for a few days. Slow transfers into CERN due to use of loopback interface for SRM and also incorrect handling of duplicate name server entries

-          August: activity included an attempt to transfer disk-to-disk out of CERN at 500MB/s for a week, minimum threshold 300 MB/s for 3 days

        Started badly with CERN-wide power cut

        Achieved over 300 MB/s for 6 days

        stability affected by out-of-context CASTOR stager query reply to Phedex triggering pre-stages (work-around rapidly found)

-          Transfers of MC events for CSA06 into CERN going on in parallel

        fraction of transfers running very slowly traced to bad HTAR interface

        Aggregation of OSG data through FNAL was delayed

        All data now at CERN

-          Simulation of raw data transfer at 25% of nominal rate out of CERN to tape at Tier 1 is now being restarted

        Aggregate rate of 150 MB/s out of CERN

        To be stable for at least one week

-          In parallel debugging/tuning of gLite RB to be used in CSA06 has been going on following early CMS tests. Concerns about poor Raid array performance (timeouts writing Logging and bookkeeping).


-          CSA06 includes running 50000 grid jobs per day, splitting raw data into event streams for export to Tier 1 and prompt reconstruction at CERN for shipping to Tier 1. Emphasis on reliability and stability.


L.Bauerdick added that CMS wants the sites to be used for analysis as well, with CMS users submitting jobs.

1.8         LHCb SC Status

Main activity is termed DC06 from end July to end September or when 25 M events each of high and low luminosity will have been completely reprocessed. This means that:

-          Simulated data is shipped to the 6 Tier 1 (and CERN) depending on site resources but typically at 3 MB/s per site

-          NIKHEF only being used for MC event generation until ROOT/dCache access across the SARA/NIKHEF firewall is fixed


CERN ran smoothly for first month

-          MC event aggregation suffered from HTAR problem

-          Would like transparent access for batch software installation to AFS (not usable in a transparent way for now)

-          Batch job flow suffered from unstable publication of WMS availability (now much improved)


RAL running smoothly – slow data access fixed by adding capacity


IN2P3 running smoothly but also have unstable information system

-          Using disk-only storage because of gsidcap issues

-          Longest queue doesn’t fit with simulation jobs


Update: The IN2P3 queue time limits have been increased.


PIC running its share. Some issues with storage and recent problem with DIRAC pilot jobs failing to pick up real jobs.


CNAF potentially largest resource but has long standing problems migrating to CASTOR2.

-          Shared CASTOR instance among experiments overloaded (now fixed)

-          Single disk server for LHCB not enough (now fixed)

-          Durable disk pool (no garbage collector) becomes full (now fixed). Not filled by LHCb but by some other VO.


FZK poor usage for reconstruction jobs so mainly Monte Carlo

-          Main problem seems to be gridftp daemons that close their sockets (under investigation)


NIKHEF/SARA only used for MC event generation so far.

-          Patched dCache client (no call back across firewall needed) is under test with good help being given by local site administrators



A discussion started about using a “pull model” instead of a “push model” for the job execution at the sites. No conclusions were reached.

1.9         General Observations until Now

-          Not able to demonstrate full nominal Tier0-Tier1 transfer rates (1.6GB/s) over extended periods, let alone recovery rates (targeted at twice nominal).

-          Experiment-driven data transfers (ATLAS and also CMS) achieved rates close to the target of full nominal rates for a single experiment (about half of the total rate for all experiments) under much more realistic conditions than for previous DTEAM transfers. For this reason, this is considered a positive result.

-          Sites appear to be able to focus their full attention on a specific experiment or challenge for a few days only. Indicative of the high workload at the sites.

-          Several sites appear to suffer from significant manpower shortages, which impact both the service level that they are able to provide and the response time to requests (both “setup” and problem resolution). This is exacerbated during the holiday months.

1.10     Some Recommendations and Actions

-          Streamlining of reporting to the weekly combined operations meeting – now to be held on Wednesdays at 15:00 Geneva time – and the various LCG coordination meetings (LCG Resource Scheduling Meeting Mondays at 15:00, LCG Service Coordination Meeting Wednesdays at 10:00) has been put in place.


N.Brook asked that a clear mandate for each meeting should be written and discussed at the MB.



15 Sep 2006 - H.Renshall and J.Shiers will distribute to the MB a summary with the mandate, input, output and participants of the OPS, SCM, RSM meetings.


-          Site monitoring of local services still needs further improvement. Sites are encouraged to share their experiences during the SC technical day at CERN on 15 September.

-          A WLCG “Service Dashboard”, allowing the status of critical components to be seen clearly, should be implemented as soon as possible to replace the laborious manual expert intervention – typically scanning log files – that is currently required.


Would be good to have a presentation of ARDA work in this area during the SC technical day.


-          A rotating ‘Service coordinator on Duty’ full-time during LHC runs should be established. This could be staffed across Tier 0 and Tier 1 sites typically for 4 weeks each during a year.


-          Sites of particular concern in the last few months include NDGF (not able to participate in this phase of SC4); FZK (unstable service, particularly during ATLAS’ activity); CNAF (unstable service during CASTOR2 migration – hopefully now improved); RAL (effectively unable to participate from a certain point due to disk controller problems described in the LCG Q2 report).


-          A regular (quarterly?) WLCG Service Coordination meeting, where the Tier0 and all Tier1+Tier2 federations as well as the experiments are represented, should be established. This should review the services delivered by each federation, the main issues encountered and the plans to resolve them, and take a longer forward look at experiment plans (to be discussed in the SC Technical meeting).


-          The importance of adequate preparation by the experiments has been clearly demonstrated, as has the need for constant background activity to exercise the global WLCG services.



15 Sep 2006 - The MB should send feedback about the proposal to H.Renshall.


L.Robertson proposed that in the SC Tech meeting when discussing the Post Mortems it should be made clear which problems and issues have been fixed and which are still outstanding.


2.      Summary of Reliability Issues - (more information, transparencies) - L.Robertson

SAM Reports Page containing the SAM Data and the Sites Analysis. Link


L.Robertson summarised the reliability issues observed in the last quarters.


2.1         Summary

The SC4 Reliability Targets were:

-          8 Tier-1s and 20 Tier-2s must have demonstrated availability better than 90% of the levels specified in the MoU [adjusted for sites that do not provide a 24 hour service] (reliability measure)

-          Success rate of standard application test jobs greater than 90% (excluding failures due to the applications environment and non-availability of sites)


Slide 3 shows the result of August 2006:

-          All sites assumed up while SAM had problems (1, 3, 4 August)

-          The average was 74% for all sites

-          For the best 8 sites the reliability was 85%

-          Including the scheduled downtime, the difference is only 1%

-          3 sites exceeded the reliability targets (CERN, IN2P3, ASGC)

-          3 sites within 90% of the target (SARA, TRIUMF, PIC)


Slide 4 shows the results May to August for each site and shows little improvement over time.
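The distinction between availability and reliability used in these figures can be illustrated with a short calculation. This is a minimal Python sketch under stated assumptions: availability is taken as uptime over the whole period, and reliability as uptime over the period excluding scheduled downtime. The site hours and the target value are hypothetical and are not taken from the slides.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the whole period during which the site passed the tests."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """As availability, but scheduled downtime is not counted against the site."""
    return up_hours / (total_hours - scheduled_down_hours)


def meets(value: float, target: float, fraction: float = 1.0) -> bool:
    """True if value reaches the given fraction of the target."""
    return value >= fraction * target


# Hypothetical site figures for a 31-day month (not taken from the slides).
HOURS_IN_MONTH = 31 * 24
up_hours = 632.0
scheduled_down_hours = 8.0
HYPOTHETICAL_TARGET = 0.88  # assumed target, for illustration only

a = availability(up_hours, HOURS_IN_MONTH)
r = reliability(up_hours, HOURS_IN_MONTH, scheduled_down_hours)

print(f"availability: {a:.1%}, reliability: {r:.1%}")
print("meets target:", meets(r, HYPOTHETICAL_TARGET))
print("within 90% of target:", meets(r, HYPOTHETICAL_TARGET, fraction=0.9))
```

With little scheduled downtime the two measures differ by about one percentage point, which is consistent with the small difference reported for August.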

2.2         Issues

-          The current tests were intended to be the first set of tests, easy for a Tier-1 site, only testing the very basic set of services

-          Then we would add tests of the other baseline services, and more realistic VO specific tests

-          And then we would start testing the major Tier-2s

-          Started well, but the situation has stagnated


-          Many of the tests for other services have not yet been written (despite commitments in the spring)

-          Only measuring half of the NL-Tier-1, and have no data yet from BNL (problem of integration with SAM now solved?) and NDGF


-          Maybe the target is too high?

2.3         July-August failures and comments

For the 6 sites that gave enough information to make an estimate, and after removing the known periods when SAM failed:

-          One third of the down time attributed by the sites to SAM

-          Two thirds real problems


Significant down time for some sites due to issues with Castor (CNAF) and dCache (FNAL, RAL) – that will go away when these problems are solved


CE overload and co-location of CE and BDII

-          a major cause of what sites see as spurious problems

-          but this is a known problem – why no previous action to separate these?


Cooling - a problem during the summer. Does everyone have this under control for 2007/8/9?


The difference between availability and reliability is only ~1%, and in some cases downtime is scheduled quickly once the problem is discovered.

2.4         SAM Issues

One of the issues is that SE tests run as dteam VO:

-          Sites consider them unrealistic, low priority for site to fix

-          VO-specific tests would get more attention


Some SE tests fail because of (possible) problems at CERN


The SAM System is being developed and for the moment it is:

-          Hard to correlate test failures, compare with local logs, difficult to retrieve test history, archive (~6 weeks) too short. One of the reasons could also be that the sites look at the SAM test results only when the reports are published.


R.Tafirout reported in detail problems of correlation between SAM failures and site logs at TRIUMF.
G.Merino noted that in the case of PIC there seemed to be a shift of one day between SAM detected problems and problems recorded at the site, but, apart from that, the PIC logs and the SAM results matched well.


-          SAM daily calculation is seen as inaccurate because of the low frequency of the tests and of the lack of test retries when failures occur. (should test again after short delay to avoid exaggerating transient problems).

-          SAM system failures are not automatically recognised and removed from the statistics
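The suggestion to test again after a short delay could be sketched as follows. This is an illustrative Python sketch only, not the SAM implementation: the function name, the retry count and the delay are hypothetical, and the test itself is passed in as a callable.

```python
import time
from typing import Callable


def run_with_retry(test: Callable[[], bool],
                   retries: int = 2,
                   delay_s: float = 300.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if the test passes on the first run or on any retry.

    A failure is only reported when the first run and all retries fail,
    so that a short transient problem does not count as site downtime.
    """
    for attempt in range(retries + 1):
        if test():
            return True
        if attempt < retries:
            sleep(delay_s)  # wait before trying again
    return False


# Example: a test that fails once and then succeeds (a transient problem).
outcomes = iter([False, True])
passed = run_with_retry(lambda: next(outcomes), sleep=lambda seconds: None)
print("test passed:", passed)
```

The sleep function is injectable so that the logic can be exercised without real delays. In a real deployment the delay would have to be chosen against the test frequency.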


We need to look into other specific cases rapidly after a discrepancy is detected.


M.Schulz noted that VO specific tests can already be added to SAM and that this would help to increase the attention of the sites. He also noted that the failure of SAM tests (when not due to a problem of the tests) often shows real problems at the sites, and the results should therefore be monitored regularly.



M.Lamanna added that the initial analysis of the job failures shows that the error causes are often discovered by the SAM tests. More results will be published in the near future.

2.5         General Comments

L.Robertson made these general comments (slide 8):

-          impression that some sites are looking at these test results only when triggered by the MB

-          Need to integrate the grid view of the site into the local site monitoring

-          Local site management should be looking at the grid view of the site as well as at more local metrics


-          While sites see that they are running MC production well, despite indifferent test results –
-- there does seem to be a correlation with the more realistic Tier-0/Tier-1 exercises now going on (e.g. ATLAS)
-- we need to have a good set of grid metrics to cover the central and increasingly complex role of the Tier-1 in the overall LHC operation


-          Summer coverage very thin at some sites
-- the first major run will start in the summer of 2008
-- better feed that into holiday planning now

2.6         Next Steps

Improve the test suite

-          Add the VO specific tests

-          Additional basic service tests – but few of the tests promised in the spring have been delivered

-          Fix or eliminate tests that have deficiencies or are of no real value


Add job reliability measures

-          ATLAS and CMS dashboard measures?

-          CE job wrapper monitoring


Increase visibility – web accessible site dashboard


Improve the SAM framework


Extend to Tier-2s

Re-assess the target MoU values



The MB discussed whether there should be a GDB Working Group to rapidly agree on improvements but I.Bird added that an overall proposal should include also the aspects of monitoring, deployment and middleware.


I.Bird volunteered to coordinate and write the proposal of the overall strategy about sites testing and monitoring.


3.      xrootd interface to DPM - proposed planning - (transparencies) - Ian Bird


3.1         Implementation

It is important to note that this proposal covers only single-site data access; inter-site access is not included.


The work to be done includes:

-          xrootd “redirector”

XMI plugin to be written to talk to the DPM service

This has to have a GSI credential (proxy) in order to talk to DPM; configure this to use an ALICE user proxy coming from the VO Box.

For a more general (non-ALICE) implementation the redirector would need a GSI plugin and code to pass the proxy across to the XMI plugin. This would be the responsibility of xrootd developers

-          xrootd on the disk server nodes:

Need to write the OSS plugin to do the I/O to the disk:

- This must talk to DPM service to check file ownership and get location, or to create a new file

- Here it is a trusted client and does not need a proxy: but must specify a valid user/group


NB. This user/group must be consistent to allow file access via other mechanisms (e.g. SRM, GridFTP, etc.)

3.2         ALICE-specific Solution

The work proposed at present will be ALICE-specific because of the ALICE security model


To be more general and use GSI credentials additional work would be required.


-          Needs a suitable GSI authn and authz plugin

-          Pass through of the proxy to the XMI plugin

Pool node:

-          Needs a GSI authn and authz plugin and pass through to the:

-          OSS plugin to pick up the proxy


Note: The xrootd client would need to send the proxy with the request

Note: Most of this work would be the responsibility of the xrootd developers, apart from the DPM-specific plugins.

3.3         Timeline

xrootd implementation in DPM for ALICE:

-          Estimate 6 weeks, work is starting now (David Smith at 50% of his time)

-          Mid-October:
A prototype that is tested for functionality, but not performance
Ready for wider deployment testing at a few sites


A more general version using GSI would take longer and require effort from both the xrootd and DPM teams in addition to what is proposed here.


L.Betev noted that the security is managed by ALICE and is not needed by DPM. I.Bird stressed that DPM access is secure and so, even in the case of an ALICE-only implementation, an appropriate GSI proxy must be provided. Other members expressed worries that there was a risk that the agreed security model would be broken, and that the go-ahead should not be given until this is better understood. L.Robertson summarised that we should come back when the full implications have been reviewed by the developers and someone from the security team. This could be reported to MB already next week.


4.      Decision on Accounting Data for the C-RRB - (document, tables, transparencies ) - L.Robertson


L.Robertson asked for decisions on what will be presented to the C-RRB.


Period – April through August?

August could be included because there is time for doing the calculation.


CPU or Wall Clock?

L.Robertson said that reporting both would create more confusion than clarification.

J.Gordon and T.Cass said that using CPU time would show the usage (or lack of it) of CPU resources and efficiency of VO jobs.


J.Templon instead was more in favour of providing Wall Clock information.

H.Marten said that in FZK Wall Clock time is measured because this is what is available (and sold) to the CC users.


G.Merino said that efficiency values are useful in order to tune and configure correctly the computers in the farm.


I.Bird mentioned the fact that CPU time is what is requested by the C-RRB. But Wall Clock time is useful too and should anyway be collected, just not reported to the C-RRB.


Breakdown by experiment?

L.Robertson proposed to show the resources usage by VO.

L.Bauerdick recommended that the CPU efficiency of 85% be explained to the RRB, who should NOT expect 100% utilization. In addition, during commissioning of the system the utilization can be even lower.


Comparison with capability of installed capacity?

Comparison with the MoU pledge for 2006?


Slide 2 shows that about 60% of the CPU installed is used, and 40% of the CPU pledged.


The 85% efficiency factor is already included in the utilization calculation, and so 100% utilization, or even higher, is achievable.
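To make the effect of the efficiency factor concrete, here is a minimal Python sketch. It assumes that utilization is computed as CPU time used divided by the capacity scaled by the 85% factor. The exact formula used for the C-RRB tables is not given in these minutes, and the capacity figures below are hypothetical.

```python
EFFICIENCY_FACTOR = 0.85  # efficiency factor quoted in the minutes


def utilization(cpu_used: float, capacity: float,
                efficiency: float = EFFICIENCY_FACTOR) -> float:
    """CPU used as a fraction of the capacity scaled by the efficiency factor.

    Assumed formula, for illustration: used / (capacity * efficiency).
    """
    return cpu_used / (capacity * efficiency)


# Hypothetical figures in arbitrary CPU units (e.g. kSI2K-days).
capacity = 1000.0

for cpu_used in (510.0, 850.0, 900.0):
    print(f"used {cpu_used:6.1f} of {capacity:.0f}: "
          f"utilization {utilization(cpu_used, capacity):.0%}")
```

Under this assumption a site whose jobs consume 85% of the raw capacity is reported at 100% utilization, and anything above that appears as more than 100%.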


H.Marten asked how the value of efficiency was defined. L.Bauerdick said that the numbers come from previous experiments and that 100% would just be a target that would not be met. Therefore a realistic target was fixed at the time of the preparation of the TDRs.


T.Cass noted that experiments recognize that they can only use 85% of the CPU. Therefore sites should only worry about efficiency when the utilization level falls much below this.


Note that RAL achieves 257% utilization because there are resources installed for other experiments that can be used by LCG jobs if they are idle.


It was agreed that the usage by VO could be shown for information to the C-RRB, as long as this is put into the context of the current test and commissioning phase, far from steady running.



11 Sep 2006 - Les Robertson will propose to the Overview Board to show to the C-RRB the cpu, storage utilization, installed capacity, perhaps the pledge, with appropriate caveats, along with summary usage information by VO.


5.      AOB


5.1         Summary of the presentations at the GDB on 24x7 Support (from L.Robertson)



  • The campus services provide an 08:00 to 18:00 service only, not easily extendable to the needs of GridKA
  • invested effort in failover and redundancy design, with automatic error detection and restart
  • use Nagios for system monitoring, linked to an alarm system using SMS on GSM phones
  • service desk integrated with GGUS - only manned during working hours
  • operators use a common central documentation system
  • existing on-call service on a 24 X 7 basis for physical infrastructure problems (electricity, cooling, smoke, ..) - which interacts with the site services
  • network has already an on-call 24 X 7 service operated by an external service provider
  • Plans -
    • train operators to take over day to day functions of the sys admin experts
    • extend operation to weekends - remote access and reasonable compensation, but this has to be agreed by the Betriebsrat and the GridKA overview board - to be proposed/agreed in November.
  • the coverage overnight is not yet clear - but 24 hour callout is already used for other services
  • Convert the SAM/SFT results into Nagios message to be integrated into the alarm system
  • The recently announced fusion of the FZK Institute for Scientific Computing and University of Karlsruhe Computing Centre may influence the 24 X 7 methodology, and may impact the timescale for its introduction.
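The planned conversion of SAM/SFT results into Nagios messages could look roughly like the sketch below, which formats a test result as a Nagios passive service check (the `PROCESS_SERVICE_CHECK_RESULT` external command). The SAM status names, the host name and the service name are hypothetical; only the Nagios command syntax and return codes are standard.

```python
import time
from typing import Optional

# Hypothetical SAM/SFT status names mapped to Nagios return codes
# (0 = OK, 1 = WARNING, 2 = CRITICAL).
NAGIOS_CODES = {"ok": 0, "warn": 1, "error": 2}
NAGIOS_UNKNOWN = 3


def to_nagios_command(host: str, service: str, status: str,
                      summary: str, timestamp: Optional[int] = None) -> str:
    """Format one test result as a Nagios passive service check command."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    code = NAGIOS_CODES.get(status, NAGIOS_UNKNOWN)
    # Semicolons separate fields and newlines end the command, so remove them.
    clean = summary.replace(";", ",").replace("\n", " ")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{clean}"


# Hypothetical example result.
line = to_nagios_command("ce01.example.org", "SAM-CE-job-submit",
                         "error", "job submission failed",
                         timestamp=1157450400)
print(line)
```

A line of this form would typically be appended to the Nagios external command file, so that a failed grid test raises the same SMS alarm as a local failure.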


  • Physical infrastructure is covered by on-site emergency generators, 24 X 7 network operator (SURFnet), a 24 hour guard who calls out experts
  • Redundant servers for critical services (DNS, database services, Pnfs, dCache server)
  • Over-dimensioned tape system disk cache to cover for tape hardware failures
  • Monitoring - creating a dashboard
  • A pool of ~10 people agree to take turns at checking the monitoring information and call out experts
  • SMS alarm infrastructure in place
  • Early in 2007 a best efforts system will be put in place
  • But there is no on-call service for the basic campus services
  • Questions - what about implementing dynamic re-distribution of data


  • TRIUMF is operated 24 X 7
  • The TRIUMF control room can be called and will call out experts
  • Hardware redundancy and failover
  • Use Ganglia for hardware and system monitoring
  • System log monitoring with automated mail notification to experts - but no automatic paging
  • LAN/WAN monitored 24 X 7 with paging of expert
  • SFT tests monitored
  • dCache monitored manually
  • At present 16 hour coverage, 3 people growing to 6 dedicated to ATLAS Tier-1
  • Planning now for full coverage from January 2007, using automated paging and cell phone rotation.

General note

  • The experiments will have 24 X 7 coverage at least during the run. Although the people are likely initially to be at CERN later they may be locate


6.      Summary of New Actions




15 Sept 2006 - H.Renshall and J.Shiers will distribute to the MB a summary with the mandate, input, output and participants of the OPS, SCM, RSM meetings.



15 Sept 2006 - The MB should send feedback about the proposal to H.Renshall.



11 Sep 2006 - Les Robertson will propose to the Overview Board to show to the C-RRB the cpu, storage utilization, installed capacity, perhaps the pledge, with appropriate caveats, along with summary usage information by VO.


The full Action List, current and past items, will be in this wiki page before next MB meeting.