LCG Management Board
Tuesday 4 July 2006 at 16:00 – Face-to-face meeting at CERN
(Version 1 - 09.07.2006)
A.Aimar (notes), D.Barberis, I.Bird, K.Bos, M.Branco, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, B.Gibbard, J.Gordon, V.Guelzow, F.Hernandez, J.Knobloch, H.Marten, G.Merino, Di Quing, L.Robertson (chair), M.Schulz, J.Shiers, J.Templon
Next Meeting: Tuesday 11 July 2006 at 16:00
1. Minutes and Matters arising (minutes)
1.1 Minutes of the Previous Meeting
Received the following comments:
- H.Marten suggested to the GDB that Jos van Wezel (FZK/GridKa) be included in the group that is going to “look into the implications of the move to SRM 2 on the sites”. The list in the minutes was updated.
- RAL had expressed interest in participating in the experiments’ meetings during their SC activities; in fact RAL is probably already participating in some of those meetings. The minutes of the previous meeting were updated.
1.2 QR 2006Q2 distributed (complete by 10 July) (more information)
The QR reports for the quarter that just ended were distributed on 3 July 2006; they should be completed and mailed back to A.Aimar before 10 July 2006.
1.3 Experiments/sites meetings
The participation of the sites in the experiments’ meetings was briefly discussed again. The meetings of the sites with ATLAS were not successful due to technical problems (no phone for the conference call). Therefore, during its SC activities, ATLAS has been contacting the sites directly when necessary.
LHCb would like to organize a meeting with their Tier-1 sites in order to make sure that all requirements and operational issues are clear to all sites participating in LHCb’s challenge activities.
1.4 Other matters arising
A few other issues were discussed at the beginning of the meeting:
- The WLCG Operations meeting will move to Wednesdays starting in September.
- There is a problem for LHCb accessing data locally at SARA. This will be followed up by H.Renshall. This topic was also discussed at the GDB the following day.
2. Action List Review (list of actions)
Note: Actions in RED are due.
23 May 06 - Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers from all Tier-1 and to/from Tier-2 sites. It is not sufficient to set up the channels; the action requires confirmation via email that transfers from all Tier-1 sites and to/from the "known" Tier-2 sites have been tested.
Not done: ASGC, BNL, FNAL, NDGF and TRIUMF.
FNAL: the FTS server is set up and a few channels have been tested. A meeting with G.McCance is still needed to define the required set-up.
30 May 06 -
31 May 06 - K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities and to provide information as needed by the VOs. The goal is then to make possible a central repository to store effective tape throughput monitoring information.
13 Jun 06 - D.Liko to distribute the Job Priority WG report to the MB.
Not done yet.
27 Jun 06 - K.Bos proposes a group that will look into implications of the move to SRM 2 for Tier-1 and Tier-2 sites.
Done. The names of people that will take part in this group are:
· Jan van Eldik (CERN)
· Mark van de Sanden (SARA)
· Artem Trunov (CMS and Alice, IN2P3)
· Lionel Schwarz (IN2P3)
· Adrià Casajús (PIC)
· Jos van Wezel (FZK/GridKa)
· Ruth Pordes as a place holder for possibly Eileen Berman and/or Frank Wuerthwein (FNAL)
30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting, in agreement with the security policy working group, independently of the tools and technology used to implement them.
3. Report on the LCG Services Review (more information) - V.Guelzow
This is the initial report of the Internal LCG Services Review. A final review document will be produced before the end of July.
31 Jul 2006 - V.Guelzow and the review team produce a report with the general recommendations of the internal review. In addition specific recommendations will also be distributed to each Tier-1 site.
The reviewers of the LCG Service were:
- John Gordon (RAL)
- Volker Guelzow (DESY) Chair
- Alessandro de Salvo (INFN Rome)
- Jeff Templon (NIKHEF)
- Frank Würthwein (UCSD)
The mandate of the review is at this URL: http://www.cern.ch/lcg/documents/mb/service_review_mandate_jun06.doc
Special attention was requested on the following topics:
- “state of readiness of CERN and the Tier-1 centres, including operational procedures and expertise, 24 X 7 support, resource planning to provide the required capacity and performance, site test and validation programme;
- the essential components and services missing in SC4 and the plans to make these available in time for the initial LHC service;
- the EGEE-middleware deployment and maintenance process, including the relationship between the development and deployment teams, and the steps being taken to reduce the time taken to deploy a new release;
- the plans for testing the functionality, reliability and performance of the overall service;
- interoperability between the LCG sites in EGEE, OSG and NDGF;”
A questionnaire was sent to the experiments and the Tier-1 sites asking what their main concerns were regarding the setup of the experiments/sites, the distribution of information, middleware, and interoperability. Other input to the review came from documents of the MB, the CRRB and other sources.
The review took place on 8th and 9th June 2006 at CERN (see agenda: http://agenda.cern.ch/fullAgenda.php?ida=a062385 ):
- 1st day: 25 minutes for each Tier-1, plus time for additional questions;
- 2nd day: there were reports on the gLite/OSG roadmap, interoperability with OSG and NorduGrid, a post-mortem of the gLite3 release and a summary of the SRM/Castor review that was taking place during the same days.
The report is being written and will be ready by the end of July.
3.2 Overall Comments and Summary
One general issue is the diversity among the Tier-1 sites in terms of:
- experience and background in providing services to HEP experiments;
- the technology selected for providing the services, which varies at each site;
- funding and staffing, which are very unequal;
- the number of experiments to support, which ranges from one to all four.
Not all sites have reached the same level of readiness that is needed at the LHC start-up. The key factors of concern are the organization of off-hours services, funding issues and communication problems between sites and the experiments.
If procurement is postponed too long there is a risk that scalability problems will be discovered too late, and that some sites will encounter difficulties in an overly steep ramp-up of resources.
The associations between the Tier-1s and Tier-2s were unclear at the time of the review. This is now being clarified with the experiments in terms of data transfer. The organizational support is defined by the grid infrastructure projects to which the Tier-2 sites belong. The support from Tier-1 to Tier-2 and Tier-3 sites is not completely clear in terms of requirements from the experiments.
Experiments expressed the concern that the tests and the Service Challenge preparation are late, and that too often at the sites the equipment is dismantled after the tests and used for other site activities. From now on the Service Challenge set-up should stay in place and become the operational production facility available at the Tier-1 sites.
3.3 Specific Recommendation: Communication
Communication between experiments and sites really needs to improve:
- Clear (and redundant) contact persons (e.g. liaison officers) have to be nominated on both sides. It should always be possible to contact responsible people, even during nights, week-ends or holidays. It is important that there is always a responsible person – not just a phone number or email address.
- Clear/precise information from the experiments is needed on their plans and status, and this should be better structured and understandable to sites - not spread over several web and wiki pages.
- Web based monitoring pages covering operational issues should be made available by the experiments to the sites. In this way the sites can see whether an experiment is detecting problems with one or more sites.
- The Monday meeting should become the central operations meeting. Some Tier-1s would like more focus on resolving problems than on reporting past issues. The experiments should announce their weekly planning at the meeting. Time-zone differences can also be a problem, but better distribution and organization of information should be developed to compensate for this.
- GGUS should be used for all problem reports. It is well accepted by the users, although better email notification and user grouping are still needed.
Concerning the Monday meeting, some sites find the weekly reporting not very useful and are not interested in the discussion of specific problems at individual sites. They would like more discussion of general progress and problem-solving issues. Smaller issues should be solved by the site and the grid operators and mentioned at the meeting only if they could help other sites in similar cases; otherwise such incidents should simply be logged in the written report. From 13 September the meeting will move to Wednesdays.
The sites/experiments contact persons should be redundant in order to have someone responsible for responding to important and urgent problems at any time. Also experiments should have a single contact point in order to provide a unique liaison and a consistent view of the experiment’s requests.
Sites and experiments should define a clear structure for providing clearer information and unique liaisons and information contacts.
Experiments should provide a simple presentation of their monitoring information to the sites.
3.4 Specific Recommendation: 24 x 7
A 24 x 7 service is needed in the sense of pro-active monitoring of key services with a system for raising alarms for problems that require immediate action. Automated tools are clearly better for such purposes, but every site has different solutions.
The reviewers proposed using HEPiX as the forum in which to start an initiative to work on common solutions for monitoring, alarms and out-of-hours support.
Sites should exchange more information about monitoring, alarming and 24x7 support in the framework of HEPIX.
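The pro-active monitoring recommended above can be pictured as a simple probe-and-alarm loop. The sketch below is purely illustrative and assumes nothing about any site's actual tooling: the probe names, thresholds and notification mechanism are all invented for the example.

```python
# Illustrative sketch of pro-active service monitoring with alarms.
# Service names and probe implementations are hypothetical; a real
# site would plug in its own probes and its own alerting back-end.

def probe_all(probes):
    """Run each probe; return a list of (service, reason) alarms.

    Each probe is a callable returning (ok, reason).  A probe that
    crashes is itself treated as an alarm for that service.
    """
    alarms = []
    for service, probe in probes.items():
        try:
            ok, reason = probe()
        except Exception as exc:
            ok, reason = False, repr(exc)
        if not ok:
            alarms.append((service, reason))
    return alarms


def monitor_once(probes, notify):
    """One monitoring cycle: probe everything, notify on each alarm."""
    for service, reason in probe_all(probes):
        notify(f"ALARM {service}: {reason}")


if __name__ == "__main__":
    # Hypothetical probes: one healthy service, one stalled one.
    probes = {
        "srm": lambda: (True, "ok"),
        "fts": lambda: (False, "transfer queue stalled"),
    }
    monitor_once(probes, notify=print)
```

In practice such a loop would run continuously (e.g. from a scheduler) and `notify` would page the on-call person, which is exactly the out-of-hours scenario the reviewers want covered.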
3.5 Specific Recommendation: Management
The funding situation is not clear at every centre and should be followed up carefully.
Sites need clear and realistic requirements from the experiments in order to optimise timing of purchases.
On the other hand, sites should not postpone their purchases too long, in order to avoid an unfeasible ramp-up when the resources are needed.
Another risk pointed out by the reviewers is that at some centres critical work and duties are currently carried out by temporary staff on short term contracts.
3.6 General Comment: Middleware
The introduction of gLite 3 was not straightforward, but the situation was exacerbated in some cases by an “emotional” anticipation of problems, even before trying to install gLite 3. There were also many complaints expressed verbally but not substantiated with specific error reports through GGUS tickets.
The post-mortem presentation of the gLite3 deployment was very much appreciated by the reviewers.
The reasons perceived as the causes of the difficulties of the gLite deployment were:
- Lack of manpower at the sites, or lack of experience due to the high turnover of temporary staff
- Lack of understanding of what actually should be deployed and with which priority (e.g. which CE component: the classic CE or the new gLite CE?).
- Adaptations and localization of the installation to specific tools used at the sites took longer than expected.
- Coordination with the needs of non-LHC experiments and other VOs caused delays and conflicts in some cases.
The review team made the following recommendations to the middleware developers:
- Arriving at stable and solid middleware is the number-one priority – much more important and urgent than implementing new features or using newer technologies.
- Improving the level of testing and verification of the quality of the software should be a top priority for the developers
- The documentation is not well structured and should be more user-oriented
- Full VOMS functionality is needed by the experiments
- Error reporting by users needs to be improved.
- There is the need for better software interfaces to logging, diagnostics and service operations.
3.7 Recommendation: TCG
The current feedback loop via SA3/SA1 to JRA1 takes too long. The TCG should play a more important role in this.
Operational issues are not adequately represented in the TCG. The Tier-1 sites should be represented in order, for instance, to be able to express the urgency of providing stable middleware versus adding new features.
I.Bird noted that in the TCG there are two places for the sites but only one of them has been filled (by a representative of a Tier-2 site). Nobody from the Tier-1 sites is participating in the TCG, and the free place should be filled urgently.
Tier-1 sites should nominate their representative(s) to the TCG.
3.8 Recommendation: Interoperability
Experiments seem satisfied with the simple possibility of submitting jobs independently to each grid. Very little manpower is invested in work on interoperability at the moment. If more is needed, experiments should express what their needs are in terms of interoperability.
Experiments should express what they really need (not “nice to have” requests) in terms of interoperability.
3.9 Overall Recommendations
The reviewers summarized their recommendations and suggested:
- Repeat the review of the Tier-1s in Spring 2007
- Make clear plans for 24 x 7 support
- Check the feasibility of the ramp up at each site
- Clarify all funding and staffing situations
- Improve testing of the middleware and the way in which requests are routed via the TCG to the developers.
Regarding the review itself, it is recommended that next time one-hour sessions be planned for each site and that more time be devoted to middleware issues.
3.10 Final Comments
The review chair asked the opinion of the MB on whether the report should have a general section, with specific and confidential comments sent directly to each site. The Tier-1 sites said that they expect specific comments to be sent to them directly, and not made public.
K.Bos noted that many recommendations were previously known problems and that it would be useful to have a permanent group that checks the progress at the different Tier-1 sites. J.Gordon replied that the goal of the review was not to follow up the current situation but to focus on the readiness and preparation of the Tier-1 sites.
Ph.Charpentier noted that there were no details on the services provided by the sites. J.Gordon replied that the mandate and time constraints meant the focus was on preparation for 2008, and not, for instance, on checking that the current set-up of FTS or Oracle was working.
L.Robertson commented that another, more detailed review of a few specific issues could be held before the end of the year.
4. gLite 3 Status (transparencies) - M.Schulz
· Issues and corrective actions
· Problems at specific sites
This was a short summary on the status of the release of gLite 3.0, on the issues encountered and lessons learned.
Note: Please view the presentation in “Slide View” mode because many slides are just there as backup material and were not presented.
Slide 2 shows what is in the gLite 3.0.0 release. It has all previous components with bug fixes and improvements and two new components: the gLite-WLMS and the gLite-CE.
4.1 Plans and Release of gLite 3.0
The planning started in January (slides 8 and 9), with the release of gLite 3 scheduled to be in operation by 1 June. The execution was consistent with the plans, and on 4 May gLite was released to production and ready for staged deployment in operation. By 5 June, 8 CEs were installed and more than 50 sites had installed the gLite 3 client libraries.
Two weeks after the release (slide 10) the first bug-fix patch appeared (to fix some yaim problems). After that, 3.0.1 and 3.0.2 were certified and released to production between the 6th and the 16th of June.
It will be very important in the future that, for each upgrade, a decision is taken on how thoroughly the system should be tested before deploying it.
4.2 Current Status
Currently 94 out of 189 sites (slide 12) have upgraded to gLite 3.0.x. Only 12 sites have deployed the new gLite-CE.
Among the Tier-1 and Tier-0 sites, only 2 have installed the gLite-CE and 3 sites have the gLite-WLMS installed.
Slide 13 shows the commands to access the status of the deployment, and slide 14 the list of sites that had installed the new gLite-CE component (as of 4 July).
The discussion focused on how important the new CE is for the experiments, and on the fact that if they do not test this component it will never be debugged and improved. Another issue discussed is that the middleware developers often do not know enough about the real context in which their components are used.
4.3 Main Problems Encountered
Slide 17 shows the problems faced after the release:
- Most of the sites were unprepared, in spite of the early preparation (January 2006) of the release schedule.
- The release and upgrade notes confused many sites, which did not have a clear understanding of what they should actually install (which CE, etc.).
- The staged rollout at the Tier-1 sites did not provide any gain in time and actually held back some sites that would otherwise have installed sooner; some small and medium sites that usually install eagerly were also held back by the staged deployment.
- Site localization took about one month for a large site, in order to adapt to the local batch system and test the installation properly.
In summary (slide 21), gLite 3.0 was released on time and contained all the agreed components, but:
- The merging of the LCG and JRA1 components was not easy because of differences in the organization of the projects (CVS, build, tracker, etc.).
- The current testing does not adequately cover the system.
- The communication with the sites and the rollout need to become clearer than they have been.
- Documentation needs to be improved, but is currently provided entirely by the deployment team; the middleware team provides only the detailed API documentation, which is not useful to the sites.
The different configuration systems and build tools of the LCG and JRA1 middleware components are a problem. Wrapping the installation tools in one more layer (the “Russian doll” approach) did not simplify the task, and a new strategy will have to be defined.
5. ATLAS SC4 status (transparencies) - M.Branco
This presentation describes the status and the problems encountered by ATLAS in the first two weeks of their Service Challenge exercises.
5.1 Service Phase
The main exercise (slides 3 and 4) is coordinated by L.Goossens and involves the Tier-0, the Tier-1s and the volunteer Tier-2 sites, with the goal of reaching the nominal rate at all Tier-1 sites.
Phase 1 started on 19 June and runs for 3 weeks; a second phase will be executed in September.
The Tier-0 already achieved the nominal rates, without export to Tier-1 sites, in January. The goal now is to run at the nominal rate steadily and under extensive monitoring (slide 7). Examples of the accounting and monitoring are on slides 8 and 9.
5.2 DQ2 Characteristics
DQ2 is the ATLAS Distributed Data Management system which runs on top of the grid data transfer tools.
DQ2 allows the implementation of the basic ATLAS Computing Model concepts, as described in the Computing TDR (June 2005) such as:
- Distribution of raw and reconstructed data from CERN to the Tier-1s
- Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis
- Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing
DQ2 is based on:
- A hierarchical definition of datasets
- Central dataset catalogues
- Data blocks as units of file storage and replication
- Distributed file catalogues
- Automatic data transfer mechanisms using distributed services (dataset subscription system)
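The subscription idea in the list above can be illustrated with a toy model: a central catalogue maps datasets to files, sites subscribe to datasets, and files a subscribed site still lacks form the transfer queue. This is only a sketch of the concept, not the actual DQ2 code or API; every class and method name here is invented.

```python
# Toy model of dataset subscription in the spirit of DQ2 (illustrative
# only -- not the real DQ2 interfaces).  A central catalogue knows which
# files belong to each dataset and which files each site already holds;
# the difference for each subscription is what still has to be moved.

class DatasetCatalogue:
    def __init__(self):
        self.datasets = {}       # dataset name -> set of file GUIDs
        self.replicas = {}       # site -> set of file GUIDs held there
        self.subscriptions = []  # (site, dataset) pairs

    def add_dataset(self, name, files):
        self.datasets[name] = set(files)

    def subscribe(self, site, dataset):
        """Register a site's interest in a dataset."""
        self.subscriptions.append((site, dataset))
        self.replicas.setdefault(site, set())

    def pending_transfers(self):
        """Files each subscribed site still lacks: the transfer queue."""
        queue = []
        for site, dataset in self.subscriptions:
            missing = self.datasets[dataset] - self.replicas[site]
            for guid in sorted(missing):
                queue.append((site, guid))
        return queue


if __name__ == "__main__":
    cat = DatasetCatalogue()
    cat.add_dataset("raw.run001", ["f1", "f2", "f3"])
    cat.subscribe("Tier1-A", "raw.run001")
    cat.replicas["Tier1-A"].add("f2")   # one file already on site
    print(cat.pending_transfers())      # the remaining files to move
```

In the real system the transfer queue would be handed to FTS and the file locations resolved through the distributed file catalogues, as described above.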
The goal of the tests is to run the complete flow (Tier-0 plus export to the Tier-1s) at nominal rate, but not by providing a constant stream of data to the Tier-1s (dteam already does that). The goal is rather to provide a realistic stream of data from the Tier-0 to the Tier-1s, with failures, ramp-ups/downs, etc., in order to test the LFC and FTS ‘job logic’, bookkeeping and monitoring tools, etc.
The log is here: https://uimon.cern.ch/twiki/bin/view/Atlas/DDMSc4
5.3 Daily Progress of the Tests
The tests started on the 19th, and on the 20th they reached last year’s rates. The first problems were seen with the VOBox hardware. Some sites were stable (BNL, IN2P3, PIC, SARA, CNAF, TRIUMF) while others were unstable (RAL, ASGC, FZK). The LFC was overloaded with too many parallel requests, and the monitoring was not scaling, producing too many log/error messages (slides 14 and 15).
After the power cut at CERN on the 22nd, the system had a good and fast ramp-up (slide 17), with these few specific problems:
- ASGC not working properly
- FZK with a slow VOBox (only a dual-processor Pentium III)
- No AODs produced (200MB/s aggregate)
- RAL: occasional problems: reason not understood for now
- Achieving ~ MoU rates (but without AOD)
- Occasional file transfer problems (GGUS tickets submitted)
From the 26th to the 28th of June (slides 19 and 20) ATLAS made some software updates to DQ2 in order to activate monitoring and to solve the problem of overloading FTS with too many requests.
On 30 June there was a major Tier-0 problem caused by the LFC.
The Tier-2 tests went well with some of the sites (Lyon and CNAF). But as the Tier-1 VOBoxes also serve the Tier-2 sites, there was an even higher load on the VOBox nodes.
At the time of the MB meeting more fixes had been released, with better monitoring and automatic recovery in place (slide 23), and all Tier-1 sites were active all the time.
There were problems with a few sites, as mentioned (ASGC storage problems, FZK limited hardware, CNAF slow write to tape).
If possible ATLAS would like to continue the exercise in order to gain as much experience as possible.
There were general problems of communication between the ATLAS site contacts, the site technical contacts and the SC team.
GGUS is working, but often the replies need to be found by many people working together, and therefore chat or IRC is often more effective.
ATLAS has sustained ~500 MB/s for several hours on a few occasions. This means that:
- all components are working together (from Tier-0 to Tier-1 export);
- ATLAS is close to achieving the goal of the nominal rate.
ATLAS could reach the throughput rate by making the exercise less realistic, but this is not interesting. As a test it is much better to have a realistic, even if more problematic, flow.
The situation is considerably better than in SC3, even if it is not very stable for now. Problems are usually understood immediately when they occur, which is quite a good sign.
L.Robertson said that the request of ATLAS to continue should be analyzed, and he encouraged this. The fact that some sites will only be able to provide more limited support makes it a good exercise for testing the staff replacing the site experts. J.Shiers said that he saw no scheduling difficulties with the other experiments if ATLAS continues data distribution. This should be decided at the weekly coordination meeting.
6. Scheduled Downtime and Availability Measurements (10') ( more information )
Postponed to a future MB meeting.
8. Summary of New Actions
Sites and experiments should define a clear structure for providing clearer information and unique liaisons and information contacts.
Experiments should provide a simple presentation of their monitoring information to the sites.
Sites should exchange more information about monitoring, alarming and 24x7 support in the framework of HEPIX.
Tier-1 sites should nominate their representative(s) to the TCG.
Experiments should express what they really need (not “nice to have” requests) in terms of interoperability.
The full Action List, current and past items, will be in this wiki page before next MB meeting.