LCG Management Board

Date/Time

Tuesday 02 December 2008 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=45195

Members

http://lcg.web.cern.ch/LCG/mb.htm

 

(Version 1 – 05.12.2008)

Participants

I.Bird (chair), T.Cass, Ph.Charpentier, M.Ernst, I.Fisk, S.Foffano (notes), F.Giacomini, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, P.Mato, G. Merino, S.Newhouse, A.Pace, B.Panzer, D. Qing, H.Renshall, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 9 December 2008 16:00-17:00 – Face to Face Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments on the minutes. The minutes of the previous MB meeting were therefore approved.

 

2.   Action List Review (List of actions)
 

·          Status of SCAS

Reviewed as agenda item of the meeting (point 6.)

·         SRM V2 in the GridView Reports

The MB confirmed last week the move to use SRM V2 for the GridView Reports on 1st December. I. Bird reported that this had been done, i.e. the SRM V2 results replace the previous SRM V1 and SE tests. No feedback had been received apart from Rutherford, who confirmed it was OK; I. Bird therefore assumed everybody was prepared for this.

 

F. Hernandez commented that the link on the MB action list included the results of the SE and SRM V1 tests, and it was therefore difficult to check whether the results were coherent.

 

New Action:

09 Dec 2008 – I. Bird will check with the SAM team and report back to the MB.

·         VO Box SLAs

Waiting for CMS follow-up at IN2P3. As CMS was not present this will be followed up next week.

 

I. Bird asked for progress from NL-T1 and NDGF on their SLAs.

J.Templon confirmed it was on the T1 meeting agenda for 3rd December and therefore hoped for progress as a result of that meeting.

O.Smirnova reported that their SLAs were on the board of the NDGF Directors and she expected they would be released soon. For review at the MB of 9th December 2008.

·         Sites to comment on the proposal by F. Donno for reporting installed capacity

For review at the MB of 9th December 2008.

·         Comments on High-Level milestones

For review at the MB of 9th December 2008.

·         Dataflow information from the experiments

As only LHCb has provided this and they were the only experiment attending the meeting, this was also left for review at a future meeting. However, I.Bird reminded the other experiments that this information should be provided.

·         GCC 4.3 co-ordination and timescale

As a matter arising, M.Schulz reported that the information flow about what is discussed at the LCG MB is not working efficiently. Following a recent meeting of the Architects Forum, it became clear that the people doing the port to SLC5 have been taken by surprise despite the several discussions in the LCG MB and in the Physics Service Meeting on the subject in early Spring.

 

P.Mato replied that there was no formal decision about when the migration to SLC5 would happen.

T. Cass disagreed, stating that this was discussed in January or February 2008 with an option given to the experiments about doing it or not, and finally an agreement to move during the current shutdown.

I. Bird confirmed this and added that there had been several comments at a recent MB about making sure that things would work with GCC 4.3; however, it seems there is little activity in the experiments and Applications Area in providing things on 4.3.

P.Mato contested this, stating that everybody is working on adapting things to work with the 4.3 compiler; however, for the experiments to validate it, the lower level packages need to be provided.

 

I.Bird stated that confusion has come from the sense of urgency injected in the MB discussion a few weeks ago. Better coordination is needed with an agreement on the timescale about what should be ready by when.

P.Mato replied that within weeks the middleware should be ready, which would then give the experiments some months to test.

M. Schulz confirmed that the first compatibility libraries have been ready for almost 1.5 months but have not yet been tested, indicating that synchronization is not working effectively.

 

Ph.Charpentier remarked that such a technical discussion should take place at the Architects Forum.

I.Bird stressed that the issue was not a technical one, but one of co-ordination of which version of which compiler on what timescale.  

Ph.Charpentier mentioned that the middleware comprises different parts and recalled that the libraries for LFC, gfal, CASTOR, DPM and DCache are those really needed.  I. Bird asked when the experiments will be ready to test this.

P.Mato confirmed that in the Applications Area the releases for January are currently being built, including SLC5 with the GCC 4.3 compiler. As soon as the library of the middleware client software is available it can be used and testing by the experiments can begin.

P.Mato reminded the MB of the strategy he had already presented with 3 parallel ongoing tasks of different priorities.

 

M. Schulz commented that it had been presented at the Architects Forum as if the unavailability of client libraries for 4.3 was blocking the experiments from making progress.

P.Mato and Ph.Charpentier contested this.

I.Bird concluded that there is confusion which must be clarified and reported back to the MB.

 

New Action:

16 Dec 2008 – P.Mato and M.Schulz to follow-up on this discussion, agree on co-ordination and timescale and report back to the MB.

·         VO Specific SAM tests

Not completed.

 

I. Bird reminded everybody to update the information on the Wiki page, noting the dates for ATLAS and LHCb were updated but not for ALICE and CMS.

 

3.   LCG Operations Weekly Report (Slides, Daily meetings) – J.Shiers

 

J. Shiers presented a summary of status and progress of the LCG Operations through the week.

3.1      Summary of the Week

There were continuing problems with the CASTOR service at ASGC, related to the Oracle service behind it. The problems have been resolved; however, lessons can be learnt about reporting. The only other problems were a cooling problem already reported at last week’s MB and a minor VOMS incident, the post-mortem of which will be reviewed at next week’s MB.

3.2      ASGC CASTOR and SRM Services

CASTOR and SRM services were effectively down for 1 month, largely due to a series of Oracle problems and ongoing problems trying to recover the situation. 2 weeks ago a clean Oracle install was proposed. In retrospect, had this been followed, 2 weeks could probably have been gained. The ongoing recommendation is to follow closely the recommended Oracle services at CERN on a standard configuration. ASGC plan to do this, which will help future follow-up. From 25th November CASTOR and SRM services were resumed.

3.3      GGUS Summary

A similar number of tickets was opened as last week (46 vs. 44), with none opened by ALICE, 1 test alarm ticket from ATLAS and the rest from CMS and LHCb.

3.4      Baseline versions update

gLite 3.1 SL4 patch 2643 was planned for Nov 27th. J. Shiers reminded sites that when releases were put in the production repository it was up to the sites to pick them up and plan deployment.

3.5      Problem Reporting

At the last MB F2F it was agreed that a site had 24 hours in their time zone (working week) to respond to major problems and try to fix them. After 1 day a report is expected at the daily operations meeting, with regular updates by email or by updating the wiki. This didn’t happen with the recent ASGC Oracle problems, particularly when the problem was resolved. J. Shiers gave a recent example: on Thursday 27th November LHCb reported issues with the pre-staging rates obtained at PIC. This was investigated and Gonzalo joined the meeting on 28th November with an explanation and status update. J. Shiers stressed that monitoring and reacting quickly, as PIC did, helps to understand the global picture.

3.6      Storage workshops

·         CASTOR F2F meeting at RAL Feb 18-19 (to be confirmed)

·         dCache workshop at FZK Jan 14-15 (agenda)

These workshops have overlapping agendas with the common goal of getting storage services at the main WLCG sites into sustainable, stable operational status as early as possible in 2009. Quite a few people have registered already. Ph. Charpentier questioned if experiments should be present or if it was more site related. J. Shiers confirmed the latter, suggesting a summary at an appropriate GDB meeting which was agreed to by J. Gordon.

3.7       Final Summary

The WLCG Operations Page https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsWeb is linked directly from the WLCG home page. The WLCG Operations mailing list is wlcg-operations@cern.ch. The WLCG “Service Coordinator on Duty” mailing list is wlcg-scod@cern.ch, which is recommended in case Jamie and/or Harry are absent.

 

F. Hernandez questioned the process for the FTS deployment, asking if sites will be notified to deploy, or if they have to monitor the status of the software repository. J. Shiers confirmed that it has gone through testing and is now in a state that sites should plan to deploy it as soon as possible, adding that ATLAS would like it deployed in the coming weeks.

 

4.   New CPU Unit (Slides) – G. Merino

 

 

I. Bird invited G. Merino to update the MB on the progress of the Working Group looking at the migration to the new CPU unit.

 

G. Merino explained the process, which was mainly by email, to agree on the dataset with which to work, and to hear suggestions from participants for a conversion factor. There was good participation from the HEPiX WG members and CMS, one email from ATLAS but no input from ALICE or LHCb. One phone conference was well attended, apart from LHCb. During this phone conference agreement was reached between the HEPiX and experiment participants that the old benchmark breaks for new processor architectures, and therefore the results for new architectures, e.g. Harpertown, the new Intel architecture, are relevant.

 

When the experiments ran their code on the 8 LXBENCH machines in May, results of which were presented to the MB, all but ATLAS missed benchmarking on the lxbench08 machines (those with the most modern Harpertown processor). Therefore all experiments need to complete running their applications on lxbench08 so that the dataset on which the conversion factor will be based can be completed. For ATLAS the data is available; CMS have confirmed availability before end 2008; for ALICE this represents 1 day of work, and it is to be confirmed that this can be done before end 2008.

 

G. Merino summarized that progress is slow, mainly due to lack of participation in the e-mail discussions, particularly from ATLAS, ALICE and LHCb. The case for LHCb is particularly worrying since no feedback has been received by e-mail and there was no participation in the phone conference. Ph.Charpentier commented that the LHCb representative had been absent but he would follow this up with him.

 

G. Merino plans to present a conversion factor proposal to the MB in January.

 

I. Bird requested that this be discussed in early January, and pointed out that even if not all experiments participate, the proposal should be made based on the other input received, to which G. Merino agreed.

 

New Action:

13 Jan 2009 – G. Merino to present a new conversion factor proposal to the MB.

 

5.   User Analysis Working Group – M. Schulz

 

 

M. Schulz presented an update of the User Analysis Working Group, stating that input had been received from LHCb and from ATLAS, who provided a very detailed list including several links to relevant information. ALICE had promised input soon, and a summary of the CMS analysis model and experience gained with it was expected by the end of the week. Analysis of the material has started and will continue, with a first description of the use cases planned for the end of the year.

 

6.   Updates on gLexec/SCAS and CREAM-CE – S. Newhouse (Slides)

 

 

S. Newhouse, the new EGEE3 Technical Director, was introduced to give an update on gLexec, SCAS and CREAM-CE.

 

The main milestones were summarised, namely making the CREAM-CE capable of large-scale direct job submission and, in parallel, making gLexec and SCAS capable of large-scale use on the worker node in logging-only mode. These two work items would be brought together to provide the full functionality of large-scale multi-user direct job submission. The three threads representing ongoing and future work were described: the gLexec thread, which is working for small deployments but not currently in logging-only mode; the CREAM-CE thread, with a patch for wider-scale deployment that is now ready for certification; and, in the longer term, the integration of ICE and its debugging into the WMS. The plan is to have the WMS submitting into CREAM at scale with multi-user pilot jobs ready within about 2 months.

 

Ph.Charpentier questioned why gLexec, SCAS and CREAM were put together as CREAM is not needed for multi-user pilot jobs.

I. Bird commented that gLexec and SCAS are needed for the multi-user pilot jobs and CREAM is needed later for multiple users. 

I. Bird questioned the deployment timescale of gLexec and SCAS; a new patch is expected within 2 or 3 weeks which will hopefully fix the certification issues.

 

M. Schulz detailed the SCAS certification process including functional and stability testing taking the timescale to January 2009.   Authorisation service development activities are happening in parallel with integration and scalability testing planned for January and February 2009 hopefully entering certification mid-March. Deployment is planned to be incremental and therefore certification will also be incremental. The certification and deployment plan for the Authorization Service is still under discussion.

 

I.Bird asked what problem the authorization service is solving for LCG in the next year.

S.Newhouse replied that the aim was to replace the multiple ways of doing authorisation by a single more sustainable solution that would integrate both local and remote (VO) policies. However the policies still need to be worked out.

J.Gordon asked if somebody could present this at the GDB next week, particularly the impact for sites moving to this.

I. Bird asked what impact this parallel deployment would have if gLexec and SCAS are deployed early next year.

J.Templon referred to a recent discussion in the Technical Management Board (TMB http://indico.cern.ch/conferenceDisplay.py?confId=32238) and suggested waiting until the deployment plan has been proposed to the TMB to present at the GDB.

 

J.Gordon reiterated the request for a presentation at the GDB, not an internal technical discussion, but rather a description of what will be affected, which services will be replaced, which will be reconfigured and what the ramifications are. It was agreed a presentation would be made at the January GDB. [It was later agreed to do this at the February GDB meeting].

 

I. Bird reiterated the urgency to get gLexec, SCAS and, in parallel, CREAM working.

F.Giacomini questioned the list of CREAM certification acceptance criteria presented at the last GDB. A modified list is being worked on and this will be reported on at next week’s GDB by John Gordon.

7.   AOB

 

 

Ph.Charpentier asked if there would be a pre-GDB next week.

J. Gordon will confirm. [It was later confirmed that no pre-GDB will take place].

 

J. Gordon questioned the timescale and position of the MB following the official data-taking notification.

I. Bird explained that any currently circulating information was not official; however, it was expected that a new schedule would be presented at next week’s Council Meeting on 10th December. Following this, the MB should discuss whether it changes the current LCG position.

 

I. Bird announced that the next mini review of LCG will be a one-day review on 16th February during the LHCC week.

8.    Summary of New Actions

 

 

New Action:

09 Dec 2008 – I. Bird will check the SRM V2 link in the action list with the SAM team and report back to the MB.

 

New Action:

16 Dec 2008 – P. Mato and M. Schulz to follow-up on the GCC 4.3 discussion, agree on co-ordination and timescale, and report back to the MB.

 

New Action:

13 Jan 2009 – G. Merino to present a new CPU unit conversion factor proposal to the MB.