LCG Management Board
Tuesday 7 July 2009 16:00-18:00 – F2F Meeting
(Version 1 – 17.7.2009)
A.Aimar (notes), D.Barberis, I.Bird(chair), K.Bos, M.Bouwhuis, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Litmaath, P.Mato, P.McBride, G.Merino, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout
M.Dimou, D.Kelsey, P.Mendez Lorenzo, R.Quick
Next Meeting: Tuesday 21 July 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received about the minutes. The minutes of the previous MB meeting were approved.
1.2 Approval of Security Policy Documents (VOMembershipManagement-v3.7.pdf; VORegistrationSecurity-v2.6.pdf) – D.Kelsey
D.Kelsey summarized the process followed and noted that the EGEE procedures were not clear about how long user data (logs, etc.) will be stored: the final agreed wording now says “one year”, which is the period that in all countries does not require special procedures.
Both Security Policy documents were approved by the WLCG MB.
R.Pordes noted that the policies agreed apply to the EGEE Sites of the WLCG only, not to the OSG. OSG have other policies in place.
D.Kelsey agreed to add to the document the comments received from OSG.
1.3 GGUS Notification to OSG Sites (Slides) – M.Dimou
M.Dimou presented the notification process of GGUS tickets to OSG Sites. She also provided links to documentation on the whole set of definitions and background information. See https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#GGUS_to_OSG_routing
The goal is to have for the OSG Sites, like for the EGEE Sites:
- Contact email for each OSG Site
- Emergency email for the OSG Tier-1 Sites, for alarm purposes only.
GGUS works by support units, not by Site or by project. Therefore “OSG” was defined as a “support unit” in the GGUS portal.
All information from OSG had been in flat files since 2008, not in a format usable in GGUS. Now this must be changed and the information must be extracted directly from the OIM database.
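As an illustration of the intended change, the sketch below parses a structured contact export instead of a flat file. The element names, attribute names and sample addresses are invented for this example; the real OIM schema differs.

```python
import xml.etree.ElementTree as ET

# Invented XML standing in for an OIM contact export; illustrates
# replacing flat files with a structured, queryable format.
SAMPLE_OIM_EXPORT = """
<resources>
  <resource name="BNL_ATLAS_Tier1" tier="1">
    <contact type="submit">ggus-tickets@example.bnl.gov</contact>
    <contact type="emergency">alarm@example.bnl.gov</contact>
  </resource>
  <resource name="Some_OSG_Tier2" tier="2">
    <contact type="submit">tickets@example-t2.edu</contact>
  </resource>
</resources>
"""

def contacts_by_site(xml_text):
    """Map site name -> {contact type: address}."""
    root = ET.fromstring(xml_text)
    return {
        res.get("name"): {c.get("type"): c.text.strip()
                          for c in res.iter("contact")}
        for res in root.iter("resource")
    }

def emergency_contacts(xml_text):
    """Emergency (alarm) addresses, defined for Tier-1 sites only."""
    root = ET.fromstring(xml_text)
    return {
        res.get("name"): res.findtext("contact[@type='emergency']")
        for res in root.iter("resource")
        if res.get("tier") == "1"
    }
```

With such a query, GGUS could refresh contact and alarm addresses automatically instead of re-importing flat files.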
The work is progressing and there are no problems to report.
1.4 OSG Tier 1 Contact Info (Slides) – R.Quick
R.Quick presented a summary of the progress since December 2008.
The GGUS Ticket Exchange with OSG is working smoothly.
- US ATLAS: 23 tickets were created in GGUS since June 1st. The average time from submission in GGUS to creation at the Tier-1 was 2.25 minutes, with a spread between 1 and 6 minutes. One Tier-3 ticket took ~4 hours, which is also acceptable.
- US CMS is not using direct routing, though they were invited to talks in December when OSG put this in place for US ATLAS. During the same time period only 2 US CMS tickets were created.
OSG Procedures for Alarm Tickets
They have added an optional SMS contact field to all OSG contacts in OIM. This is useful in the GGUS ALARM situation but also has potential for future use in OSG procedures, as well as expansion to Tier-2s if a future need arises.
Once this is in place and explained to the Tier 1 contacts, they will be allowed to choose to populate this field as they see fit. This can be worked out amongst the OSG Tier 1 managers and the WLCG VOs.
An address will be given to GGUS to query this field for Tier 1s, with proper authentication.
M.Dimou agreed that the turnaround time is adequate; what is important is that the data is available not via flat files but via DB queries. The address is already available, but it is up to the Tier-1 Sites and VOs to agree on using this procedure.
R.Quick agreed that the contact information can be available programmatically without any problem.
M.Ernst confirmed that there is an agreement and BNL will fill in their details once they have agreed internally.
I.Bird concluded that both GGUS and OSG have done their part. Now it is the OSG Sites and VOs that need to provide the correct contact information in the right places.
2. Action List Review (List of actions)
· 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.
L.Dell’Agnello stated that CNAF completed their internal tests and will send a report to R.Wartel. The Italian ROC security manager will also send his report.
· Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar
Not done by: DE-KIT, FR-CCIN2P3, NDGF, NL-Tier-1, US-FNAL-CMS
Sites can provide what they have at the moment. See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics
Sites should send URLs to existing information until they can provide the required information.
No progress since last week. All Sites reported that they will probably not be able to provide the XML file before the end of the month.
M.Kasemann noted that live metrics would be also very useful.
· A.Aimar to find out how to display SLS information directly from all Sites, without using the SLS interface, for July’s F2F Meeting; and also which metrics Sites are currently displaying.
Examples from A.Di Girolamo show the same metric for all Sites aggregated in a single web page.
· M.Schulz should report about the status of the glExec patch on passing the environment
Done in this meeting.
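The SLS aggregation mentioned in the actions above can be sketched as follows, assuming each site publishes an SLS-style XML file with named numeric values. The tag names, metric names and sample values are illustrative, not the exact SLS schema.

```python
import xml.etree.ElementTree as ET

# Invented SLS-like snippets standing in for the per-site XML files
# whose URLs the sites were asked to send.
SITE_FEEDS = {
    "RAL": "<serviceupdate><data>"
           "<numericvalue name='tape_write_mbs'>210</numericvalue>"
           "<numericvalue name='tape_read_mbs'>150</numericvalue>"
           "</data></serviceupdate>",
    "DE-KIT": "<serviceupdate><data>"
              "<numericvalue name='tape_write_mbs'>120</numericvalue>"
              "</data></serviceupdate>",
}

def parse_metrics(xml_text):
    """Extract the named numeric metrics from one site's feed."""
    root = ET.fromstring(xml_text)
    return {nv.get("name"): float(nv.text)
            for nv in root.iter("numericvalue")}

def aggregate(feeds):
    """One table per metric: metric name -> {site: value}.

    Sites that do not publish a given metric are simply omitted,
    which also makes the missing metrics visible."""
    table = {}
    for site, xml_text in feeds.items():
        for name, value in parse_metrics(xml_text).items():
            table.setdefault(name, {})[site] = value
    return table
```

Rendering `aggregate()`'s output as one HTML table would give the single web page with the same metric for all Sites.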
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.
All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
It was a quiet week again, with decreasing participation in the daily meetings (see slide 3). Hopefully this is because of the holiday season and not because Sites are only interested during short test periods like the STEP09 tests.
R.Tafirout noted that for TRIUMF the meetings take place at 06:00 AM and, as agreed with J.Shiers, they only participate when it is really necessary.
No alarm tickets this week, but a few incidents leading to SIRs:
- ATLAS post-mortem on PVSS/COOL
- FZK posted a post-mortem explaining their tape problems during STEP09
- RAL scheduled downtime for move to new Data Centre
- ASGC seems to be recovering
Slide 5 shows an example of the typical tickets for a VO (in this case LHCb):
- Jobs failed or aborted at
- gLite WMS issues at Tier-1
- Data transfers to Tier-1 failing (disk full)
- Software area files with root
- CE marked down but accepting
Slide 6 shows the availability plots; one can see that RAL is red for CMS, white for LHCb and green for ATLAS and ALICE.
3.2 Service Incidents Reports
PVSS Incident (ATLAS Post-mortem). (slides 7-9)
On Sunday afternoon (27-6) V.Khomutnikov from ATLAS reported to the Physics DB service that the online reconstruction was stopped because an error was returned by the PVSS2COOL application (on the ATLAS offline DB). The error had started appearing on Saturday (26-6) evening.
FZK tape problems during STEP09 (slide 10)
- An update to fix a minor problem in the tape library manager resulted in stability problems
- Possible cause: SAN or library configuration
- Both were tried and the problem disappeared, but which one was the root cause is unknown
- The second SAN had reduced connectivity to dCache pools: not enough for CMS and ATLAS at the same time → CMS was asked not to use tape
First week of STEP09
- Many problems: hw (disk, library, tape drives), sw (TSM)
Second week of STEP09
- Adding two more dedicated stager hosts resulted in better stability
- Finally getting stable rates of 100-150 MB/s
A.Heiss added that FZK will repeat the STEP09 tests in agreement with ATLAS and CMS. The intervention on tapes was supposed to be transparent and the Experiments were informed.
I.Bird replied that the changes were never announced to the WLCG, and they should have been. It was agreed that all interventions should be announced and reported to the whole WLCG project. Announcing it to the Technical Advisory Board at FZK is not enough and is not what was agreed.
I.Bird, J.Shiers and A.Heiss agreed to clarify the issue after the meeting.
RAL scheduled downtime for DC move (slide 11)
Friday 3/7: reported still on schedule for restoring CASTOR and Batch on Monday 6/7.
Despite presumably hectic activity with equipment movements, RAL continued to attend the daily conf call
Planning and detailed progress reported at: http://www.gridpp.rl.ac.uk/blog/category/r89-migration
ASGC Instabilities (slide 12)
ATLAS reported instabilities at the beginning of the week. CMS allowed the full week grace period for ASGC to recover from all its problems. Both ATLAS and CMS specific site tests changed from Red to Green during the week.
On Friday 3/7 Qin Gang reported that tape drives and servers are online.
4. Update on the HEP SSC Preparation (Prelim. Call Info.; Slides; Document) – J.Shiers
J.Shiers reported on the workshop in Paris about the preparation of the HEP SSC proposal.
Details are at: https://twiki.cern.ch/twiki/bin/view/LCG/HEPSSCPreparationWiki
Other material is also available:
- The EGI_DS “Blueprint” document describes potential role of “Specialised Support Centres”
- Within the context of EGEE NA4, several preparation meetings have been held. Most recently: May in Athens, Paris in July. See Indico for agendas and presentations
- In June there was an Information Day in Brussels which clarified the specific areas targeted by this call – as well as possible funds.
- More information on “HEP SSC” was given at the recent OB meeting
4.1 Sections concerning WLCG
In particular the sections of interest to the WLCG MB are:
– 1.2.1.1 “EGI” - including “generic” services and operation required by WLCG (e.g. GGUS, etc. – “the usual list”)
– 1.2.1.2 Services for large existing multi-national communities
– The funding for 1.2.1.1 + 1.2.1.2 = EUR25M; a joint proposal is expected
– Some people say/think that there is EUR5M for 1.2.1.2 (AFAIK not written down anywhere) and that the EUR5M should be shared with at least one other large community (other than WLCG)
– 1.2.3 “Virtual Research Communities” = EUR23M
– Currently 2-3 “SSC” proposals are foreseen; ideally there should be one, but this is not converging.
– P2: combining Astronomy & Astrophysics, Earth Science, and Fusion;
– P1: combining the training, dissemination, business outreach;
– P0: combining the other scientific SSCs (high-energy physics, life science, computational chemistry and material science, grid observatory, and complex systems).
Our stated plan for the “HEP SSC” is for a EUR10M project over 3 years, with 50% of the funding coming from the EU, dependent on details such as exact scope, partners, etc.
Also other possible areas of funding, e.g.
– 1.2.2 m/w (a separate (important) topic, not this talk);
– Others: probably too much fragmentation: focus on the above 2 (3) areas
Obviously, what we target in the sum of all 3 areas should be consistent and meet our global needs.
The Workshop in Paris confirmed the bottom-up approach, i.e. collecting needs from the HEP community and partners. The FP7 information day has helped to clarify how much funding might be available and for which specific purposes.
4.2 WLCG Input Required and Timeline
WLCG must provide input on its needs on each section:
- 1.2.1.1/1.2.1.2: Services that we should target
- 1.2.3: Goals and work plan of a “HEP SSC”. It is clear that this is “more than LHC”, “more than HEP” – the exact scope is still to be defined urgently.
Proposed timeline: first draft prior to “meeting” during next week’s MB slot.
- Before end July: Need input from LHC experiments
- 2nd August: Concentrated proposal writing
- September: Reviews and revisions
I.Bird noted that WLCG should specify everything that is needed in the proposals. Not relying on other external proposals. Evolution of the current services may imply some development.
J.Gordon asked whether 1.2.1.1 and 1.2.1.2 are in a single proposal and J.Shiers replied positively.
4.3 Proposal Writing Timetable
Into this schedule we must also fit:
- 1-2 meetings with EU commission (in Brussels?)
- Several SSC and other preparation meetings in a wider context
- “SEPT’09”, or whatever the future tests will be called.
- LHC restart preparations
4.4 EGI_DS Status of the Document
Application and Community Support chapter: current draft is 0.7
Plan is to incorporate comments / corrections from yesterday’s meetings, plus further input from Application Communities, well before end July. Versions 0.8, 0.9 will be needed.
A further revision – 1.0 – will be made early September for consistency with draft “EGI” proposals. Specific changes include:
- Manpower required; size of community affected; specific call areas targeted (e.g. 1.2.1.2, 1.2.3)
- Input on Services from large communities, in particular WLCG, plus others
- “SSC” input from HEP (other than WLCG) and other communities
The timescale for writing proposals is very tight. The transition document should be a document in itself, not cut and paste from a document with a different purpose.
Target the last two weeks of August for intensive writing of the (1.2.1), 1.2.2, and 1.2.3 sub-proposals.
September will be used for multiple reviews prior to public presentation during EGEE’09.
Two 2-hour sessions for the “HEP SSC”: more than HEP, more than an SSC.
For WLCG, September may well include a “STEP’09” rerun followed / overlapping with preparations for LHC restart and data taking.
It is imperative to respect this timescale and not over-commit.
J.Gordon noted that there were partners for other areas, but asked who the partners in the HEP community are.
J.Shiers replied that 3 partners are needed in addition to CERN. INFN and FZK have participated in the meetings. INFN found the past model of collaboration a good setup.
4.5 EGEE09 Sessions
There will be dedicated meetings during the EGEE 09 conference.
– Tuesday 22:
– Wednesday 23:
– Thursday 24:
– Friday 25:
– Suggest a small F2F both Thursday and Friday pm
J.Shiers added that for the moment all Operations and Middleware Releases are also included. Whether they will be in this proposal or in another one remains to be seen. But it is important to prepare the full picture of what is needed and then see how it is split into separate proposals if more appropriate.
M.Kasemann asked whether general services and Experiment-specific services will be included. For instance FTS is a basic service and PhEDEx is built on top. Are they both included?
I.Bird replied that the proposal is about a thin layer of general services. Services like FTS belong in a middleware call; if they are not covered there, they will be included in this call. One should find a way to include the Experiments’ services too.
O.Smirnova noted that innovation is an important quality of the proposals.
I.Bird replied that innovation will be a clear product of the proposal: the support of a European project of the size of the WLCG is already innovative in itself.
5. CMS QR Report 2009Q2 (updated slides) – M.Kasemann
M.Kasemann presented the 2009Q2 quarterly report for CMS.
5.1 Tier-1 and Tier-2 Sites Readiness
The Site readiness is closely monitored for all Tier-1 and most Tier-2 sites: the tools were finalized in early 2009, with reports and follow-up during weekly Facility Operations meetings. This quarter there were additional meetings focusing on Asian, Russian and Turkish sites.
Substantial improvement is observed for a large number of sites.
Sites below 60% in March with big improvements until June are now > 80% ready: BR-UERJ, KR-KNU, US-Caltech, ES-IFCA, AT-Vienna, UK-London-IC, RU-ITEP, UK-Bristol; and > 60% ready: IN-TIFR, IT-Rome, RU-JINR, TR-Metu, RU-SINP.
The CMS Site Readiness web page is here:
The slides compare the Tier-1 Sites readiness over one month in June 2009 with March 2009, and the Tier-2 Sites in June (36 sites > 80%, 10 sites < 60%) with March 2009 (26 sites > 80%, 21 Sites < 60%).
The averages and the “jitter” in these plots are:
• Tier-1: average is 5(+2 or -1), the spread is 3 = 60%
• Tier-2: average is 43, spread is 6 + more sites will get ready.
This is still not production quality.
5.2 STEP09 for CMS
The CMS Emphasis was on:
- Tier-0: Data recording in parallel with other experiments
- Tier-1: Tape access, testing simultaneously pre-staging, processing
- Tier-2: Analysis at Tier-2 :Demonstrate ability to use 50% of pledged resources with analysis jobs
- Tier-1 → Tier-1: replicate 50 TB (AOD synchronization) between all Tier-1s
- Tier-1 → Tier-2: stress Tier-1 tapes, measure latency in transfers to Tier-2
The final report will be given during the WLCG STEP09 post-mortem workshop on 10/11 July.
Tier-0 Tape Writing
The target of 500MB/s was exceeded in both testing periods.
- Structure in first period due to problems in disk pool management
- Monitoring of tape writing and reading rates per VO can be improved
CMS had 2 weeks of tests, one overlapping with ATLAS’ activity.
Tier-1 Tape Writing
For reprocessing of MC, the required tape read rate between 50-250 MB/s was tested, calculated according to the amount of data to be stored at each Tier-1 centre. Overlapping tests with ATLAS were performed at some Tier-1 centres.
Preliminary results at Tier-1 centres are being studied. Some sites met the metrics every day of the test; other Tier-1s met the metrics approximately 75% of the time. At some centres, for instance FZK, the configuration and the overall stability have to improve and the tests have to be repeated.
The bottlenecks were generally in the underlying tape systems and not in the ability of CMS to request data staging.
CMS is carefully checking with the sites the implementation of tape families at each Tier-1 centre, which tends to concentrate data needed together on the same physical tapes.
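As a toy illustration of why tape families help (dataset and file names invented, and the tape-filling model deliberately simplistic):

```python
from collections import defaultdict

def assign_tape_families(files):
    """Group files by their dataset so that recalling a dataset
    touches few tapes.

    `files` is a list of (filename, dataset) pairs; the returned dict
    maps a family (here simply the dataset name) to its files."""
    families = defaultdict(list)
    for name, dataset in files:
        families[dataset].append(name)
    return dict(families)

def mounts_needed(dataset, families, files_per_tape=2):
    """Toy estimate of tape mounts to recall one dataset, assuming
    each family fills its own tapes with `files_per_tape` files each."""
    n = len(families[dataset])
    return -(-n // files_per_tape)  # ceiling division
```

Without families, the same files would be scattered over tapes shared with unrelated data, and recalling a dataset would mount far more tapes.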
Additionally, CMS Data Operations will be more actively managing the files expected to be on disk.
Tier-1 Staging and Re-processing
Exercise rolling re-reconstruction:
- Pre-stage 1 day worth of data and process it the next day
- Minimize disk consumption
- Maximize CPU efficiency because input is on disk
Pre-staging was used for the first time in this planned and organized manner.
Very good performance of all sites under multi-VO load. CPU efficiency with and without pre-staging was measured and will be followed up, but not yet for all Sites; on some Sites pre-staging did not provide any improvement.
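The rolling scheme described above can be sketched as a toy schedule (illustrative only, not CMS's actual workflow tooling): each day's data is pre-staged from tape while the previous day's, already on disk, is processed, after which its disk space is released.

```python
def rolling_schedule(days):
    """Per-day actions for the rolling re-reconstruction model:
    pre-stage day N while processing day N-1, then free day N-1's disk."""
    plan = []
    for n, dataset in enumerate(days):
        actions = [f"pre-stage {dataset}"]
        if n > 0:
            actions.append(f"process {days[n - 1]} (input already on disk)")
            actions.append(f"release disk of {days[n - 1]}")
        plan.append(actions)
    # one trailing day to process the final pre-staged dataset
    plan.append([f"process {days[-1]} (input already on disk)",
                 f"release disk of {days[-1]}"])
    return plan
```

At most one day's worth of data sits on disk at a time (minimal disk consumption), and every processing job reads from disk (maximal CPU efficiency).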
Tier-1 Transfer Tests
Emphasis on Tier-1→Tier-1 transfers exercises:
- Use AOD synchronization between Tier-1 sites after re-reconstruction
- Synchronize 50 TB of data between Tier-1 sites, two tries
All sites participated. The transfer test was completed satisfactorily and the links provided good rates between all sites.
Analysis Tests at 49 Tier-2 + 8 T3
The test measured the percentage of the analysis pledge used with standard analysis jobs: reading data, no stage-out to other Tier-2s.
CMS was capable of filling the majority of sites at their pledges, or above; in aggregate more than the analysis pledge was used. The success rate was roughly 80%, and 90% of the failures were read errors. A total of almost a million jobs was run.
5.3 Preparation for Data Taking
Computing shifts will start for CRAFT cosmics running (July 22) with presence at CERN, FNAL or other CMS centre required.
For description see: https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts
5.4 SL5 Migration
The GDB recommends starting migration after STEP09. CMS is recommending to all CMS sites to migrate as soon as possible; the proposed migration deadline is September 1st, 2009.
The current status is:
- 6 Tier-1 Sites will be done by September
- IN2P3 50% by Sept., 100% by end of 2009
- 25 Tier-2s will be done by September, 6 will not; an answer is pending from the rest.
F.Hernandez added that IN2P3 will provide by September a test setup with several CEs for the Experiments to test and approve before installing. After this during September all resources will be migrated.
J.Gordon added that most Tier-1 Sites are doing the same: First a test set for the Experiments and then the whole migration.
Coordinated by Facility Operations, site polling (https://twiki.cern.ch/twiki/bin/view/CMS/Poll-Tier-1Tier-2-SLC5), together with Offline for software validation, is checking that Sites migrate to SL5.
CMS SLC5 migration plan at CERN
- Migrate 10% by today/tomorrow, to be checked by the CMS Tier-0 teams
- If OK, migrate the rest of the CMS Tier-0 + CAF by July 19.
CMS SL5 migration documentation is here:
5.5 Future Tests and Production
As a STEP09 follow-up there will be targeted tests at some Tier-1 Sites to verify that problems are solved; if needed, together with ATLAS.
An analysis end-to-end test is planned, its goals are:
- For computing: verify that all processing steps are “luminosity-calculation safe”, i.e. with no unaccounted loss of events
- These are a series of functional tests at Tier-0, Tier-1 and Tier-2.
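A hypothetical sketch of the bookkeeping such a "no unaccounted loss of events" check implies (simplified; the real CMS accounting is more elaborate, e.g. per luminosity section):

```python
def accounting_ok(counts_in, counts_out, known_losses=None):
    """Compare per-block event counts before and after a processing step.

    Any deficit must be explained by `known_losses` (e.g. events
    deliberately filtered and logged); otherwise the step is not
    luminosity-calculation safe. Returns {block: unexplained deficit};
    an empty dict means every event is accounted for."""
    known_losses = known_losses or {}
    problems = {}
    for block, n_in in counts_in.items():
        n_accounted = counts_out.get(block, 0) + known_losses.get(block, 0)
        if n_accounted != n_in:
            problems[block] = n_in - n_accounted
    return problems
```

Running such a comparison after every Tier-0/Tier-1/Tier-2 step is what turns the functional tests into a luminosity-safety check.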
Major MC production for 2009 LHC data is about to start:
- CMSSW release expected in a few days
- Plan for 4 weeks for validation
- An initial sample of ~200M events is required for 2009 analysis
- The plan is to finish production in September 2009.
2-day CMS Global runs performed since March, about every week
- Long Cosmics run starting July 22.
- Monte Carlo Production at slower rate all the time.
STEP09 was a valuable exercise with many tests overlapping with ATLAS and others. More information at the WLCG workshop July 9-10.
Big improvement observed for stability and readiness of Tier2 sites. Tier-1 sites need to finish upgrades, need to show stability. More specific tests will be performed where needed.
An analysis end-to-end test is planned in late summer. A large MC production is prepared and will start in a few days with the new version of CMSSW.
6. GlExec Update (Slides) – M.Schulz
Pilot services have seen first tests by LHCb and ATLAS at NIKHEF and Lancaster. The problem with expired VOMS attributes has been solved on the pilot.
LHCb reported first tests of the environment conservation scripts; they seem to work, and the integration with the DIRAC framework has started.
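glExec sanitizes the environment when switching identity, so such scripts must explicitly carry selected variables across the switch. A minimal illustrative sketch in Python (the actual gLite scripts are shell wrappers, and the variable whitelist below is invented):

```python
import json

# Invented whitelist of prefixes worth preserving across the switch.
SAFE_PREFIXES = ("LHCB_", "DIRAC_", "X509_")

def snapshot_environment(environ):
    """Serialize the whitelisted variables before invoking glExec
    (the real scripts write this to a file readable by the target user)."""
    kept = {k: v for k, v in environ.items() if k.startswith(SAFE_PREFIXES)}
    return json.dumps(kept, sort_keys=True)

def restore_environment(snapshot, environ):
    """Re-export the saved variables on the target-identity side."""
    environ.update(json.loads(snapshot))
    return environ
```

The pilot snapshots its environment, invokes glExec, and the payload wrapper restores the snapshot before starting the user job.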
Latest gLite production release (update 49) includes it. http://glite.web.cern.ch/glite/packages/R3.1/updates.asp
SCAS and glExec have been released to production, with known issues; the script is still in certification.
The known issues below are addressed in Patches 3084 and 3050; both are ready for certification:
- User ban plugin in LCAS will not work
- Malformed proxies crash glExec
- 5 or more proxy delegations make the voms-api segfault
The wrapper has been given to OSG for inspection; they will come back next week with feedback. Tests with the production infrastructure are still needed. The pilot is available: LHCb reports that the configuration of SCAS and glExec is complex and error-prone. SA3 is following up on this.
There is a glExec-WN packaging problem: for sites that install gLite on shared file systems, glExec has to be provided independently. There was a phone conference on July 6th and a general consensus was reached (on 2.5 routes), but there is no timeline up to now. M.Litmaath can provide more details if needed.
Integration with ARGUS, the new auth framework, will also be tested but is a longer term goal.
7. AOB
It is not certain that there will be an August F2F MB meeting. It depends on whether the August GDB will take place; most likely it will not.
8. Summary of New Actions