LCG Management Board

Date/Time: Tuesday 7 July 2009, 16:00-18:00 – F2F Meeting
Agenda
Members
(Version 1 – 17.7.2009)
Participants: A.Aimar (notes), D.Barberis, I.Bird (chair), K.Bos, M.Bouwhuis, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Litmaath, P.Mato, P.McBride, G.Merino, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout
Invited: M.Dimou, D.Kelsey, P.Mendez Lorenzo, R.Quick
Action List
Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/
Next Meeting: Tuesday 21 July 2009, 16:00-17:00 – Phone Meeting
1. Minutes and Matters Arising (Minutes)
1.1 Minutes of Previous Meeting

No comments were received about the minutes. The minutes of the previous MB meeting were approved.

1.2 Approval of Security Policy Documents (VOMembershipManagement-v3.7.pdf; VORegistrationSecurity-v2.6.pdf) – D.Kelsey

D.Kelsey summarized the process followed and noted that the EGEE procedures were not clear about how long user data (logs, etc.) will be stored: the final agreed wording now says "one year", which is the period that does not require special procedures in any country.

Both Security Policy documents were approved by the WLCG MB.

R.Pordes noted that the agreed policies apply only to the EGEE Sites of the WLCG, not to the OSG. OSG has other policies in place. D.Kelsey agreed to add the comments received from OSG to the document.

1.3 GGUS Notification to OSG Sites (Slides) – M.Dimou

M.Dimou presented the notification process of GGUS tickets to OSG Sites. She also provided links to documentation on the whole set of definitions and background information. See https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#GGUS_to_OSG_routing

The goal is to have for the OSG Sites, as for the EGEE Sites:
- A contact email for each OSG Site
- An emergency email for the OSG Tier-1 Sites, for alarm purposes only.

GGUS works by support units, not by Site or by project. Therefore "OSG" was defined as a "support unit" in the GGUS portal.

All information from OSG had been kept in flat files since 2008, not in a format usable by GGUS. This must now be changed and the information must be extracted directly from the OIM database.

There have already been three meetings with OSG, and OSG agreed, also at the GDB, to make this information available in OIM [Dec Slides and Jan Slides]. All supporting material is indexed here. The work is progressing and there are no problems to report.

1.4 OSG Tier 1 Contact Info (Slides) – R.Quick

R.Quick presented a summary of the progress since December 2008.

Overview
The GGUS ticket exchange with OSG is working smoothly.
- US ATLAS: 23 tickets were created in GGUS since June 1st. The average ticket creation time, from submission in GGUS to creation at the Tier-1, was 2.25 minutes, with a spread between 1 and 6 minutes. One Tier-3 ticket took ~4 hours, which is also acceptable.
- US CMS is not using direct routing, though they were invited to the talks in December when OSG put this in place for US ATLAS. During the same time period only 2 US CMS tickets were created.

OSG Procedures for Alarm Tickets
OSG has added an optional SMS contact field to all OSG contacts in OIM. This is useful in the GGUS ALARM situation but also has potential for future use in OSG procedures, as well as expansion to Tier-2s if a future need arises.

Once this is in place and explained to the Tier-1 contacts, they will be allowed to choose to populate this field as they see fit. This can be worked out amongst the OSG Tier-1 managers and the WLCG VOs.

An address will be given to GGUS to query this field for Tier-1s, with proper authentication.

M.Dimou agreed that the turnaround time is adequate, but what is important is that the data is not available via flat files but via DB queries. The address is already available, but it is up to the Tier-1s and VOs to agree on using this procedure.

R.Quick agreed that the contact information can be made available programmatically without any problem.

M.Ernst confirmed that there is an agreement and BNL will fill in their details once they have agreed internally.

I.Bird concluded that both GGUS and OSG have done their part. It is now the OSG Sites and VOs that need to provide the correct contact information in the right places.
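As an illustration of the programmatic access discussed above, the sketch below shows how a GGUS-side tool might query an OIM-style contact endpoint over an authenticated HTTPS connection and extract the alarm and SMS contacts for the Tier-1s. The endpoint URL, certificate paths and XML field names are hypothetical assumptions for illustration only, not the actual OIM interface.

```python
# Minimal sketch, assuming a hypothetical OIM-style XML endpoint and field names.
# Real OIM/GGUS integration details (URL, schema, authentication) differ.
import ssl
import urllib.request
import xml.etree.ElementTree as ET

OIM_URL = "https://oim.example.org/contacts?facility=Tier1"   # hypothetical endpoint
CERT = "/etc/grid-security/hostcert.pem"                      # hypothetical paths
KEY = "/etc/grid-security/hostkey.pem"

def fetch_tier1_alarm_contacts():
    """Fetch the XML contact list and return {site: (alarm_email, sms)}."""
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=CERT, keyfile=KEY)  # client-certificate authentication
    with urllib.request.urlopen(OIM_URL, context=ctx) as resp:
        root = ET.fromstring(resp.read())
    contacts = {}
    for site in root.findall("site"):                # assumed schema
        name = site.findtext("name")
        email = site.findtext("alarm_email")
        sms = site.findtext("sms")                   # the optional SMS field
        contacts[name] = (email, sms)
    return contacts

if __name__ == "__main__":
    for site, (email, sms) in fetch_tier1_alarm_contacts().items():
        print(f"{site}: alarm={email} sms={sms or 'not set'}")
```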
2. Action List Review (List of actions)
· 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

L.Dell’Agnello stated that CNAF completed their internal tests and will send a report to R.Wartel. The Italian ROC security manager will also send his report.

· Each Site should send the URL of the XML file with the Tape Metrics to A.Aimar.

Not done by: DE-KIT, FR-CCIN2P3, NDGF, NL-Tier-1, US-FNAL-CMS. Sites can provide what they have at the moment. See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics
Sites should send URLs pointing to their existing information until they can provide the full required information. No progress since last week. All Sites reported that they will probably not be able to provide the XML file before the end of the month. M.Kasemann noted that live metrics would also be very useful. (A sketch of the kind of XML retrieval expected follows this list.)

· A.Aimar finds how to display SLS information directly from all Sites, without using the SLS interface, for July's F2F Meeting, and also which metrics Sites are currently displaying.

Done. Examples from A.Di Girolamo show the same metric for all Sites aggregated in a single web page.

· M.Schulz should report on the status of the glExec patch for passing the environment.

Done in this meeting.
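As referenced in the tape-metrics action above, here is a minimal sketch of how per-site metrics published as SLS-style XML could be fetched and aggregated into a single view, bypassing the SLS web interface. The per-site URLs, the XML element names and the metric name are illustrative assumptions; the actual SLS schema and site endpoints may differ.

```python
# Minimal sketch, assuming each site publishes an SLS-style XML status page.
# URLs, element names and the metric name are illustrative assumptions.
import urllib.request
import xml.etree.ElementTree as ET

SITE_URLS = {
    "SITE-A": "http://example-t1-a.org/tape_metrics.xml",   # hypothetical endpoints
    "SITE-B": "http://example-t1-b.org/tape_metrics.xml",
}

def read_metric(url, metric_name):
    """Return the numeric value of one named metric from an SLS-like XML page."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    # Assumed layout: <data><numericvalue name="...">value</numericvalue></data>
    for nv in root.iter("numericvalue"):
        if nv.get("name") == metric_name:
            return float(nv.text)
    return None

def aggregate(metric_name="tape_write_rate_MBps"):   # hypothetical metric name
    """Print one metric for every site on a single page."""
    for site, url in SITE_URLS.items():
        try:
            value = read_metric(url, metric_name)
        except Exception as exc:                      # site unreachable or malformed XML
            print(f"{site:10s}  error: {exc}")
            continue
        print(f"{site:10s}  {metric_name} = {value}")

if __name__ == "__main__":
    aggregate()
```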
3. LCG Operations Weekly Report (Slides) – H.Renshall
Summary of the status and progress of LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1 Summary

It was a quiet week again, with decreasing participation in the daily meetings (see slide 3). Hopefully this is because of the holiday season and not because Sites are only interested during short test periods like the STEP09 tests.

R.Tafirout noted that for TRIUMF the meetings take place at 6:00 AM and, as agreed with J.Shiers, they only participate when it is really necessary.

There were no alarm tickets this week, but a few incidents leading to SIRs:
- ATLAS post-mortem on PVSS/COOL
- FZK posted a post-mortem explaining their tape problems during STEP09
- RAL scheduled downtime for the move to the new Data Centre
- ASGC seems to be recovering

Slide 5 shows an example of the typical tickets for a VO (in this case LHCb):
- Jobs failed or aborted at Tier-2
- gLite WMS issues at Tier-1 (temporary)
- Data transfers to Tier-1 failing (disk full)
- Software area files owned by root
- CE marked down but accepting jobs

Slide 6 shows the availability plots; one can see that RAL is red for CMS, white for LHCb and green for ATLAS and ALICE.

3.2 Service Incident Reports

PVSS Incident (ATLAS post-mortem, slides 7-9)
On Sunday afternoon 27-6 V.Khomutnikov from ATLAS reported to the Physics DB service that the online reconstruction was stopped because an error was returned by the PVSS2COOL application (on the ATLAS offline DB). The error had started appearing on Saturday (26-6) evening.

FZK tape problems during STEP09 (slide 10)
Before STEP09:
- An update to fix a minor problem in the tape library manager resulted in stability problems
- Possible causes: SAN or library configuration
- Both were tried and the problem disappeared, but which one was the root cause is unknown
- The second SAN had reduced connectivity to the dCache pools: not enough for CMS and ATLAS at the same time, so CMS was asked not to use tape
First week of STEP09:
- Many problems: hardware (disk, library, tape drives) and software (TSM)
Second week of STEP09:
- Adding two more dedicated stager hosts resulted in better stability
- Finally getting stable rates of 100-150 MB/s

A.Heiss added that FZK will repeat the STEP09 tests in agreement with ATLAS and CMS. The intervention on tapes was supposed to be transparent and the Experiments were informed.

I.Bird replied that the changes were never announced to the WLCG and they should have been. It was agreed that all interventions should be announced and reported to the whole WLCG project. Announcing them to the Technical Advisory Board at FZK is not enough and is not what was agreed.

I.Bird, J.Shiers and A.Heiss agreed to clarify the issue after the meeting.

RAL scheduled downtime for DC move (slide 11)
On Friday 3/7 RAL reported they were still on schedule for restoring CASTOR and Batch on Monday 6/7. Despite presumably hectic activity with equipment movements, RAL continued to attend the daily conference call. Planning and detailed progress are reported at: http://www.gridpp.rl.ac.uk/blog/category/r89-migration

ASGC instabilities (slide 12)
ATLAS reported instabilities at the beginning of the week. CMS allowed the full week grace period for ASGC to recover from all its problems. Both the ATLAS and CMS specific site tests changed from red to green during the week. On Friday 3/7 Qin Gang reported that the tape drives and servers are online.
4. Update on the HEP SSC Preparation (Prelim. Call Info.; Slides; Document) – J.Shiers
J.Shiers reported on the workshop in Paris about the preparation of the HEP SSC proposal.

Details are at: https://twiki.cern.ch/twiki/bin/view/LCG/HEPSSCPreparationWiki

Other material is also available:
- The EGI_DS "Blueprint" document describes the potential role of "Specialised Support Centres"
- Within the context of EGEE NA4, several preparation meetings have been held, most recently in May in Athens and in July in Paris. See Indico for agendas and presentations
- In June there was an Information Day in Brussels which clarified the specific areas targeted by this call, as well as the possible funds
- More information on the "HEP SSC" was given at the recent OB meeting

4.1 Sections concerning WLCG

The sections of particular interest for the WLCG MB are:
- 1.2.1.1 "EGI", including the "generic" services and operation required by WLCG (e.g. GGUS, etc. – "the usual list")
- 1.2.1.2 Services for large existing multi-national communities
- The funding for 1.2.1.1 + 1.2.1.2 is EUR 25M; a joint proposal is expected
- Some people say/think that there is EUR 5M for 1.2.1.2 (AFAIK not written down anywhere) and that the EUR 5M should be shared with at least one other large community besides WLCG
- 1.2.3 "Virtual Research Communities" = EUR 23M
- Currently 2-3 "SSC" proposals are foreseen; ideally there should be one, but it is not converging:
  - P2: combining Astronomy & Astrophysics, Earth Science, and Fusion
  - P1: combining training, dissemination and business outreach
  - P0: combining the other scientific SSCs (high-energy physics, life science, computational chemistry and material science, grid observatory, and complex systems)

The stated plan for the "HEP SSC" is a EUR 10M project over 3 years, with 50% of the funding coming from the EU, dependent on details such as the exact scope, partners, etc.

There are also other possible areas of funding, e.g.:
- 1.2.1.3 middleware (a separate, important topic, not covered in this talk)
- Others: probably too much fragmentation; the focus is on the above 2 (3) areas

Obviously, what we target in the sum of all 3 areas should be consistent and meet our global needs.

The workshop in Paris confirmed the bottom-up approach, i.e. collecting needs from the HEP community and partners. The FP7 information day has helped to clarify how much funding might be available and for which specific purposes.

4.2 WLCG Input Required and Timeline

WLCG must provide input on its needs for each section:
- 1.2.1.2: Services that we should target
- 1.2.3: Goals and work plan of a "HEP SSC". It is clear that this is "more than LHC", "more than HEP" – the exact scope still has to be defined urgently.

Proposed timeline: a first draft prior to a "meeting" during next week's MB slot.
- Before end July: input needed from the LHC experiments
- Second half of August: concentrated proposal writing
- September: reviews and revisions

I.Bird noted that WLCG should specify everything that is needed in the proposals, not relying on other external proposals. Evolution of the current services may imply some development.

J.Gordon asked whether 1.2.1.1 and 1.2.1.2 are in a single proposal and J.Shiers replied positively.

4.3 Proposal Writing Timetable

Into this schedule we must also fit:
- 1-2 meetings with the EU commission (in Brussels?)
- Several SSC and other preparation meetings in a wider context
- Vacations
- "SEPT'09", or whatever the future tests will be called
- LHC restart preparations

4.4 EGI_DS Status of the Document

Application and Community Support chapter: the current draft is 0.7. The plan is to incorporate comments and corrections from yesterday's meetings, plus further input from the Application Communities, well before the end of July. Versions 0.8 and 0.9 will be needed. A further revision, 1.0, will be made early September for consistency with the draft "EGI" proposals.

Specific changes include:
- Manpower required; size of community affected; specific call areas targeted (e.g. 1.2.1.2, 1.2.3)
- Input on services from large communities, in particular WLCG, plus others
- "SSC" input from HEP (other than WLCG) and other communities

The timescale for writing the proposals is very tight. The transition document should be a document in itself, not cut and paste from a document with a different purpose.

The target is the last two weeks of August for intensive writing of the (1.2.1.1), 1.2.1.2 and 1.2.3 sub-proposals. September will be used for multiple reviews prior to public presentation during EGEE'09: two 2-hour sessions for the "HEP SSC" – more than HEP, more than an SSC. For WLCG, September may well include a "STEP'09" rerun followed by, or overlapping with, preparations for LHC restart and data taking. It is imperative to respect this timescale and not over-commit.

J.Gordon noted that there were partners for the other areas, but asked who the partners in the HEP community are. J.Shiers replied that 3 partners are needed in addition to CERN. INFN and FZK have participated in the meetings. INFN found the past model of collaboration a good setup.

4.5 EGEE09 Sessions

There will be dedicated meetings during the EGEE09 conference:
- Tuesday 22:
- Wednesday 23:
- Thursday 24:
- Friday 25:
- A small F2F is suggested both Thursday and Friday pm

J.Shiers added that for the moment all Operations and Middleware Releases are also included. Whether they will be in this proposal or another remains to be seen, but it is important to prepare the full picture of what is needed and then see how it is split into separate proposals if more appropriate.

M.Kasemann asked whether general services and Experiment-specific services will be included. For instance FTS is a basic service and PhEDEx is built on top. Are they both included?

I.Bird replied that the proposal is about a thin layer of general services. Services like FTS are in a middleware call; if they are not, they will be included in this call. One should find a way to include the Experiments' services too.

O.Smirnova noted that innovation is an important quality of the proposals.

I.Bird replied that the innovation will be a clear product of the proposal: the support of a European project of the size of the WLCG is already innovative in itself.
5. CMS QR Report 2009Q2 (updated slides) – M.Kasemann
M.Kasemann presented the 2009Q2 quarterly report for CMS.

5.1 Tier-1 and Tier-2 Sites Readiness

Site readiness is closely monitored for all Tier-1 and most Tier-2 sites. The tools were finalized early in 2009, with reports and follow-up during the weekly Facility Operations meetings. This quarter there were additional meetings to focus on the Asian, Russian and Turkish sites.

Substantial improvement is observed for a large number of sites. Sites that were below 60% in March and improved strongly until June are now:
- > 80% ready: BR-UERJ, KR-KNU, US-Caltech, ES-IFCA, AT-Vienna, UK-London-IC, RU-ITEP, UK-Bristol
- > 60% ready: IN-TIFR, IT-Rome, RU-JINR, TR-Metu, RU-SINP

The CMS Site Readiness web page is here: https://twiki.cern.ch/twiki/bin/view/CMS/PADASiteCommissioning#ScMon

The plots show the Tier-1 Sites readiness over one month in June 2009, compared to March 2009, and the Tier-2 Sites in June 2009 (36 sites > 80%, 10 sites < 60%) versus March 2009 (26 sites > 80%, 21 sites < 60%).

The averages and the "jitter" in these plots are:
- Tier-1: the average is 5 (+2 or -1), the spread is 3, i.e. about 60%
- Tier-2: the average is 43, the spread is 6, and more sites will get ready

This is still not production quality.

5.2 STEP09 for CMS

The CMS emphasis was on:
- Tier-0: data recording in parallel with the other experiments
- Tier-1: tape access, testing pre-staging and processing simultaneously
- Tier-2: analysis at Tier-2, demonstrating the ability to use 50% of the pledged resources with analysis jobs

Data transfers:
- Tier-1 → Tier-1: replicate 50 TB (AOD synchronization) between all Tier-1s
- Tier-1 → Tier-2: stress the Tier-1 tapes, measure the latency in transfers to Tier-2

The final report will be given during the WLCG STEP09 post-mortem workshop on 10/11 July.
Tier-0 Tape Writing
The target of 500 MB/s was exceeded in both testing periods.
- The structure in the first period was due to problems in the disk pool management
- Monitoring of the tape writing and reading rates per VO can be improved
CMS had 2 weeks of tests, one overlapping with ATLAS' activity.

Tier-1 Tape Writing
For the reprocessing of MC, the required tape read rate of 50-250 MB/s was tested, calculated according to the amount of data to be stored at each Tier-1 centre. Overlapping tests with ATLAS were performed at some Tier-1 centres.

Preliminary results at the Tier-1 centres are being studied. Some sites met the metrics every day of the test; other Tier-1s met the metrics approximately 75% of the time. At some centres, for instance FZK, the configuration and the overall stability have to improve and the tests have to be repeated. The bottlenecks were generally in the underlying tape systems and not in the ability of CMS to request data staging. CMS is carefully checking with the sites the implementation of tape families at each Tier-1 centre, which tends to concentrate data needed together on the same physical tapes. Additionally, CMS will be more actively managing the files expected to be on disk.
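The required per-site tape read rate mentioned above scales with each centre's share of the data and the time window available for reprocessing. Below is a minimal sketch of that kind of calculation; the total volume, site shares and reprocessing window are made-up illustrative numbers, not CMS planning figures.

```python
# Minimal sketch of a per-site tape read rate estimate.
# All numbers below are illustrative assumptions, not CMS planning figures.

TOTAL_DATA_TB = 1000          # total dataset to re-read from tape (assumed)
REPROCESS_DAYS = 14           # reprocessing window (assumed)

# Hypothetical fraction of the data custodially stored at each Tier-1.
SITE_SHARE = {
    "Tier1-A": 0.25,
    "Tier1-B": 0.15,
    "Tier1-C": 0.10,
}

def required_rate_mb_per_s(share, total_tb=TOTAL_DATA_TB, days=REPROCESS_DAYS):
    """Tape read rate needed to re-read this site's share within the window."""
    bytes_total = share * total_tb * 1e12          # TB -> bytes (decimal units)
    seconds = days * 24 * 3600
    return bytes_total / seconds / 1e6             # bytes/s -> MB/s

for site, share in SITE_SHARE.items():
    print(f"{site}: ~{required_rate_mb_per_s(share):.0f} MB/s")
```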
Data Operations: Tier-1 Staging and Re-processing Exercise
Rolling re-reconstruction:
- Pre-stage one day's worth of data and process it the next day
- Minimize disk consumption
- Maximize CPU efficiency, because the input is on disk

Pre-staging was used for the first time in this planned and organized manner, with very good performance of all sites under multi-VO load. The CPU efficiency with and without pre-staging was measured and is to be followed up, but not yet for all Sites. At some Sites pre-staging did not provide any improvement. (An illustrative sketch of the rolling schedule follows.)
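A minimal sketch of the rolling pattern described above, assuming the input is split into one-day blocks: on each day the next day's block is pre-staged from tape while the block staged the day before is processed from disk. The block names and day count are illustrative assumptions.

```python
# Minimal sketch of the "pre-stage one day ahead, process the next day" pattern.
# Block names and the number of days are illustrative assumptions.

DAILY_BLOCKS = ["day1_data", "day2_data", "day3_data", "day4_data"]

def prestage(block):
    print(f"  pre-staging {block} from tape to disk")

def process(block):
    print(f"  processing {block} from disk (CPU-efficient, input already staged)")

def release(block):
    print(f"  releasing disk space used by {block}")

# Day 0: only pre-stage the first block; afterwards, staging and processing overlap.
print("Day 0:")
prestage(DAILY_BLOCKS[0])
for day, block in enumerate(DAILY_BLOCKS, start=1):
    print(f"Day {day}:")
    if day < len(DAILY_BLOCKS):
        prestage(DAILY_BLOCKS[day])   # stage tomorrow's input while the CPUs are busy
    process(block)                    # today's input is already on disk
    release(block)                    # keep disk consumption to ~1 day of data
```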
Tier-1 Transfer Tests
The emphasis was on Tier-1 → Tier-1 transfer exercises:
- Use the AOD synchronization between Tier-1 sites after re-reconstruction
- Synchronize 50 TB of data between the Tier-1 sites, in two tries

All sites participated. The transfer test was completed satisfactorily and the links provided good rates between all sites.

Analysis Tests at 49 Tier-2 + 8 Tier-3 Sites
The test was to measure the percentage of the analysis pledge used with standard analysis jobs: reading data, with no stage-out to other Tier-2s. CMS was capable of filling the majority of sites at their pledges, or above; in aggregate they used more than the analysis pledge. Roughly 80% was the success rate, and 90% of the failures were read errors. A total of almost a million jobs was run.

5.3 Preparation for Data Taking

Computing shifts will start for the CRAFT cosmics running (July 22), with presence at CERN, FNAL or another CMS centre required. For a description see: https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts

5.4 SL5 Migration

The GDB recommends starting the migration after STEP09. CMS is recommending to all CMS sites to migrate as soon as possible, with a proposed migration deadline of September 1st, 2009. The current status is:
- 6 Tier-1 Sites will be done by September
- IN2P3: 50% by September, 100% by the end of 2009
- 25 Tier-2s will be done by September, 6 will not; an answer is pending from the rest

F.Hernandez added that IN2P3 will provide by September a test setup with several CEs for the Experiments to test and approve before installation. After this, during September, all resources will be migrated.

J.Gordon added that most Tier-1 Sites are doing the same: first a test set for the Experiments and then the whole migration.

Coordinated by Facility Operations, site polling (https://twiki.cern.ch/twiki/bin/view/CMS/Poll-Tier-1Tier-2-SLC5), together with Offline for software validation, is checking that Sites migrate to SL5.

CMS SLC5 migration plan at CERN:
- Migrate 10% by today/tomorrow, then check by the CMS Tier-0 teams
- If OK, migrate the rest of the CMS Tier-0 + CAF by July 19

CMS SL5 migration documentation is here: https://twiki.cern.ch/twiki/bin/view/CMS/SLC5Migration

5.5 Future Tests and Production

As a STEP09 follow-up there will be targeted tests at some Tier-1 Sites to verify that the problems are solved, if needed together with ATLAS.

An analysis end-to-end test is planned. Its goals are:
- For computing: verify that all processing steps are "luminosity-calculation safe", i.e. no unaccounted loss of events
- These are a series of functional tests at Tier-0, Tier-1 and Tier-2

Major MC production for 2009 LHC data is about to start:
- The CMSSW release is expected in a few days
- Four weeks are planned for validation
- An initial sample of ~200M events is required for the 2009 analysis
- The plan is to finish production in September 2009

5.6 Summary

2-day CMS Global Runs have been performed since March, about every week, with the long cosmics run starting July 22, and Monte Carlo production continuing at a slower rate all the time.

STEP09 was a valuable exercise with many tests overlapping with ATLAS and others. More information will be given at the WLCG workshop on July 9-10.

A big improvement was observed in the stability and readiness of the Tier-2 sites. The Tier-1 sites need to finish their upgrades and need to show stability; more specific tests will be performed where needed. An analysis end-to-end test is planned in late summer. A large MC production is prepared and will start in a few days with the new version of CMSSW.
6. GlExec Update (Slides) – M.Schulz
Pilot services have seen first tests by LHCb and ATLAS at NIKHEF and Lancaster. The problem with expired VOMS attributes has been solved on the pilot.

LHCb reported first tests of the environment conservation scripts; they seem to work, and the integration with the DIRAC framework has started. The latest gLite production release (update 49) includes it: http://glite.web.cern.ch/glite/packages/R3.1/updates.asp
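For context on what "environment conservation" means here: glExec starts the payload under a different identity with a stripped environment, so wrapper scripts typically save the pilot's environment before the identity switch and restore it on the payload side. The sketch below illustrates that idea only; it is not the actual gLite/LHCb scripts, and the file location and variable filtering are assumptions.

```python
# Illustrative sketch of saving and restoring an environment across a glExec call.
# Not the actual gLite/LHCb wrapper scripts; paths and filtering are assumptions.
import json
import os
import subprocess

ENV_DUMP = "/tmp/pilot_env.json"   # must be readable by the payload account (assumption)

def save_environment():
    """Pilot side: dump the variables the payload will need."""
    keep = {k: v for k, v in os.environ.items()
            if not k.startswith(("GLEXEC", "X509_USER_PROXY"))}  # drop pilot credentials
    with open(ENV_DUMP, "w") as f:
        json.dump(keep, f)
    os.chmod(ENV_DUMP, 0o644)

def restore_and_run(payload_cmd):
    """Payload side: reload the saved environment, then run the real job."""
    with open(ENV_DUMP) as f:
        saved = json.load(f)
    env = dict(os.environ)
    env.update(saved)
    return subprocess.call(payload_cmd, env=env)

if __name__ == "__main__":
    # The pilot would call save_environment(), then invoke glExec on a small wrapper
    # that calls restore_and_run(["python", "payload.py"]) under the user identity.
    save_environment()
    print("environment saved to", ENV_DUMP)
```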
SCAS and glExec have been released to production, with known issues, and the script is still in certification. The known issues below are addressed in patches 3084 and 3050; both are ready for certification:
- The user ban plugin in LCAS will not work
- Malformed proxies crash glExec
- 5 or more proxy delegations make the VOMS API segfault

The wrapper has been given to OSG for inspection and feedback is expected next week. Tests with the production infrastructure are still needed; the pilot is available.

LHCb reports that the configuration of SCAS and glExec is complex and error prone. SA3 is following up on this.

There is a glExec-WN packaging problem: for sites that install gLite on shared file systems, glExec has to be provided independently. There was a phone conference on July 6th and there is general consensus (on 2.5 routes), but no timeline up to now. M.Litmaath can provide more details if needed.

Integration with ARGUS, the new authorization framework, will also be tested but is a longer-term goal.
7. AOB
It is not certain that there will be an August F2F MB meeting; it depends on whether the August GDB will take place. Most likely not.
8. Summary of New Actions