LCG Management Board

Date/Time: Tuesday 4 March 2008, 16:00-18:00 – F2F Meeting

Agenda:

Members:

(Version 1 - 7.3.2008)

Participants: A.Aimar (notes), L.Betev, I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, M.Lamanna, S.Lin, E.Laure, U.Marconi, H.Marten, P.Mato, G.Merino, A.Pace, Di Qing, M.Schulz, J.Shiers, R.Tafirout, J.Templon

Action List:

Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting: Tuesday 11 March 2008, 16:00-17:00 – Phone Meeting

1. Minutes and Matters arising (Minutes)

1.1 Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

2. Action List Review (List of actions)

Actions that are late are highlighted in RED.
Done. The MB Members agreed that a milestone should be added in order to have metrics about tape efficiency at all Tier-1 sites. In order to be ready for CCRC08-May, the metrics should be in place by the end of April 2008. The metrics to produce are the ones in the Tape Efficiency Wiki, or equivalent; if sites can provide more metrics, even better. A new milestone was added to the HLM dashboard.

T.Cass queried the comment from Brookhaven that none of the metrics were applicable, particularly as some data had already been provided. His understanding was simply that the value for the total data rate is slightly different because the statistics from HPSS do not include the time for tape mounting. This should be confirmed by Michael Ernst.

F.Hernandez proposed that sites that cannot produce the proposed metrics should be allowed to define equivalent metrics. A new action was set for the sites to define, within two weeks, their tape efficiency metrics if needed.

New Action: 18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

Only received by ASGC. A.Aimar will ask for an explicit confirmation from the Sites and Experiments.

Not done. The request was made to GridView, but it still needs to be implemented.

On the way. Being verified.

3. CCRC08 Update (Draft LCG Services QR; Slides) - J.Shiers
J.Shiers presented the weekly update on the CCRC08 activities. This was not a full CCRC08-Feb summary; a comprehensive summary will be prepared and presented in future meetings.

3.1 Summary

In the last week of February, CCRC08 progressed without major problems. Successful combined data exports ran for several days at rates of 1-2 GB/s; CMS alone has run close to 1 GB/s. All the CCRC08 work continued in parallel with many other activities. People, although busy with other activities and meetings, were also able to “run the challenge” for extended periods. Moving to a disciplined and controlled run is what CCRC08 is successfully fostering.

3.2 Concerns

It is still hard to get open and timely analysis of some problems. Clear explanations are really important; everybody should try to do better:
- The on-call and expert call-out is still not well advertised.
- There is a standing request to extend expert call-out also to the Tier-1s, and they should update their site emergency contact details (see action list).
- The use of e-logbooks should increase. There is still relatively little input from the sites, whereas it would be important to get the complete picture of the activities.
- The attendance at the (quasi-)daily con-calls is sometimes inadequate. The Experiments are always represented (often by EIS members), but relatively few Sites attend.

3.3 Looking Ahead

CCRC’08 has been a very useful exercise that has resulted in further hardening of the WLCG production services. There is still much to do, some for May and some for beyond. WLCG has shown it can run production services with a reasonable load on the people involved, but more load is expected in the future.

The next steps for CCRC08 are:
- April F2Fs: agree on middleware and storage-ware for May
- April Collaboration workshop: the discussion is available on the workshop agenda
- June “post-mortem” workshop: both the IT Amphitheater and the Council Chamber are booked for the moment
3.4 Considerations on WLCG and EGI

The EGEE infrastructure (among others) has been a big success in preparing and running the WLCG CCRC08 challenges. The EGI workshop in Rome is under preparation; an analysis of CCRC08 is useful to foresee future scenarios in which WLCG would be supported by the EGI infrastructure.

Could CCRC’08 have run successfully in the EGI environment? The clear answer to this question today is no. This is an important issue that WLCG will need to address with priority, so that it is no longer the case before 2010. Many problems have been solved by being near to the EGEE operations and deployment teams; a remote organization requires even better communication channels and procedures.

What needs to be in place, and by when, to ensure that CCRC’10 – and 2010 data taking and production processing – is successful? The answer to this second question has significant relevance to the EGI_DS project, as well as to the transition to and start-up of EGI.

J.Gordon noted that the information actually comes from all over EGEE to the teams at CERN; therefore the communication is already working quite effectively.

I.Bird noted that it took 4 years of EGEE to get this to work effectively.

J.Shiers added that in the future these shortcuts will not be possible. In 2010 the yet-to-start EGEE III project will terminate, and one cannot assume that there will be an EGEE IV project. WLCG must plan for the different possible scenarios. In addition, EGEE III funded people, those who started in EGEE, must leave CERN if not granted an indefinite contract. The infrastructure in place now must be the same infrastructure as will be used for 2010 data taking; no other infrastructure can be in place by late 2009. The minimal requirements are basically as outlined in the EGEE III operations manpower plan, plus also:
- Middleware development: reasonable stability for what WLCG needs should have been attained by this time
- Storage-ware: likely to still be (smoothly) evolving
- LCG Operations and Application/User support: outside the scope of EGI and continues as now (GDB, WLCG section of the operations meetings, WLCG Collaboration and other workshops, WLCG procedures, etc.)

Important features to be planned at the same time are start-up funding for the NGIs, migration scenarios, a guarantee of non-disruptive evolution, etc.

I.Bird noted that if the NGIs do not agree to support a single set of middleware it will be a major issue to share the resources. WLCG should clarify what is needed to progress smoothly.

J.Gordon added that the sites should agree to run several middleware stacks so that WLCG can specify the one needed.

I.Bird noted that by mid-2008 it should be clear to WLCG whether EGI is being designed in a way that can support WLCG. Maybe this could be discussed at the April workshop (http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=6552). In addition, the Tier-2 situation should also be presented and discussed.

4. LHCb QR Report (Slides) - Ph.Charpentier
Ph.Charpentier presented the status and progress of LHCb during the last quarter.

4.1 Activities since November ’07

LHCb completed the testing of their Core Software and of the Application Area packages, mainly:
- Test of the latest release of ROOT (5.18)
- Certification of LCG 54

The production activities performed were:
- Simulation continued at a low pace (few physics requests).
- Stripping of MC signal data. It took place at most Tier-1s, with some problems with data access and availability.
- Preparation of data sets for CCRC’08, now using 1.5 GB files of “RAW data” built from MC files (grouping 100 input files).

The LHCb Core Computing focused on the development of DIRAC3 for CCRC’08:
- Re-engineering of the whole of DIRAC (WMS and DMS)
- SRM v2.2 now used through the gfal python API
- gLite WMS usage

4.2 Sites Configuration

LHCb is in contact with the Sites to deploy the LFC mirror now:
- DB replication using 3D from CERN to all Tier-1s. In place for 6 months.
- LFC service for scalability and redundancy. In production at CNAF, RAL and IN2P3 (GridKa coming). Missing at PIC and SARA.

Site SE migration was in progress during the quarter and was very manpower intensive:
- RAL (dCache to Castor2)
- PIC (Castor1 to dCache for T1D0)
- CNAF (Castor2 to StoRM for TxD1)

4.3 DIRAC3 for CCRC08

DIRAC3 was being commissioned during the quarter:
- Most components are ready, fully integrated and tested
- Basic functionality (equivalent to DIRAC2) is already available

Two weeks ago LHCb had a full rehearsal week: all developers came to CERN in order to follow the progress of the challenge and fix problems as quickly as possible.

DIRAC3 planning (as defined on 15 Nov 2007):
- 30 Nov 2007: Basic functionality
- 15 Dec 2007: Production Management, start tests
- 15 Jan 2008: Full CCRC functionality, tests start
- 5 Feb 2008: Start tests for CCRC phase 1
- 18 Feb 2008: Run CCRC
- 31 Mar 2008: Full functionality, ready for CCRC phase 2 tests

The current status is on time with the above schedule, but one has to note that several features (e.g. SRM-related) still have to be clarified.
4.4 LHCb Activities during CCRC

Raw data upload, Online to Tier-0 storage (CERN Castor):
- Use of the DIRAC transfer framework. Two transfer tools are exercised (Castor rfcp, GridFTP).

Raw data distribution to Tier-1s:
- The LHCb Tier-1 sites are: CNAF, GridKa, IN2P3, NIKHEF, PIC, RAL
- Use of the gLite File Transfer System (FTS), based on SRM v2.2
- Share according to the resource pledges from the sites

Data reconstruction at Tier-0+1:
- Production of RDST, stored locally (using SRM v2.2)
- Data access also using SRM v2 (various storage back-ends: Castor and dCache)

For May, stripping of reconstructed data:
- Initially foreseen in February, but de-scoped
- Distribution of streamed DSTs to Tier-1s
- If possible, include file merging

Data sharing is split according to the Tier-1 pledges (as of February 15th); a sketch of a pledge-proportional split follows at the end of this subsection. The LHCb SRM v2.2 space token descriptions are:
- LHCb_RAW (T1D0)
- LHCb_RDST (T1D0)
- LHCb_M-DST (T1D1) – not needed for the February CCRC (no stripping)
- LHCb_DST (T0D1) – not at CERN
- LHCb_FAILOVER (T0D1)

The FAILOVER space token is used for temporary upload in case of destination unavailability. All data can be scrapped after the challenge:
- SRM bulk removal was already tested during the challenge.
- Based on a 2-week run there are 28,000 files (42 TB).

The plans for CCRC’08 in May are:
- 4 weeks of continuous running
- Established services and procedures
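To illustrate the pledge-proportional sharing mentioned above, the following is a minimal Python sketch (not the actual DIRAC code) that assigns files to Tier-1 sites with probability proportional to their pledge; the pledge figures are hypothetical placeholders, not the February 15th values.

    import random

    # Hypothetical Tier-1 pledge figures (arbitrary units), not the real values.
    PLEDGES = {"CNAF": 300, "GridKa": 400, "IN2P3": 450,
               "NIKHEF": 250, "PIC": 150, "RAL": 350}

    def choose_destination(pledges):
        """Pick a Tier-1 site with probability proportional to its pledge."""
        total = float(sum(pledges.values()))
        r = random.uniform(0.0, total)
        acc = 0.0
        for site, weight in pledges.items():
            acc += weight
            if r <= acc:
                return site
        return site  # fall-through guard against rounding effects

    # Distribute 1000 RAW files and show the resulting share per site.
    counts = {}
    for _ in range(1000):
        dest = choose_destination(PLEDGES)
        counts[dest] = counts.get(dest, 0) + 1
    print(counts)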
4.5 Data from Pit 8 to CASTOR

During the first weeks of February there was a continuous transfer at low rate. The network-utilization plot (shown in the slides) covers 12 hours; in green are the migration and the transfer to the Tier-1 sites. The peaks are due to the CASTOR storage patterns. Since 18 February the transfer has run at the nominal rate (70 MB/s) with a ~50% duty cycle. The migration starts after the transfers and overlaps with the move of the data to the Tier-1 sites. A second plot in the slides shows the constant increase of migrated files.

4.6 Tier-0 to Tier-1 Transfers

Transfers are taking place to all 6 Tier-1 sites:
- Sharing according to pledges works well.
- Some backlog effects were observed, even at low rate; files are only transferred after a successful migration and CRC check.
- Some problems were seen at IN2P3 (dCache configuration): “space full” even on T1D0, bug reported to dCache.

File removal was also tested: SRM v2 removal works, but the space is not recovered (this is a bug).

Every two hours the files were transferred to all 6 LHCb Tier-1 sites; the slides also show the daily average over 10 days in February.

4.7 Tier-0 and Tier-1 Reconstruction

Reconstruction now uses the new DIRAC3 WMS, which uses the gLite WMS for launching pilot jobs; it is used also at Tier-0.
- SRM v2.2 is used for file access (srmPrepareToGet). Reminder: data access from the disk servers uses rootd, rfio or gsidcap.

The start was slow because of debugging, but jobs are now submitted steadily and are running at all sites. The main open issue is that dCache sites were resetting the gsidcap ports and this was not caught properly by ROOT: ROOT believes EOF has been reached, so the job terminates successfully even though not all events have actually been processed.

4.8 Summary

The last quarter was mainly devoted to the development and testing of DIRAC3. The simulation, reconstruction and stripping activities are ongoing at a low pace, still using DIRAC2.
Analysis (using Ganga + DIRAC2) is also ongoing at most sites:
- Distribution of analysis is limited due to the disk crisis at most sites
- Most stripped data are at CERN only

CCRC’08 is now running well:
- Next week: steady processing
- March: introduce more complex workflows (stripping)

The next steps are:
- Fully commission DIRAC3 for simulation and analysis
- Get ready for 4 weeks of steady running at nominal rate in May

LHCb would like to include analysis using generic pilot jobs as soon as possible: LHCb needs the Pilot Jobs approved, and is ready as soon as they are approved by the ad-hoc working group and gLExec is deployed at all sites.

5. SAM Tests Update (Slides) - M.Schulz
M.Schulz presented a summary of the status of the SAM Availability tests and how the calculations are performed.

5.1 SAM Data Management Tests

There are two kinds of SE tests: central and distributed.

Central tests execute “put, get and delete” operations:
- Tests are run from the CERN SAM UI
- All SEs are tested
- All tests run via lcg-util commands

Lcg-util is used in its default configuration, therefore the SRM-V1 interface is tested:
- lcg-cr
- lcg-cp
- lcg-del

Distributed tests:
- Same tests, but only the “Default-SE” is tested
- Some sites configured this for OPS to point to a dedicated Classic SE

In summary: the SAM tests do NOT test SRM-V2 interfaces!

LCG-util clients initially had no command-line options for selecting protocols. Now they have:
- No option: support for legacy clients (SRM-v1)
- -D: indicate the preferred version (defaults back to v1)
- -T, -U: specify mandatory versions for source and destination
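To make the protocol selection concrete, here is a minimal sketch of what a put/get/delete probe could look like when the SRM-v2 interface is requested with the -D option summarized above. The SE host, file names and exact option usage are illustrative assumptions (options may vary between lcg-util commands); this is not the actual SAM probe code.

    import subprocess
    import uuid

    SE_HOST = "se.example-site.org"                       # hypothetical default SE
    LFN = "lfn:/grid/ops/sam-dm-test-%s" % uuid.uuid4()   # unique test entry
    LOCAL = "file:/tmp/sam-dm-test-file"                  # local file to upload

    def run(cmd):
        print("+ " + " ".join(cmd))
        return subprocess.call(cmd) == 0

    ok = (run(["lcg-cr", "-v", "-D", "srmv2", "-d", SE_HOST, "-l", LFN, LOCAL])   # put
          and run(["lcg-cp", "-v", "-D", "srmv2", LFN, "file:/tmp/sam-dm-copy"])  # get
          and run(["lcg-del", "-v", "-D", "srmv2", "-a", LFN]))                   # delete
    print("SE test " + ("PASSED" if ok else "FAILED"))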
Using a classic SE as the default SE for the OPS VO:
- Site admins have a way to get an orthogonal, independent set of tests
- Tests verify different services at the same time: SE, InfoSystem, Catalogue, WN configuration

In summary: relevant aspects are missing from the current tests. What should be done?

5.2 Short and Long Term Actions
The short-term actions proposed are:
- Modify the tests to test both SRM-V1 and SRM-V2 (SAM team)
- OPS should use the same SE as the Experiments, not special dedicated SEs (Site admins)

L.Dell’Agnello noted that the Experiments use different SEs; which SE should OPS test?

M.Schulz agreed that with several VOs it is difficult to test them all with the OPS tests. One will have to configure the VO-tests to check some VO-specific SEs. Or at least sites should use the same technology as the one used by the real VOs; for instance, do not use a classic SE for OPS while the HEP VOs use CASTOR in reality. In that case OPS should test CASTOR too.

I.Bird added that the OPS tests are still needed until the VO-specific tests can adequately replace them. OPS tests have been used for the reporting for more than a year; changing them should be done in a managed way or it will artificially modify the validity over time of the data collected.
The longer-term actions proposed are:
- Better interpretation of test results
- Define a dependency matrix among tests
- Use, in addition, “static” file-based information
- Test the catalogues independently
- Test the accessibility of all SEs on a site in a lighter way
5.3 Availability and Reliability Calculations

The current algorithm is best understood from the document originally approved by the MB last year: https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf

Reliability, as defined at the 13 Feb 2007 LCG MB:
- Reliability = Availability / ScheduledAvailability
- ScheduledAvailability = 1 – ScheduledDownTime – UnknownInterval
- UnknownInterval = TimeWithUnknownResults / Time
- UnscheduledDownTime = 1 – Availability – ScheduledDownTime – UnknownInterval
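A small worked example of these formulas, using made-up numbers purely for illustration:

    # Worked example of the formulas above, with illustrative (not measured) numbers.
    availability       = 0.85   # fraction of time the site tested as available
    scheduled_downtime = 0.05   # fraction of time in scheduled downtime (GOCDB)
    unknown_interval   = 0.02   # TimeWithUnknownResults / Time

    scheduled_availability = 1 - scheduled_downtime - unknown_interval            # 0.93
    reliability = availability / scheduled_availability                           # ~0.914
    unscheduled_downtime = (1 - availability - scheduled_downtime
                            - unknown_interval)                                   # 0.08

    print("Reliability = %.3f" % reliability)
    print("Unscheduled downtime = %.3f" % unscheduled_downtime)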
The current algorithm differs with respect to:
a) Service status computation on a continuous time scale
b) Consideration of scheduled downtime
c) Handling of the UNKNOWN status
d) Validity of test results

A Service is up if at least one instance is available. A Site is up when all critical site services (CE, SE, etc.) are considered up; the lowest state wins (UP UP UP DOWN == DOWN). The Global services (WMS, LFC, etc.) are up if at least one instance is up somewhere.

A Service Instance can be in the following states:
- ScheduledDown (from the GOCDB; tests are ignored)
- DontCare (no critical tests defined)
- UP (all critical tests OK)
- DOWN (at least one critical test failed)
- UNKNOWN (all critical tests that have run are OK, but at least one result is not available)
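A minimal sketch of these aggregation rules (modelling only UP, UNKNOWN and DOWN, and ignoring ScheduledDown and DontCare for simplicity; the function names are illustrative, not GridView code):

    RANK = {"DOWN": 0, "UNKNOWN": 1, "UP": 2}   # "the lowest state wins"

    def service_status(instance_states):
        """A service is UP if at least one instance is UP (take the best instance state)."""
        return max(instance_states, key=lambda s: RANK[s])

    def site_status(critical_service_states):
        """A site takes the lowest state among its critical services (CE, SE, ...)."""
        return min(critical_service_states, key=lambda s: RANK[s])

    ce = service_status(["UP", "DOWN"])     # one CE instance up  -> UP
    se = service_status(["DOWN", "DOWN"])   # no SE instance up   -> DOWN
    print(site_status([ce, se]))            # UP + DOWN           -> DOWN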
The current critical tests can be browsed here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOSGNDGFProbes

There are different (but equivalent) tests for NDGF, EGEE/WLCG and OSG. It is not clear how much these tests reflect the current usage patterns and the middleware features.

The Data Management tests use the InfoSystem, WN configuration, LFC and SRM-V1, and they need to be able to run independently.

The Site BDII tests are being reworked; they check that the published quantities are correctly formed. Currently there are 70 top-level BDIIs; how many should be tested? Probably about 20 of them, to be selected.

I.Bird noted that the top-level BDIIs are not assigned to a site, but if a site points to a non-working top-level BDII it is a failure of that site. Sites should check the BDII they point to.

J.Gordon agreed that it is up to the site to point to a working BDII server and to verify it regularly.
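As a concrete illustration of such a regular check, the sketch below queries a top-level BDII with a trivial LDAP search. The hostname is a hypothetical placeholder (sites would use the server configured in LCG_GFAL_INFOSYS), and this is only an indicative probe, not an agreed SAM or Nagios test.

    import subprocess

    TOP_BDII = "top-bdii.example.org"   # hypothetical placeholder

    def bdii_responds(host, port=2170, timeout=30):
        """Return True if the top-level BDII answers a trivial base-scope LDAP query."""
        cmd = ["ldapsearch", "-x", "-LLL", "-l", str(timeout),
               "-H", "ldap://%s:%d" % (host, port),
               "-b", "o=grid", "-s", "base"]
        return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    print("top-level BDII OK" if bdii_responds(TOP_BDII) else "top-level BDII unreachable")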
M.Schulz suggested that new tests should be added, for now, outside the standard tests. But they should be calculated with the same algorithm in order to verify how well they represent the status of the site and how they compare to the current SAM tests.

I.Bird added that the VO-specific tests should again be checked and presented at the MB every month, as was done until December.

J.Templon supported the usage of VO-specific tests and added that they should then become part of the Nagios monitoring alarms at the sites. Sites should monitor the VO-specific tests via Nagios and immediately fix issues that make those tests fail.

6. Tape Efficiency (Tape Efficiency Wiki)

Already covered above, in the Action List review (Section 2).

7. AOB

No AOB.

8. Summary of New Actions

The full Action List, current and past items, will be in this wiki page before the next MB meeting.

New Action: 18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

New Milestone added to the HLM dashboard:
WLCG-08-02 | April 2008 | Tape Efficiency Metrics Published