LCG Management Board
Tuesday 4 March 16:00-18:00 – F2F Meeting
(Version 1 - 7.3.2008)
A.Aimar (notes), L.Betev, I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, M.Lamanna, S.Lin, E.Laure, U.Marconi, H.Marten, P.Mato, G.Merino, A.Pace, Di Qing, M.Schulz, J.Shiers, R.Tafirout, J.Templon
Mailing List Archive:
Tuesday 11 March 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
Done. The MB Members agreed that a milestone should be added in order to have metrics on tape efficiency at all Tier-1 sites. To be ready for CCRC08-May, the metrics should be in place by end of April 2008. The metrics to produce are those in the Tape Efficiency Wiki, or equivalent. If sites can provide additional metrics, even better.
New Milestone added to the HLM dashboard
T.Cass queried the comment from Brookhaven that none of the metrics were applicable, particularly as some data had already been provided. His understanding was simply that the value for the total data rate was slightly different as the statistics from HPSS do not include the time for tape mounting. This should be confirmed by Michael Ernst.
F.Hernandez proposed that sites that cannot produce the proposed metrics should be allowed to define equivalent metrics.
A new action was set for the sites to define, within 2 weeks, their tape efficiency metrics if needed.
18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.
Received only from ASGC. A.Aimar will ask for an explicit confirmation from Sites and Experiments.
Not done. Requested from GridView, but it still needs to be implemented.
On the way. Being verified.
3. CCRC08 Update (Draft LCG Services QR; Slides) - J.Shiers
J.Shiers presented the weekly update about the CCRC08 activities. This was not a full CCRC08-Feb summary. A comprehensive summary will be prepared and presented in future meetings.
In the last week of February, CCRC08 progressed without major problems. Successful combined data exports ran for several days at rates of 1-2 GB/s; CMS alone ran at close to 1 GB/s.
All the CCRC08 work continued in parallel with many other activities. People, although busy with other activities and meetings, were also able to “run the challenge” for extended periods. Moving to a disciplined, controlled run is what CCRC08 is successfully fostering.
It is still hard to get open and timely analysis of some problems. Clear explanations are really important; everybody should try to do better:
- The on-call and expert call-out is still not well advertised.
- There is the standing request to extend expert call-out also to Tier1s and they should update their site emergency contact details (see action list).
- The use of e-logbooks should increase. There is still relatively little input from the sites, yet it would be important to get a complete picture of the activities.
- The attendance at (quasi-)daily con-calls is sometimes inadequate. It is important to always have the Experiments represented (often by EIS members), but relatively few Sites attend.
3.3 Looking Ahead
CCRC’08 has been a very useful exercise that has resulted in further hardening of the WLCG production services. There is still much to do – some for May and some for beyond.
WLCG has shown it can run production services with a reasonable load on the people involved, but more load is expected in the future.
The next steps to face for the CCRC08 are:
- April F2Fs: agree on middleware and storage-ware for May
- Collaboration workshop: the discussion is available on this agenda:
- June “post-mortem” workshop: both the IT Amphitheatre and the Council Chamber are booked for the moment.
3.4 Considerations on WLCG and EGI
The EGEE infrastructure (among others) has been a big success in preparing and running the WLCG CCRC08 challenges.
The EGI workshop in Rome is under preparation; an analysis of CCRC08 is useful to foresee future scenarios in which WLCG would be supported by the EGI infrastructure.
Could CCRC’08 have run successfully in the EGI environment?
The clear answer to this question today is no. This is an important issue that WLCG will need to address with priority, so that it is no longer true before 2010. Many problems have been solved thanks to proximity to the EGEE operations and deployment teams. A remote organization requires even better communication channels and procedures.
What needs to be in place and by when to ensure that CCRC’10 – and 2010 data taking and production processing – is successful?
The answer to the second question has significant relevance to the EGI_DS project, as well as the transition to/start-up of EGI.
J.Gordon noted that the information is actually coming from all over EGEE to the teams at CERN. Therefore the communication is already working quite effectively.
I.Bird noted that this took 4 years of EGEE to get it to work effectively.
J.Shiers added that in the future these shortcuts will not be possible. In 2010 the yet-to-start EGEE III project will terminate, and one cannot assume that there will be an EGEE IV project. WLCG must plan for the different possible scenarios.
In addition, EGEE III-funded people who started in EGEE must leave CERN if not granted an indefinite contract. The current infrastructure must be the same infrastructure used for 2010 data taking; no other infrastructure can be in place by late 2009.
The minimal requirements are basically as outlined in the EGEE III operations manpower plan, plus:
- Middleware development: reasonable stability for what WLCG needs should have been attained by this time
- Storage-ware: likely to still be (smoothly) evolving
- LCG Operations and Application / User support: outside the scope of EGI; continues as now (GDB, WLCG section of operations meetings, WLCG Collaboration and other workshops, WLCG procedures, etc.)
Important features to plan for at the same time are start-up funding for NGIs, migration scenarios, a guarantee of non-disruptive evolution, etc.
I.Bird noted that if the NGIs do not agree to support a single set of middleware it will be a major issue to share the resources. WLCG should clarify what is needed to progress smoothly.
J.Gordon added that the sites should agree to run several middleware stacks so that WLCG can specify the one needed.
I.Bird noted that by mid-2008 WLCG should have clarified whether EGI is being designed in a way that can support WLCG. Perhaps this could be discussed at the April workshop: http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=6552.
In addition, the Tier-2 sites' situation should also be presented and discussed.
4. LHCb QR Report (Slides) - Ph.Charpentier
Ph.Charpentier presented the status and progress of LHCb during the last quarter.
4.1 Activities since November ‘07
LHCb completed the testing of their Core Software and the Application Area’s packages, mainly:
- Test latest release of ROOT (5.18)
- Certification of LCG 54
The production activities performed were:
- Simulation continued at a low pace (few physics requests).
- Stripping of MC signal data: took place at most Tier1s, with some problems with data access.
- Preparation of data sets for CCRC’08: now using 1.5 GB files of “RAW data” built from MC files (grouping 100 input files).
The LHCb Core Computing focused on the development of DIRAC3 for CCRC’08:
- Re-engineering of the whole DIRAC (WMS and DMS)
- Now SRM v2.2 usage through gfal python API
- gLite WMS usage
4.2 Sites Configuration
LHCb is in contact with the Sites to deploy the LFC mirror now:
- DB replication using 3D from CERN to all Tier1s. In place for 6 months.
- LFC service for scalability and redundancy. In production at CNAF, RAL, IN2P3 (GridKa coming). Missing PIC and SARA.
Site SE migration was in progress during the quarter and was very manpower intensive:
- RAL (dCache to Castor2)
- PIC (Castor1 to dCache for T1D0)
- CNAF (Castor2 to StoRM for TxD1)
4.3 DIRAC3 for CCRC08
DIRAC3 was being commissioned during the quarter:
- Most components are ready, fully integrated and tested
- Basic functionality (equivalent to DIRAC2) is already available
Two weeks ago LHCb had a full rehearsal week; all developers came to CERN in order to follow the progress of the challenge and fix problems as quickly as possible.
DIRAC3 planning (as defined on the 15 Nov 2007)
- 30 Nov 2007: Basic functionality
- 15 Dec 2007: Production Management, start tests
- 15 Jan 2008: Full CCRC functionality, tests start
- 5 Feb 2008: Start tests for CCRC phase 1
- 18 Feb 2008: Run CCRC
- 31 Mar 2008: Full functionality, ready for CCRC phase 2 tests
The current status is on schedule, but one has to note that several features (e.g. SRM-related) still have to be clarified.
4.4 LHCb Activities during CCRC
Raw data upload: Online to Tier0 storage (CERN Castor)
- Use DIRAC transfer framework. Exercise two transfer tools (Castor rfcp, Grid FTP)
Raw data distribution to Tier1s
- The LHCb Tier-1 sites are: CNAF, GridKa, IN2P3, NIKHEF, PIC, RAL
- Use gLite File Transfer System (FTS), based on SRM v2.2
- Share according to resource pledges from sites
Data reconstruction at Tier0+1
- Production of RDST, stored locally (using SRM v2.2)
- Data access using also SRM v2 (various storage back-ends: Castor and dCache)
For May: stripping of reconstructed data
- Initially foreseen in Feb, but de-scoped
- Distribution of streamed DSTs to Tier1s
- If possible include file merging
Data sharing is split according to Tier1 pledges (as of February 15th).
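The pledge-based split mentioned above can be sketched as follows. This is a minimal illustration, not LHCb's actual DIRAC logic, and the pledge fractions used are hypothetical placeholders, not the real February-15 pledges.

```python
# Hypothetical sketch of pledge-based data sharing across the LHCb Tier-1s.
# The pledge fractions below are illustrative placeholders, NOT the real
# pledges referred to in the minutes.
def share_files(n_files, pledges):
    """Split n_files across sites in proportion to their pledge fractions."""
    total = sum(pledges.values())
    shares = {site: int(round(n_files * p / total)) for site, p in pledges.items()}
    # Assign any rounding remainder to the largest pledger.
    diff = n_files - sum(shares.values())
    if diff:
        top = max(pledges, key=pledges.get)
        shares[top] += diff
    return shares

pledges = {"CNAF": 0.15, "GridKa": 0.25, "IN2P3": 0.25,
           "NIKHEF": 0.10, "PIC": 0.05, "RAL": 0.20}  # illustrative only
print(share_files(1000, pledges))
```

With such a scheme every file is assigned to exactly one Tier-1, and sites receive work in proportion to their declared resources.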
The LHCb SRM v2.2 space token descriptions are:
- LHCb_RAW (T1D0)
- LHCb_RDST (T1D0)
- LHCb_M-DST (T1D1) – not needed for February CCRC (no stripping)
- LHCb_DST (T0D1) – not at CERN
- LHCb_FAILOVER (T0D1)
The LHCb_FAILOVER token is used for temporary uploads in case of destination unavailability.
All data can be scrapped after the challenge:
- SRM bulk removal was already tested during the challenge.
- Based on a 2-week run there are 28,000 files (42 TB).
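The file count and volume quoted above are consistent with the ~1.5 GB CCRC'08 file size mentioned earlier; a quick cross-check (assuming decimal units for TB and GB):

```python
# Cross-check of the numbers in the minutes: 42 TB over 28,000 files
# gives the ~1.5 GB file size quoted for the CCRC'08 data sets.
n_files = 28_000
total_bytes = 42e12           # 42 TB, decimal units assumed
avg_gb = total_bytes / n_files / 1e9
print(round(avg_gb, 2))       # → 1.5
```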
The plans for CCRC’08 in May are:
- 4 weeks continuous running
- Established services and procedures
4.5 Data from Pit 8 to CASTOR
During the first weeks of February there was a continuous transfer at low rate. The plot below shows the network utilization during 12 hours; in green are the migration and the transfer to the Tier-1 sites. The peaks are due to the CASTOR storage patterns.
Since 18 February transfers have run at nominal rate (70 MB/s) with a ~50% duty cycle. Migration starts after the transfers, which are overlapped with the movement of the data to the Tier-1 sites.
The plot below shows the constant increase of migrated files.
4.6 Tier-0 to Tier-1 Transfers
Transfers are taking place to all the 6 Tier-1 sites:
- Share according to pledges works well
- Some backlog effects were observed, even at low rate. The files are only transferred after successful migration and CRC check
- Some problems were seen at IN2P3 (dCache configuration). “Space full” even on T1D0, bug reported to dCache
File removal was also tested:
- SRM v2 removal works, but the space is not recovered (a bug).
Every two hours files were transferred to all 6 LHCb Tier-1 sites.
And below is the daily average for 10 days in February.
4.7 Tier-0 and Tier-1 Reconstruction
Reconstruction now uses the new DIRAC3 WMS, which uses the gLite WMS for launching pilot jobs, and is also used at Tier0.
- Using SRM v2.2 for file access (srm PrepareToGet). Reminder: data access from disk servers (rootd, rfio or gsidcap)
There was a slow start due to debugging, but jobs are now submitted steadily and running at all sites. The main open issue is that dCache sites were resetting the gsidcap ports and this was not caught properly by ROOT: it believes EOF is reached, so the job terminates successfully, but not all events are really processed.
The last quarter was mainly devoted to development and testing of DIRAC3.
Simulation, reconstruction and stripping activities are ongoing at a low pace, still using DIRAC2.
Analysis (using Ganga + DIRAC2) is also ongoing at most sites:
- Distribution of analysis limited due to disk crisis at most sites
- Most stripped data are at CERN only.
CCRC’08 is now running well:
- Next week: steady processing
- March: introduce more complex workflows (stripping)
The next steps are:
- Fully commission DIRAC3 for simulation and analysis
- Get ready for 4 weeks steady running at nominal rate in May
LHCb would like to include analysis using generic pilot jobs as soon as possible; LHCb needs the Pilot Jobs approved.
LHCb is ready as soon as they are approved by the ad-hoc working group and gLExec is deployed at all sites.
5. SAM Tests Update (Slides) - M.Schulz
M.Schulz presented a summary of the status of the SAM Availability tests and how the calculations are performed.
5.1 SAM Data Management Tests
There are two kinds of SE tests: central and distributed.
Central tests execute “put, get and delete” operations:
- Tests are run from the CERN SAM UI
- All SEs are tested
- Tests run via lcg-util commands
- The lcg-util default configuration is used: therefore the SRM-V1 interface is tested
Distributed tests:
- Same tests, but only the “Default-SE” is tested
- Some sites configured this for OPS to point to a dedicated Classic-SE
In summary: the SAM tests do NOT test SRM-V2 interfaces!!
LCG-util clients initially had no command line options for selecting protocols.
Now it has:
- No option == support for legacy clients (srm-v1)
- -D == indicate preferred version (defaults back to v1)
- -T,-U == specify mandatory versions for source and destination
Using a classic SE as the default SE for the OPS VO:
- Site admins have a way to get an orthogonal, independent set of tests
- Tests verify different services at the same time: SE, InfoSystem, Catalogue, WN config.
In summary: we miss relevant aspects in our tests. What should be done?
5.2 Short and Long Term Actions
The short-term actions proposed are:
- Modify tests to test both: SRM-V1 and SRM-V2 (SAM team)
- OPS should use same SE as Experiments not special dedicated SEs. (Site admins)
L.Dell’Agnello noted that the Experiments use different SEs; which SE should OPS test?
M.Schulz agreed that with several VOs it is difficult to test them all with the OPS tests. One will have to configure the VO-tests to check some VO-specific SEs.
Or at least sites should use the same technology as the real VOs. For instance, do not use a classic SE for OPS while the HEP VOs actually use CASTOR; in that case OPS should test CASTOR too.
I.Bird added that the OPS tests are still needed until the VO-specific tests fully replace them. OPS tests have been used for reporting for more than a year; changing them should be done in a managed way or it will artificially modify the validity of the data collected over time.
Longer term actions proposed are:
- Better interpretation of test results
- Define dependency matrix among tests
- Add “static” file-based information
- Test the catalogues independently
- Test accessibility of all SEs on a site in a lighter way.
5.3 Availability and Reliability Calculations
The current algorithm is best understood from the document originally approved by the MB last year.
Reliability: 13th Feb 2007 LCG MB
- Reliability = Availability / ScheduledAvailability
- ScheduledAvailability = 1 – ScheduledDownTime – UnknownInterval
- UnknownInterval = TimeWithUnknownResults/Time
- UnscheduledDownTime = 1 – Availability - ScheduledDownTime - UnknownInterval
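The relations above can be sketched directly in code. This is a minimal illustration of the algebra only, with all quantities expressed as fractions of the reporting period; the example inputs are hypothetical.

```python
# Sketch of the reliability algebra from the 13 Feb 2007 MB document.
# availability and scheduled_down are fractions of the period in [0, 1];
# unknown results are given as a time and divided by the total period.
def reliability(availability, scheduled_down, unknown_time, total_time):
    unknown = unknown_time / total_time                 # UnknownInterval
    scheduled_availability = 1 - scheduled_down - unknown
    unscheduled_down = 1 - availability - scheduled_down - unknown
    return availability / scheduled_availability, unscheduled_down

# Hypothetical month: 80% available, 10% scheduled downtime,
# 5% of the 720 hours with unknown test results.
rel, unsched = reliability(0.80, 0.10, 0.05 * 720, 720)
print(round(rel, 3), round(unsched, 3))   # → 0.941 0.05
```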
Current algorithm differs with respect to:
a) Service status computation on a continuous time scale
b) Consideration of Scheduled Downtime
c) Handling of UNKNOWN status
d) Validity of Test Results
A Service is up if at least one of its instances is available.
A Site is up when all critical site services (CE, SE, etc.) are up. The lowest state wins (UP UP UP DOWN == DOWN).
The Global services (WMS, LFC, etc) are up if at least one instance is up somewhere.
A Service Instance can be in the following states:
- ScheduledDown (from GOCDB; tests are ignored)
- DontCare (no critical tests defined)
- UP (all critical tests OK)
- DOWN (at least one critical test failed)
- UNKNOWN (all critical tests that ran are OK, but at least one result is not available)
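The aggregation rules above can be sketched as follows. This is an illustration only: the ranking of UNKNOWN between UP and DOWN is an assumption, and ScheduledDown/DontCare handling is omitted for brevity.

```python
# Sketch of the status aggregation rules described in the minutes.
# Ranking UNKNOWN between DOWN and UP is an assumption of this sketch.
RANK = {"DOWN": 0, "UNKNOWN": 1, "UP": 2}

def service_status(instance_states):
    # A service is up if at least one instance is up:
    # take the best state any instance reached.
    return max(instance_states, key=lambda s: RANK[s])

def site_status(service_states):
    # The lowest state wins: UP UP UP DOWN == DOWN.
    return min(service_states, key=lambda s: RANK[s])

print(site_status([service_status(["DOWN", "UP"]),      # CE: one instance up
                   service_status(["UP"]),              # SE: up
                   service_status(["DOWN", "DOWN"])]))  # all instances down → DOWN
```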
The current critical tests can be browsed here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOSGNDGFProbes
And there are different (but equivalent) tests for NDGF, EGEE/WLCG and OSG.
It is not clear how much these tests reflect current usage patterns and middleware features.
The Data Management tests are using: InfoSystem, WN config, LFC, SRM-V1 and they need to be able to run independently.
The Site BDII tests are being reworked. They check that the published quantities are correctly formed. Currently there are 70 top-level BDIIs; how many should be tested? Probably about 20 of them, to be selected.
I.Bird noted that the top-level BDIIs are not assigned to a site, but if a site points to a non-working top-level BDII it is a failure of that site. Sites should check the BDII they point to.
J.Gordon agreed that it is up to the site to point to a working BDII server and verify it regularly.
M.Schulz suggested that new tests should be added, for now, outside the standard tests. But they should be calculated with the same algorithm in order to verify how well they represent the status of the site and how they compare to the current SAM tests.
I.Bird added that the VO-specific tests should be again checked and presented at the MB every month, as was done until December.
J.Templon supported the usage of VO-specific tests and added that they should then become part of the Nagios monitoring alarms at the sites. Sites should monitor the VO-specific tests via Nagios and immediately fix issues that make those tests fail.
6. Tape Efficiency (Tape Efficiency Wiki)
Already covered above, in the Action List review (Section 2).
8. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.
18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.
New Milestone added to the HLM dashboard
Tape Efficiency Metrics Published