LCG Management Board
Tuesday 6 November 2007 16:00-18:00 – F2F Meeting
(Version 1 - 13.11.2007)
A.Aimar (notes), D.Barberis, O.Barring, I.Bird, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, S.Foffano, J.Gordon, C.Grandi, A.Heiss, F.Hernandez, J.Knobloch, M.Lamanna, E.Laure, S.Lin, U.Marconi, G.Merino, B.Panzer, D.Petravick, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout
Tuesday 20 November 2007 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 Update on Site Names (Sites Names)
All names have now been agreed and will progressively be used in all reports and tables.
1.3 Sites Reliability and Job Efficiency Reports for October 2007 (SR and JE Tables)
A.Aimar will write to the sites asking them to complete the site reliability reports that remain incomplete after the weekly Operations reports.
Update 13.11.2007: the JE Tables for ATLAS Ganga seem incorrect.
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
· D.Barberis agreed to clarify with the Reviewers the kind of presentations and demos that they are expecting from the Experiments at the Comprehensive Review.
Done two weeks ago.
Done, by L.Robertson
SRM 2.2 Deployment
The process of upgrading sites to SRM v2.2 in production has started:
- The first site - NDGF - is now running SRM v2.2 in production.
- FZK is in the process of upgrading.
- The CASTOR2 SRM 2.2 services are being set up this week.
LHCb SRM Testing
LHCb testing was presented at today's GSSD meeting, where the need for the Experiments to prepare for SRM v2.2 - running in SRM v2.2 mode - well before CCRC'08 was also emphasized.
It is expected that CCRC and SRM usage will go in two phases:
- First focusing on the Tier0 + Tier1 sites
- And then involving also the Tier2 sites
L.Robertson asked whether the testing by the Experiments is considered sufficient by all parties and deployment can proceed as planned.
A.Heiss replied that the installation is proceeding well in FZK and D.Barberis added that for ATLAS the tests seem sufficient for proceeding with the installations at the sites.
2. Update on CCRC-08 Planning (CCRC'08 Meetings, Slides) - J.Shiers
Changes - J.Shiers commented the changes compared to the previous weekly summaries. The proposals are:
- Use the January (pre-)GDB to review metrics, the tools used to drive the tests, and the monitoring tools
- Use the March GDB to analyse CCRC phase 1
- Launch the May challenge at the WLCG workshop (April 21-25, 2008)
- Schedule a mini-workshop after the challenge to summarize and extract lessons learned
- Document performance and lessons learned within 4 weeks.
News - The Tier-2 coordinators have been nominated and are now added to the CCRC mailing list.
Excellent Tier0-Tier1 transfer rates were achieved by both ATLAS and CMS.
F2F CCRC Meeting - The sites have expressed the need for the experiments to clarify explicitly, and in more detail, what they expect from each site. There are quite a lot of details that need to be worked through this year (in time for the February challenge). The same will be true for the May challenge.
For the December GDB it is not clear that a ½ day F2F will be enough.
SRM 2.2 - SRM v2.2 was explicitly listed by 3 experiments as a pre-requisite for CCRC’08, and implicitly by ALICE, which requires it for Tier0-Tier1 FTS transfers.
The CCRC Workshop should be in June (12-13) just following the GDB.
Documentation - It is important to document the lessons learnt because they are of lasting value.
3. LHCC Comprehensive Review Agenda (Agenda) - L.Robertson
L.Robertson showed the Agenda of the LHCC Comprehensive Review.
The only changes are the move of “Asia-Pacific Tier-2s (30')” by Glenn Moloney (Univ. of Melbourne) to the morning of the Second Day and of “Management, Planning and Communication” to 17:00 on the First Day.
D.Barberis reported that the agreement with the referees is that the Experiments will provide “walk-through” demonstrations of their applications rather than complete demonstrations actually running. For instance, ATLAS will provide an end-to-end demonstration, from data selection in the catalog to submitting jobs, retrieving the output and finally producing plots. This will be explained with screen shots and explanations of what needs to be done to execute the applications. The other Experiments will prepare something similar.
L.Robertson will send a reminder to all speakers of the LHCC Comprehensive Review.
4. ATLAS Quarterly Report and Plans (Slides) - D.Barberis
D.Barberis presented the quarterly report on progress and plans of the ATLAS Experiment.
Please refer to the Slides for more details.
4.1 Recent news from ATLAS
Slides 3-5 show the status of the construction and the schedule, which shows that the pit should be closed from the end of April.
4.2 M4 Cosmic Run
ATLAS had a cosmic run this summer - August 23 – September 3, called M4 - with the first large scale export of data to the Tier-1 sites.
M4 had these characteristics:
- Using 4 SFOs with a data rate < ~250 MB/s
- Data written into Castor 2 (~40 TB)
- Full Tier-0 operation
- RAW data subscribed to Tier-1 tape
- ESD data subscribed to Tier-1 disk
- ESD data subscribed from Tier-1s to Tier-2s
- Analyse M4 data at Tier-2s
By the end of the M4 challenge all the goals above were reached (see slides 7-8).
4.3 Data and Streaming Decisions
Slide 9: There is no ‘obvious’ right way to stream therefore flexibility is vital. The overlaps, of events in more than one stream, vary with luminosity.
All Streaming (RAW, ESD, AOD) is based on trigger decisions. The baseline is to have ~5 physics streams, plus express stream and calibration streams.
The Physics streams are “inclusive”; i.e. one event may be in more than one stream depending on the triggers (e+γ, μ+Bphys, jets, τ+Et miss, minbias); there can be overlaps of ~10%. The ESD streams will be the same as the RAW streams; the AOD streams come from central production and reprocessing.
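The inclusive streaming model described above can be sketched as follows. This is an illustrative toy, not ATLAS code: the stream names and trigger identifiers are hypothetical, but the mechanism - an event is written to every stream whose triggers fired, which is what produces the ~10% overlap - is the one described in the minutes.

```python
# Illustrative sketch (not ATLAS code): inclusive streaming assigns an
# event to every physics stream whose trigger fired, so an event firing
# triggers belonging to more than one stream appears in several streams.

# Hypothetical mapping of trigger names to the ~5 physics streams.
STREAMS = {
    "egamma":    {"e25", "g20"},
    "muonB":     {"mu20", "Bphys"},
    "jets":      {"j100"},
    "tauEtmiss": {"tau25", "xe40"},
    "minbias":   {"mbts"},
}

def assign_streams(fired_triggers):
    """Return every stream whose trigger set overlaps the fired triggers."""
    fired = set(fired_triggers)
    return sorted(s for s, trigs in STREAMS.items() if trigs & fired)

# An event firing both an electron and a muon trigger lands in two streams.
print(assign_streams(["e25", "mu20"]))   # ['egamma', 'muonB']
print(assign_streams(["j100"]))          # ['jets']
```

The overlap fraction of a sample is then just the fraction of events for which `assign_streams` returns more than one stream.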
ATLAS has also been working on the definition of Derived Physics Datasets (DPD), used to represent many derivations (skimmed AOD, data collections, augmented AOD, other formats). In each case the aim is faster access, smaller data and more portable formats.
Decisions have been taken (see slide 11 and 12) about Data on Disk at the Tier-2 sites.
There will be ~30 Tier-2 sites of very different size containing some of ESD and RAW data:
- In 2007: 10% of RAW and 30% of ESD in Tier-2 cloud
- In 2008: 30% of RAW and 150% of ESD in Tier-2 cloud
- In 2009 and after: 10% of RAW and 30% of ESD in Tier-2 cloud
- This data will largely be ‘pre-placed’ in early running, with recall of small samples through the group production at the Tier-1s
Additional access to ESD and RAW will be in the CAF: about 1/18 RAW and 10% ESD.
In total there will be about 10 copies of full AOD on disk at the sites.
In order to perform On Demand Analysis ATLAS will need:
- Restricted Tier 2 sites and CERN Analysis Facility (CAF)
- Most ATLAS Tier 2 data should be ‘placed’ with a limited lifetime as needed by the users (a few months)
- Role and group based quotas are important
- A study group has been launched to define what a Tier-3 is and how end-user analysis will work on the Tier-3 sites.
Event Size and Performance (slides 16-17) have been studied and improved between successive releases (Release 12 vs. Release 13):
- ESD Size: Rel 12 (~1700kB) => Rel 13 (~800kB)
- AOD Size: Rel 12 (~200kB) => Rel 13 (~250kB)
Slides 19-20 show the status of the Database Replications.
Leveraging on the work of the 3D Project Infrastructure the ATLAS Conditions DB replication is now in production with data to all ATLAS Tier-1 sites.
In addition ATLAS users are now starting to use the TAGs Database:
- TAGs support direct navigation to events (RAW, ESD, AOD)
- A selection of e.g. 5% of events via a TAG query is ~20x faster than reading all events and rejecting 95% of them.
Slides 21-23 show the tests and the current usages of the TAGs Database.
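The x20 gain quoted above follows directly from reading only the selected events. The toy model below (not the ATLAS TAG schema; the record fields and selection are hypothetical) shows the mechanism: query the small TAG summary table first, then read only the matching full events.

```python
# Toy model (not the ATLAS TAG schema): a TAG is a small summary record
# per event; querying TAGs first means only selected events are read in full.

events = [{"id": i, "payload": "x" * 1000} for i in range(10_000)]
# Hypothetical TAG: event id plus one summary quantity used for selection.
tags = [{"id": e["id"], "nmuons": e["id"] % 20} for e in events]

# Without TAGs: read every full event, keep ~5% of them.
full_reads_scan = len(events)
selected_scan = [e for e in events if e["id"] % 20 == 0]

# With TAGs: query the small TAG table, then read only matching events.
selected_ids = {t["id"] for t in tags if t["nmuons"] == 0}
full_reads_tag = len(selected_ids)

print(len(selected_scan), full_reads_scan, full_reads_tag)  # 500 10000 500
print(f"full-event reads reduced by x{full_reads_scan // full_reads_tag}")
```

Selecting 5% of events reduces the full-event reads by a factor 20, matching the figure reported from the tests (ignoring the small cost of scanning the TAG table itself).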
4.4 Distributed Activities
Distributed Simulation Production runs continuously on the 3 Grids (EGEE, OSG and NorduGrid) and recently reached 1M events/day.
The rate is limited by the needs and by the availability of data storage more than by resources. Currently ~50% is done at Tier-2 sites, ~30% at Tier-1 sites and 6% at the Tier-0 (slide 25).
Validation of simulation and reconstruction with release 13 is in progress, while large-scale reconstruction for the detector paper and the FDR will start soon.
For Distributed Analysis, GANGA simplifies running ATLAS (and LHCb) applications on a variety of Grid and non-Grid back-ends (as shown in slide 26). ATLAS end users are learning to use the appropriate tools (such as Ganga) to send jobs to their input data, rather than copying files to their local computing clusters and running locally.
The Export from Tier-0 to Tier-1 sites works. Last data throughput tests (slide 28) showed that all obstacles to data export from CERN have been identified and removed:
- An export rate of ~1.2 GB/s could be sustained for prolonged periods using an incomplete set of Tier-1s
- BNL took less than their nominal rate (but we know they can take a lot more)
- ASGC was not included but will join next time (November) as its problems have since been fixed
The Throughput Tests will also continue (a few days/month) until all data paths are shown to perform at nominal rates:
a) Tier-0 → Tier-1s → Tier-2s for real data distribution
b) Tier-2 → Tier-1 → Tier-1s → Tier-2s for simulation production
c) Tier-1 ⇔ Tier-1 for reprocessing
4.5 Distributed Computing Re-organization
The Distributed Computing structure has been reorganized. Until now we had two separate areas within Software & Computing, covering respectively the development and operation of Grid Tools & Services. This structure turned out to be less than optimal to ensure good communication between developers and operators, and also cross-communication between activity areas.
To overcome this situation, we decided to create a “Distributed Computing” project that includes both development and operations activities, within which people can be assigned to tasks in a more flexible way. As in the near future the needs of operations have to set the priorities for everybody:
- Kors Bos, currently Computing Operations Coordinator, will lead the Distributed Computing Project
- Jim Shank will be Deputy Distributed Computing PL
- Massimo Lamanna will be responsible for all development activities
- Alexei Klimentov will be responsible for all operation activities
The first task of the people named above (plus D.Barberis) is to write down, in close consultation with all people currently involved in these activities:
- A description of scope and organisation of the Distributed Computing project
- The global system architecture that can be achieved by mid-2008
- The work plan to get to that architecture
- The list of deliverables and milestones, taking external constraints into account (M* runs, SRM2.2 readiness, FDR, CCRC, etc)
- The manpower needed and available for each task
As soon as this is completed (in 2-3 weeks), the new organisation will become effective.
4.6 Evolution of Production System
During the ATLAS Computing Operations Meeting in the Software & Computing Week it was discussed and decided that the ATLAS production system will evolve towards having just one way of submitting and running production jobs on the OSG and EGEE Grid resources.
A suite of ATLAS and Middleware tools and services (the new names of Pallas and Palette were proposed) will be selected to make this happen.
Two important choices of input to the baseline system were made already during the meeting: the Panda pilot job technology and the Local File Catalog LFC will be used.
While this may have longer term implications for distributed analysis, the decision does not imply that the same tool will be used for that purpose; both the problems to be addressed and the scale are rather different.
In the short term, while the developers work together to turn the currently available set of tools into a coherent and modular system suitable for the longer-term production needs of ATLAS, the production will continue at full speed with the system used till now, with the usual bug fixes and with the minimal evolutions needed for good operation.
It was realised that for NorduGrid this evolution would not be straightforward, as pilot jobs do not really fit that architecture, which is already performing very well. NorduGrid and NDGF support the idea of having just one way of submitting jobs to all the grids.
A complete and concise technical documentation and a proof of concept of the new system must however be provided before any decision can be made. This concerns both the "pilot job" option of submitting jobs and the choice of the file catalogue.
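The pilot-job model that Panda uses, and that the NorduGrid discussion above concerns, can be sketched in a few lines. This is an illustrative sketch only, not the Panda implementation: the queue, job format and function names are hypothetical. The essential idea is that the grid runs generic "pilot" wrappers, and each pilot, once it actually holds a worker node, pulls real payloads from a central task queue.

```python
# Minimal sketch of the pilot-job model (illustrative, not Panda code):
# generic pilots are submitted to the grid; the real payload is assigned
# only when a pilot is already running on a worker node.
import queue

task_queue = queue.Queue()          # stands in for the central task server
for n in range(3):
    task_queue.put({"job_id": n, "cmd": f"simulate sample {n}"})

def run_pilot(pilot_id, results):
    """A pilot keeps fetching and executing payloads until none remain."""
    while True:
        try:
            job = task_queue.get_nowait()
        except queue.Empty:
            return  # nothing left: the pilot exits and frees the slot
        # A real pilot would also validate the environment and report back.
        results.append((pilot_id, job["job_id"]))

results = []
run_pilot("pilot-A", results)
print(results)  # all three payloads executed by the single pilot
```

Late binding of payload to worker node is what makes the model attractive for production, and also what sits uneasily with the NorduGrid architecture, where the broker assigns concrete jobs to resources up front.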
4.7 Schedule and Plans
FDR must test the full ATLAS data flow system, end to end
- SFO → Tier-0 → calib/align/recon → Tier-1s → Tier-2s → analyse
- Stage-in (Tier-1s) → reprocess → Tier-2s → analyse
- Simulate (Tier-2s) → Tier-1s → Tier-2s → analyse
The SFO→Tier-0 tests interfere with cosmic data-taking.
We must decouple these tests from the global data distribution and distributed operation tests as much as possible. CCRC’08 must test the full distributed operations at the same time for all LHC experiments.
- 13.0.30: Week of 05-09 Nov
- 13.0.4: Week of 19-23 Nov
- 13.1.0: Week of 5-9 Nov (note clash with 13.0.30 - should be manageable)
- 13.2.0: Week of 3-7 Dec
- 14.0.X: Staged release build starts week of 17-21 Dec; base release 14.0.0 available Mid-end Feb 2008
- 15.0.X (tentative): Mid 2008
- M6: (Not earlier than) second half of February 2008
- Continuous mode: Start late April 2008 (depends on LHC schedule)
- FDR Phase I: February 2008 (before M6)
- FDR Phase II: April 2008 (before start of continuous data-taking mode)
- CCRC'08 Phase I: February 2008 (coincides with FDR/I)
- CCRC'08 Phase II: May 2008 (in parallel with cosmic data-taking activities)
5. LHCb Quarterly Report and Plans (Slides) - U.Marconi
U.Marconi presented the quarterly report with progress and plans for the LHCb Experiment.
Please refer to the Slides for the details.
Slides 2-3 show the data flow and distribution.
One can notice that all LHCb off-line activities – Reconstruction, Stripping, and Users Analysis - will take place at Tier-1 sites, except Simulation at the Tier-2 sites. All data – RAW, DST, MC - will be both at CERN and at the Tier-1 sites but the rDST – reduced DST – will be at the Tier-1 sites only.
The Users Analysis will use ROOT-tuple files.
5.1 DC’06 and 2007 Activities
DC’06 - running since June 2006 - produced 80% of the whole LHCb production. Slide 4 shows the details over time and by site. LHCb also simulated the distribution of Tier-0 data: RAW data was collected at CERN and distributed to the Tier-1s, emulating real data taking, for reconstruction and stripping at the sites.
Since February 2007:
- Event reconstruction at the Tier-1s of RAW data files no longer in cache, which have to be recalled from tape.
- The rDST data output has to be uploaded locally at the Tier-1.
Since June 2007:
- Event stripping at the Tier-1s.
- DST files have to be distributed to the Tier-1s.
5.2 Feedback Received
The feedback collected from the users is that:
- Reconstruction is easy for first prompt processing, but difficult for reprocessing, when files have to be staged.
- Too much instability in the SEs was observed:
- Staging at some sites is extremely slow. Problems with the SE software? Problems with the configuration (number of servers, number of tape drives)?
- Some files are not retrievable from tape: they are registered in the LFC and found using srm-get-metadata, but fail to get a tURL (error in lcg-gt).
- Shortage problems were encountered with Disk1TapeX storage at the sites.
- It is not easy to monitor the availability of replicas:
- Need to establish a protocol for sites to send a warning and set a flag in the LFC indicating that a replica is temporarily unavailable (not used for matching jobs).
- On the LHCb side it may help to tune the number of stage requests issued in one go, to optimise the recall from tape.
- Inconsistencies between SRM tURLs and root access.
- Problems with ROOT finding the HOME directory at RAL, fixed by providing an additional library (compatibility mode on SLC4).
- Unreliability of rfio, and problems with rootd protocol authentication on the Grid (now fixed by ROOT).
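The replica-flagging protocol requested above can be sketched as follows. This is a hypothetical illustration, not the LFC client API: the catalogue layout, LFN and function names are invented, but it shows the intended behaviour - a warning from the site flags a replica as temporarily unavailable, and job matching then ignores it instead of sending jobs to data that cannot be read.

```python
# Sketch of the proposed protocol (hypothetical API, not the LFC client):
# a site warning flags a replica unavailable; job matching skips it.

catalogue = {
    "lfn:/lhcb/raw/run1234": [
        {"se": "CERN-disk", "available": True},
        {"se": "RAL-tape",  "available": True},
    ],
}

def set_unavailable(lfn, se):
    """Called on a warning from the site: mark the replica as unusable."""
    for rep in catalogue[lfn]:
        if rep["se"] == se:
            rep["available"] = False

def replicas_for_matching(lfn):
    """Job matching only considers replicas still flagged as available."""
    return [r["se"] for r in catalogue[lfn] if r["available"]]

set_unavailable("lfn:/lhcb/raw/run1234", "RAL-tape")
print(replicas_for_matching("lfn:/lhcb/raw/run1234"))  # ['CERN-disk']
```

The flag is temporary by design: once the site clears the problem it would reset `available` to True and the replica re-enters job matching.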
5.3 Tests and Development
The SL4 migration was straightforward for LHCb applications. Problems were found with middleware clients used by those applications: dCache, gfal, lfc, etc.
It is essential to test sites permanently with the SAM framework: CE, SE, SRM.
The SRM v2 tests passed successfully. Several plans for SE migration are ongoing at RAL, PIC, CNAF and SARA (to NIKHEF). This requires a large effort from LHCb, in particular for changing the replica information in the LFC.
The required VOMS set of groups/roles is not available. With a default set of roles/groups there are still difficulties in obtaining the proper mapping, in particular for the SGM and PRD roles. This induces difficulties in LFC registration (it is impossible for LHCb to modify the internal mapping of DNs and FQANs without going through the administrators).
Slide 10 shows the increasing usage of GANGA and of user analysis jobs from January 2007 to date.
5.4 CCRC08 Goals
The main LHCb goals for CCRC’08 are:
- Test the full chain: from DAQ to Tier-0 to Tier-1’s.
- Test data transfer and data access running concurrently (current tests have tested individual components).
- Test DB services at sites: conditions DB and LFC replicas.
- Tests in May will include the analysis component.
- Test the LHCb prioritisation approach to balance production and analysis at the Tier-1 centres.
- Test sites response to “chaotic activity” going on in parallel to the scheduled production activity.
The tasks that LHCb plans to execute are:
- RAW data distribution from the pit to the Tier-0 centre.
- RAW data distribution from the Tier-0 to the Tier-1 centres.
- Reconstruction of the RAW data at CERN and at the Tier-1s for the production of rDST data.
- Stripping of data at CERN and at the Tier-1 centres.
- Distribution of DST data to all other centres.
- Preparation of RAW data will occur over the next few months.
The activities and goals for the February and May challenges are:
6. WLCG policy on pilot jobs (Paper) - L.Robertson
L.Robertson presented the proposal that he had distributed to the MB.
The issue of pilot jobs has been discussed several times at MB and GDB meetings; the proposal distributed collects all the decisions on the subject and asks for the approval of the MB.
The goal of the document is to define the requirements that, once met, will ensure that pilot jobs are allowed at all sites.
Here is the text of the proposal as distributed before the discussion.
L.Dell’Agnello noted that INFN agrees with the requirements in the document but would like each site to perform an assessment once the requirements are met, in order to ensure that the solution adopted satisfies the requirements of the sites from the security point of view.
J.Gordon noted that a “pre-agreement” ensures that the sites will accept the pilot jobs once the tests are executed and if the review is positive. If the sites want to keep the choice of refusing the pilot jobs even if the requirements are met then it is useless to execute the tests and the security review.
L.Robertson noted that a new assessment should be limited to checking whether the requirements above are met and that there are no fundamental issues. A refusal based on new or other issues later would cause the whole pilot-job discussion to restart from zero and would not be compatible with the solutions adopted by the Experiments.
L.Robertson then suggested to first check whether the requirements (1 to 4 above) are agreed.
T.Doyle asked that point 3 should include that glexec must be “analysed, tested and pass the tests with..”.
C.Grandi noted that glexec is part of CREAM and will be certified and reviewed, on its Security aspects, by security experts outside JRA1.
D.Barberis asked that no changes interfere with the current production activities, which are run by the production managers and use a pilot-job mechanism.
F.Hernandez asked what happens if a batch system has problems with the current implementation of glexec.
L.Robertson replied that in that case the issue should be discussed at the MB again: it all depends on the problem found; in general a technical solution could be studied.
L.Robertson, replying to an email from R.Pordes, noted that OSG is not mentioned because the requirements do not refer to OSG or to any specific grid infrastructure.
D.Petravick explained that OSG does not think that the JSPG is the best body to review the framework. Usually OSG approval would go via the VDT process (it is not known whether VDT has reviewed glexec).
D.Petravick added that for OSG it would be fine if the members selected for the security review are also approved by OSG.
J.Gordon explained that it is not the JSPG itself that will execute the review; the JSPG will choose reviewers with adequate expertise (security experts, batch experts, etc.).
L.Robertson proposed to change the text to the requirement that “the security aspects are approved by the EGEE and OSG security teams.”
Ph.Charpentier asked for a clear timescale for when these pre-requisites will be verified. For instance, a date should be set by which the LCAS/LCMAPS development, certification and deployment must be completed.
C.Grandi replied that a prototype is currently being tested at NIKHEF, but a clear date is not defined. The release could come by the end of 2007, but certification and deployment will follow later.
L.Robertson noted that the MB will have to discuss the timescale and the reviewers’ membership in a couple of weeks and expects information on the LCAS/LCMAPS development and on the names of the reviewers selected by the JSPG.
F.Hernandez asked how this policy will affect the Tier-2 sites and how the information will be distributed to them.
L.Robertson replied that the Tier-2 sites will be informed at the GDB.
Coming back to the initial issue, L.Robertson proposed that it could be explicitly specified that the MB will have to decide whether the requirements are met.
The MB approved the proposal.
Update: New proposal distributed by L.Robertson after the MB meeting.
7. AOB
D.Barberis noted that the INFN pledges for 2008, in the Resource Planning Tables, were reduced considerably and this may cause problems for ATLAS and probably for the other Experiments.
He asked that whenever a site changes its pledges the Experiments should be officially informed by the site and they should not discover it by themselves via the Resource Planning tables.
8. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.