LCG Management Board
Tuesday 7 October 2008 15:00-17:00 – F2F Meeting
(Version 1 – 11.10.2008)
A.Aimar (notes), I.Bird(chair), D.Barberis, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, J.Gordon, A.Heiss, M.Kasemann, M.Lamanna, H.Marten, P.Mato, G.Merino, A.Pace, B.Panzer, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Next Meeting: Tuesday 14 October 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 QR Preparation (Previous QR)
A.Aimar reminded that by mid-October the QR report (June-Sept 2008) has to be sent to the Overview Board, which is meeting on 27 October 2008.
2. Action List Review (List of actions)
About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm it.
About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.
J.Templon summarized the situation. The estimate is that a patch for SCAS will be available by the day after the MB meeting.
- DONE. A document describing the shares wanted by ATLAS
- DONE. Selected sites should deploy it and someone should follow it up.
- ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end
ATLAS will report on the status of the tests. No news at the F2F meeting.
The open issues are:
- What to do for sites using LSF (e.g. Rome)?
- In Italy they tested the gLite “repackaged” by INFN-grid. Needs to be tested on other sites (NL-T1 and Edinburgh will do it).
D.Barberis added that, regarding the LSF testing in May, INFN and CERN SA3 were each waiting for the other to do the testing, and it has not been done.
In agenda already.
I.Bird proposed that this is discussed at the F2F meeting on the following week. Sites and Experiments can send their comments before then.
In agenda already. Proposal distributed by O.Keeble.
No proposal yet.
On the 22 October there will be a special meeting to prepare the Overview Board.
I.Bird said that the situation is more organized now and the GridView milestones are not needed anymore.
J.Shiers said that the VOMS services have improved in terms of response and progress, so the corresponding milestones can also be removed.
3. LCG Operations Weekly Report (Slides) - J.Shiers
Summary of status and progress of the LCG Operations. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
Two detailed “post-mortem” reports were delivered this week:
- Follow-up on the ongoing Oracle problems that have been affecting the CASTOR (and related) services at RAL
- Network problems affecting PIC
A post-mortem is also needed for the major power incident at NL-T1 on Saturday, which is being followed up (no report at Monday’s meeting).
RAL Oracle - Slide 3 shows the report on the Oracle issues at RAL. The problem is still neither solved nor reproduced; therefore the risk is still present. This is a major concern.
PIC Network – Slide 4 shows the very detailed post-mortem from PIC. This problem is understood but has recurred, and there is not much that can be done to correct it. The primary cause of all of these problems, and in particular of the complete network outage suffered by PIC between 23/09/08@03:30 and 23/09/08@11:00, was the major power and cooling problems suffered by the Telvent housing centre in Madrid, where the RedIRIS equipment is housed.
NL-T1 FTS Failures – The problem was spotted by ATLAS. A proper post-mortem is needed. The backup power system did not work.
In most cases problems are resolved in a “reasonable” amount of time (but this can be several days) given their frequency (per site). The problems occur about once per week and this frequency will likely continue in the future. Should a standard action be taken if the handling takes longer than a fixed amount of time? One week for services at the Tier-1 sites? Two days for a Tier-0 service problem?
These major incidents, and a similar number that got close to being major, were largely unpredictable. Not all of them can be avoided (or limited) by adding redundancy or improving monitoring.
There were 4 major incidents a month (counting RAL as 6 separate CASTOR ATLAS downtimes) with an average impact time of 30 hours (skewed by the long CNAF, RAL and CERN incidents). Six can be attributed to hardware and 9 to software. We must expect similar incidents during data taking and be prepared to respond to them, both at the sites and at the global LCG service level.
Should channels to the providers be formalized and the actions in case of delays be formally defined? Should the top 5 ongoing issues be reported to the MB? Just for information or for follow-up?
CCRC09 – February seems too early for CCRC09 if data taking starts in May; the systems will not stay frozen for two months.
T.Cass and J.Gordon suggested doing CCRC09 in March and April.
D.Barberis and M.Kasemann suggested that the coming workshop (13-14 November, see agenda) should define what is going to be executed early next year.
I.Bird suggested that the verification should be done before data taking, not necessarily all in one month.
M.Kasemann added that CMS prefers to do the needed tests without a defined computing exercise in one specific month.
D.Barberis stated that the calendar of the upgrades must be known so that Experiments can prepare their tests and schedule.
J.Gordon added that overlap between Experiments also needs to be tested and must be planned.
Ph.Charpentier noted that VOs should not try to run at peak rates or top bandwidth if the sites are not yet ready with the necessary resources; otherwise the resources are reduced for other VOs using the same site. Some kind of resource throttling at the sites is needed.
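The throttling idea raised here can be illustrated with a small sketch. This is not any existing WLCG component; the class name, the rates and the token-bucket approach are all illustrative assumptions about how a site could cap the rate at which one VO starts jobs.

```python
# Illustrative sketch only: one way a site could throttle per-VO job starts
# so that a single VO cannot exhaust shared resources. All names and numbers
# are hypothetical; this is not part of any WLCG middleware.
import time

class VOThrottle:
    """Token bucket: a VO may start at most `rate` jobs/sec, bursting to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)      # start with a full bucket
        self.last = time.monotonic()

    def allow_job_start(self):
        # Refill tokens proportionally to the elapsed time, capped at `burst`.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

throttle = VOThrottle(rate=2.0, burst=5)
# Ten back-to-back start requests: only the burst worth of them is admitted.
started = sum(throttle.allow_job_start() for _ in range(10))
```

Tuning `rate` and `burst` per VO would be the site's policy knob; rejected starts would simply stay queued until tokens regenerate.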
4. ALICE QR (Slides) – Y.Schutz
Y.Schutz summarized the QR for ALICE. More details are in his slides (Link).
Monte Carlo - Continuous production in preparation for 2009 data taking: production at T1/T2, data and end-user analysis at T2.
Analysis – Several kinds of analysis tools are operational: CAF (fast analysis), analysis train (organized) and end-user analysis (chaotic).
Raw Data Processing - Online production of condition parameters, first pass processing @ T0, replication in T1s, N pass processing @ T1s operational
Software Releases - Stable release ready for data taking; code evaluation and some refactoring can still be done before the LHC start.
Services - New AliEn version deployed routinely with effectively no downtime. Job management in all its forms is used: RB (phased out but still widely used), WMS, and CREAM (very promising initial stability and scaling tests). Continuous deployment of xrootd-enabled SEs.
Accounting - Used 40% of the allocated CPU and 53% of the required CPU. 27% of the pledged storage is operational and 64% of that is used.
Resources - Requirements for 2008/2009 had been re-evaluated (before the LHC incident). New requirements will be prepared for 2009, depending on the LHC schedule; larger requirements with respect to the C-TDR are expected (CPU, disk).
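As a quick sanity check of the storage figures in the accounting item above, note that the two percentages combine multiplicatively (the normalization of the pledge to 1 is just for illustration):

```python
# Quick arithmetic check of the ALICE accounting figures reported above:
# 27% of the pledged storage is operational, and 64% of that is in use.
pledged = 1.0                  # normalize the pledged storage to 1
operational = 0.27 * pledged   # operational fraction of the pledge
used = 0.64 * operational      # used fraction of the pledge
print(round(used, 4))          # about 0.17, i.e. ~17% of the pledge is in use
```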
I.Bird asked why the RB is not phased out by ALICE yet. Support was stopped many months ago and WMS should be used instead.
J.Gordon noted that there are other VOs that are still using the RB.
I.Bird also added that changes of resource requirements for 2009 need to be discussed as is already planned.
5. CMS QR Report (Slides) – M.Kasemann
M.Kasemann summarized the QR for CMS. All details are in the slides (Link).
Slides 2 to 4 show the work done to monitor the performance and availability of the CMS Sites.
For instance, the plot below displays site availability weighted by the amount of resources at each site.
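A resource-weighted availability of this kind is simply a weighted mean; the sketch below illustrates the calculation with invented site names and numbers (not the actual CMS figures):

```python
# Illustrative sketch: availability weighted by resources, i.e. each site's
# availability counted in proportion to its share of the total resources.
# Site names and all numbers are invented for the example.
sites = {
    # site: (availability fraction, resources, e.g. in kSI2k)
    "T1_A": (0.95, 4000),
    "T1_B": (0.80, 1000),
    "T2_C": (0.60, 500),
}

total_resources = sum(res for _, res in sites.values())
weighted = sum(avail * res for avail, res in sites.values()) / total_resources

# For comparison: the plain (unweighted) average over sites.
plain = sum(avail for avail, _ in sites.values()) / len(sites)
print(round(weighted, 3), round(plain, 3))
```

Here the big site dominates, so the weighted figure sits well above the plain average; that is exactly the effect the weighting is meant to capture.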
Slides 5 and 6 show the criteria for site commissioning and slide 8 presents the preparation for the CMS Analysis and how they coordinate which T2 is hosting which physics analysis. This was a delicate and long process in order to reach a clear agreement with each CMS Site.
Monte Carlo production, data processing and distribution are presented in slides 9 to 12.
The CAF functions at Tier-1 and Tier-2 are described below (slide 13).
CERN-based CMS scientists use LXBATCH/LXPLUS for individual data analysis. For user output they will use the CASTOR user pool “CMS-CAF-T2”. Other users perform analysis at “their” Tier-2 centre, with user space assigned and allocated by proximity and association.
CMS plan to test another technology for user space in parallel:
- NFS space provided by BlueArc network storage devices
- storage accessible by the large array of collaboration desktop computing systems at CERN
I.Bird noted that the strategy at CERN does not seem to match the one proposed; this should be discussed at the workshop.
Slide below shows the calendar for the CMS tests (CRAFT = CMS Run At Four Tesla).
CSA08 and CCRC08 demonstrated all key performance targets of the T0, CAF, T1 and T2 infrastructure.
During the summer CMS:
- Improved infrastructure reliability, production tools (Tier-0,…), monitoring and operations
- Started Computing and Offline Run Coordination and Computing shifts
Routine Cosmics and Commissioning Data taking were performed over the summer
- Processed at T0, Calibration & Alignment performed, distributed on demand to several T1/T2 centres
Production of requested Monte Carlo samples performed routinely
- Huge production of MC Start-up Sample (>200M) started when final software and configuration became available
Computing tasks for the coming months:
- consolidate operations
- commission Tier-2 sites
- roll out improved production tools
- work on monitoring and fault detection
- configure and gain experience with the CAF-T2 resources
Global running and Cosmics data taking will allow for more systematic checks of the whole production and transfer chain.
The Analysis of Data and Monte Carlo will move to Tier-2 centres.
Ph.Charpentier noted that this approach (“local user on local resources”) encourages users to use sites close to them and goes against the whole idea of a Grid for executing jobs. This is not in line with a real LHC grid approach and should be discussed openly at the MB. Experiments other than ATLAS and CMS will not have the means to control resources in the same manner.
M.Kasemann replied that CMS needs some way to decide how resources are used. Physics groups are associated with specific sites where their data is or where the closest CPU resources are. Tier-2 sites also provide disk space for their local users. Users and sites know each other, which for instance gives better control of overloading.
D.Barberis noted that ATLAS is trying approaches that avoid localization. But Experiments have to try their best model.
J.Gordon added that the model requires data to be at specified sites, and therefore it cannot always be local.
I.Bird noted that this solution is due to the fact that there is no mechanism to control quotas and disk usage. Localization should be avoided whenever possible; sites should highlight that the resources are shared in exchange for participation in the common infrastructure and the international collaboration.
O.Smirnova added that it is sometimes difficult to explain that the disk shared at NDGF is provided in exchange for disk space elsewhere.
I.Bird replied that this approach was clear and approved by the funding agencies when they signed the MoU.
M.Kasemann added that, according to CMS monitoring, now about 50% of the resources are used by local users and the other 50% by remote users.
6. LHCb (Slides) – Ph.Charpentier
Ph.Charpentier summarized the QR for LHCb. Details are in the slides (Link).
LHCb had a stable version and was ready for 2008 data taking. Gaudi is based on the latest configuration and the applications are ready for using the survey geometry and the conditions DB. There are still difficulties in commissioning CondDB access at the Tier-1s.
Plans for the shutdown
- Use opportunity for “cleaning” part of the framework
- Unifying some interfaces and removing obsolete parts
6.1 New ROOT schema evolution. Under investigation (POOL)
Support file-related records. Needed for file summary records (luminosity, statistics for skimming etc…)
More work is being done on interactivity (GaudiPython)
Merger with ATLAS on “configurables”, a Python-based configuration of applications
Studies on multi-core support (parallel file processing) are taking place in collaboration with LCG-AA / PH-SFT
6.2 Commissioning of DIRAC3
Fully reengineered system, the main features are:
- Single framework for services, clients and agents
- Fully integrated Workload and Data Management Systems
- Supports production and user analysis activities. Allows applying VO policy: priorities, quotas, etc.
- Uses pilot jobs, as in DIRAC2
Ready for using generic pilot jobs (not switched on yet)
Full scale test with generic pilots will take place in the coming weeks
- New bookkeeping system (also integrated)
6.3 Production activities
Complete simulation and stripping of MC data (so-called DC06, as it was launched in 2006) with CCRC-like activity at low rate (10%).
Start 2008 simulation, mainly for alignment and calibration studies, and wait for first data for tuning generators and detector response.
6.4 Issues Encountered
Instability of SEs, in particular dCache
Very good response from sites and dCache developers but a permanent struggle due to various causes:
- Software issues (addressed with sites and developers)
- Sub-optimal hardware configuration at some Tier-1s
- Unavailability of files: files are in the namespace at a site but cannot be accessed, or even a tURL cannot be obtained. Causes include damaged tapes and unavailable servers.
The transfer rates are OK (only a low throughput is needed by LHCb: 70 MB/s)
Three severe issues with WMS
- Mixing up credentials of jobs submitted by the same user with different roles
- Limitation in proxy handling (too few delegations allowed) preventing some users from running jobs (e.g. from the French CA)
- Misbehaviour of WMS after some idle time: cannot find suitable sites even for a job without requirements!
6.5 Outlook for DIRAC 3
DIRAC3 will be fully ready for first data. Analysis will have fully migrated by the end of September: no dependency any longer on SRM v1 (legacy files).
Still to come:
- Glexec on worker nodes. Will allow exploiting the full power of DIRAC: late binding of jobs, VO policy, etc.
- Running analysis jobs with higher priority without site intervention
- The DIRAC3 model was certified long ago by the GDB working group on the Experiment frameworks, and is waiting for the middleware to be ready (SCAS service)
Commissioning of the alignment and calibration loop
- Setting up an LHCb-CAF (Calibration and Alignment Facility)
- Requirements are rather modest (“simple” detector). Start with just 2 8-core machines, 200 GB of disk
- Full commissioning of Conditions Database update and streaming. Currently very few commits to CondDB
It is not clear whether this is a longer or an advanced shutdown. LHCb cannot make much use of cosmics because its detectors are vertical.
More simulation will be needed to replace 2008 data in commissioning for alignment and calibration, and to test the full chain (including the HLT).
Calibration and Alignment (CAF)
Survey of calibration is done on the Online monitoring farm, with a low-rate (~5 Hz) calibration stream to the Tier-0, reconstructed promptly (smallish files on D1 storage).
Calibration loop triggered by this monitoring:
- Stop automatic reconstruction
- Run calibration and alignment at CERN (CAF): dedicated queues set up (currently 20 slots)
- Validation of calibration at CERN (Tier-0) using the calibration stream
6.6 LHCb Full Dress Rehearsal
Complete commissioning of:
- HLT (for early and further data). Algorithms, memory leaks…
- Event streaming and real data online monitoring. Started with first beams but…
- Automatic transfers and reconstruction (à la CCRC)
- Calibration loop
The preparatory phase will include:
- Large samples (as available, events can be reused) of minimum bias events
- Online injection software (mimicking Front-End electronics)
And the time scale for this work is:
- Ready in January
- The actual date depends on maintenance (cooling) at the pit: February or March for the test. Typically running every other week, interleaved with pit commissioning activities
6.7 Resources for LHCb
LHCb does not plan to change its resource needs after the LHC incident.
The 2008 data sample was small anyhow, so there is no influence on the future.
- Give priority to required activities
- Low pace generic activity. Verify all Grid services (storage, transfers, WMS…)
Reassessing the needs at CERN in particular:
First data and hot checks are better done at CERN; already 25% of analysis was anticipated in the TDR.
Better understanding of calibration and analysis needs (monitoring and validation)
Include software testing and validation activities (non negligible).
LHCb needs to have better control over user activities. How can a single user be prevented from exhausting the VO’s share?
LHCb requests that resources not be controlled and used exclusively by local users. The MoU, and on several occasions the Tier-1 and Tier-2 sites, stated that the resources should be globally accessible and shared.
7. Mandate of the User Analysis working group (Slides) – I.Bird
I.Bird proposed a discussion on launching a “User Analysis working group”. The past discussions involved IT and CERN in order to provide management of disk space for analysis activity at CERN, basically how to use the CAF. That is solved; the proposal now (B.Panzer’s proposal) is to face the general issue of supporting user analysis.
Here are some questions that need to be answered in order to clarify the role of the Tiers and the kind of analysis required.
Clarify roles of Tier 1, 2, 3 for user analysis for each experiment
- Batch Analysis – the assumption is that this is grid based
- End-user Analysis – what is this? – laptops, local analysis, etc ?
- What are the distinctions?
What is the set of services that are needed (in addition to what exists now) to support these analysis models?
- NOT an open door for new development requirements.
- Xrootd as common protocol / interface
- Needs for authentication/authorization
- Configurations of MSS pools etc.
What is a “standard model” for a Tier-2 supporting analysis? E.g. which file systems (Lustre, NFS4, GPFS, AFS) with what interface? How many users? How much space per user, etc.?
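As an illustration of the “size per user” question, a back-of-envelope sizing calculation could look like the sketch below; every figure is a hypothetical placeholder, not an agreed number from the working group:

```python
# Back-of-envelope sizing sketch for a Tier-2 supporting user analysis.
# All figures are hypothetical placeholders, not agreed numbers.
users = 50                 # analysis users supported at the site
space_per_user_gb = 200    # scratch/output space per user, in GB
overhead = 1.25            # headroom factor for replication, staging, etc.

total_tb = users * space_per_user_gb * overhead / 1000
print(total_tb)  # total analysis space in TB under these assumptions
```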
D.Barberis noted that user analysis can also be batch or interactive.
For M.Kasemann the distinction is more between “batch vs. interactive” and “coordinated vs. uncoordinated (chaotic)” analysis.
7.2 Mandate and Deliverable
Clarify and confirm the roles of the Tier 1, 2, 3 sites for user analysis activities for each experiment, distinguishing between batch analysis and end-user analysis.
Agree on the minimal set of services, or changes to existing services, essential to implement those models. These must be prioritised clearly between “essential” and “desirable”. The expectation is that this addresses the configuration of services, or additional service instances, rather than new developments. New developments must be avoided if possible, or be clearly justified.
M.Kasemann noted that if an agreement must be reached, the Computing Coordinators must be involved in the decision. The WG should provide a list but the MB decides on the recommendations. If the outcome is less than an “OR” of all needs, then agreeing will take much longer because someone will have to give up some requests.
The deliverables expected are:
- Documented analysis models
- Report on requirements for services and developments with priorities and timescales.
7.3 Membership and Timescale
Membership proposed is:
- 1 -2 people from each experiment
- 4-6 people from Tier 1 and Tier 2 sites
- Representative from EGEE and OSG sw development activities
- A chairperson: there is a proposal, but the person has not been contacted yet
T.Cass noted that a Tier-0 representative should also be included.
J.Templon noted that a security expert should be present in case agreed solutions go against the security concerns of the sites; this should be avoided. For instance, a choice like “xrootd open to all access world-wide” would not be acceptable from the security point of view.
Ph.Charpentier noted that representatives from CASTOR, dCache, xrootd, etc. should be present in order to discuss issues knowing what actually can and cannot be done with the current implementations. In case some software development is needed, they can evaluate the difficulty of such development.
P.Mato asked why this working group is necessary and why access to the grid is not sufficient instead. Services and Experiments representatives should be present.
I.Bird replied that analysis has never been discussed, nor have the actual usage patterns and needs. Where the data is stored and how it is accessed for user analysis has never been discussed.
O.Smirnova added that middleware representatives should be there to see what is supportable or not. Will the access be only via SRM or directly to the software below?
I.Bird replied that this is mostly a Data Management issue.
J.Templon added that the group needs to define analysis together with the developers of the available systems, in order to agree on solutions that actually exist.
I.Bird agreed that this is to ensure that the solution chosen is supportable by the existing systems without requiring a lot of development.
L.Dell’Agnello asked why this is discussed outside the existing agreement with the Tier-1 sites in the WLCG: SRM is what must be used for access and nothing else.
Ph.Charpentier and I.Bird replied that this is not about the protocols to use but about how the data is organized, how many users are going to use it and how many users will access it in parallel.
The timescale to consider is:
- Group should start immediately
- Final report by end of year, monthly reports to MB
14 Oct 2008 - Experiments and Sites (T1 and T2) should send the names of the representatives before next week.
8. Postponed to Next Week
- Requirements and pledges in new CPU units
- Middleware Planning
- Accounting Reports
- Availability and Reliability Reports - Sept 2008
9. Summary of New Actions
14 Oct 2008 - Experiments and Sites (T1 and T2) should send the names of the representatives before next week.