LCG Management Board
Tuesday 11 September 2007 16:00-17:00 - Phone Meeting
(Version 1 - 14.9.2007)
A.Aimar (notes), I.Bird (chair), T.Cass, L.Dell'Agnello, F.Donno, J.Gordon, I.Fisk, M.Kasemann, M.Lamanna, D.Liko, H.Marten, P.Mato, G.Poulard, Di Qing, Y.Schutz, J.Shiers, O.Smirnova, J.Templon
Mailing List Archive:
Next Meeting: Tuesday 18 September 2007 16:00-17:00 - Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous meeting were approved.
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
ALICE not done.
CMS@INFN Done. M.Kasemann explained that CMS has now reviewed the critical tests because some tests were failing at CNAF. A.Sciabà should have fixed that by now.
ATLAS@IN2P3 Done. By email F.Hernandez explained that ATLAS and IN2P3 have investigated the issue and it will be solved. The failures were due to some tests attempting to create files in an area where they did not have write permission.
1. SRM Update (More information; Text) – F.Donno
For convenience, here is the text that she presented, together with the comments and discussions that followed.
SRM 2.2 status - 11 September 2007
1. Main software problem: calls that are not compliant with the agreed SRM specification.
2. Main documentation problem: documentation for configuration and management
3. Ongoing work to get sites correctly configured for experiment tests
The week at CHEP was extremely useful to resolve many open issues.
Status of the implementations
All show-stopper problems fixed (version 1.1-1); files in multiple spaces still to be fixed.
Minor bugs should be fixed with the next release, within the coming week.
Show-stoppers: still many calls that are not SRM compliant.
Missing documentation for configuration and management, resulting in sites still not working.
No tools to manage files in spaces.
1. We agreed on the following plan:
- First, help the sites to be well configured for ATLAS.
- Work on addressing the SRM requests that are still not compliant; the blocking calls have been identified (details are given here).
- Flavia and the sites will produce configuration instructions summarizing what was explained by the developers during the CHEP dCache BOF and the hints given by the developers for the ATLAS configuration.
- Information about files in spaces can be retrieved from and/or added to the dCache billing files, working first on what is reported by the sites. Work in this direction is ongoing, involving Lionel and Flavia.
No show-stoppers. However, there are non-SRM-compliant calls that are important to fix; they should be solved within the coming week.
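The plan above relies on extracting per-space file information from the dCache billing files. As a minimal sketch of that kind of extraction, the snippet below aggregates file counts and bytes per space token; the billing-line format used here is a simplified, hypothetical one (the real dCache billing format differs), so it only illustrates the idea:

```python
import re

# Hypothetical, simplified billing-line format (the real dCache billing
# format differs); each line records one file with its space token.
BILLING_RE = re.compile(
    r"^(?P<date>\S+) (?P<pnfsid>\S+) token=(?P<token>\S+) size=(?P<size>\d+)$"
)

def files_per_space(lines):
    """Aggregate (file count, total bytes) per space token from billing lines."""
    summary = {}
    for line in lines:
        m = BILLING_RE.match(line.strip())
        if not m:
            continue  # skip records that are not in the expected format
        token = m.group("token")
        count, total = summary.get(token, (0, 0))
        summary[token] = (count + 1, total + int(m.group("size")))
    return summary

# Made-up sample records, using the LHCb space tokens mentioned in the minutes.
sample = [
    "2007-09-11 0001 token=LHCb_RAW size=1000",
    "2007-09-11 0002 token=LHCb_RDST size=2000",
    "2007-09-11 0003 token=LHCb_RAW size=500",
]
print(files_per_space(sample))  # -> {'LHCb_RAW': (2, 1500), 'LHCb_RDST': (1, 2000)}
```

A real implementation would have to follow the actual dCache billing format as configured at each site, which is why the plan starts from what the sites report.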
Status of the sites
A page has been made available summarizing the open problems for each site: Link
Sites available for ATLAS tests:
CASTOR@CERN with SRM v1
StoRM@CNAF with SRM v2 and DISK space token *not published in BDII*
Sites available for LHCb tests:
CASTOR@CERN with SRM v2
dCache@SARA with SRM v2 and LHCb_RAW and LHCb_RDST
dCache@Edinburgh with SRM v2 and LHCb_RDST space token (no tape!) *not published in BDII/not passing S2 tests*
dCache@FZK with SRM v2 and LHCb_RAW and LHCb_RDST *not published in BDII*
dCache@IN2P3 with SRM v2 and LHCb_RAW and LHCb_RDST *not published in BDII/not passing S2 tests*
L.Dell’Agnello asked whether LHCb has been testing StoRM at CNAF.
F.Donno replied that tests are being prepared. She is in contact with N.Brook and will ask him for information about the configuration of the space tokens.
H.Marten asked whether the SRM 2.2 installation at FZK can be done by the end of October.
J.Shiers replied that CMS CSA07 will finish only at the beginning of November and the dCache developers are already planning to do the installation on the 5th November.
M.Kasemann added that the 29th of October is the end date, but a few more days should be kept for contingency.
The final agreement was that the SRM 2.2 installation at FZK will be on the week of the 5th November.
2. Combined Computing Readiness Challenge (CCRC) 2008 - Status & Plans (Slides) - J.Shiers
J.Shiers proposed to start the work on the preparation of the CCRC 2008, as agreed at the WLCG Workshop.
The months proposed are February and May 2008. Both months will be used. Issues found in February will be fixed in April-May and re-tested in May.
In addition sites and experiments must ensure adequate resources (human and computing) for these crucial periods.
He then proposed that, for the organization of the Challenge, a Coordination team be formed with the following participants:
WLCG overall coordination (1)
- Maintains overall schedule
- Coordinate the definition of goals and metrics
- Coordinates regular preparation meetings
- During the CCRC’08 coordinates operations meetings with experiments and sites
- Coordinates the overall success evaluation
Each Experiment (4)
- Coordinates the definition of the experiment's goals and metrics
- Coordinates the experiment's preparations for load driving
- During CCRC'08 coordinates the experiment's operations
- Coordinates the evaluation of the experiment's success
Each Tier-1 (nT1)
- Coordinates the Tier-1's preparation and participation
- Ensures the readiness of the centre on the defined schedule
- Contributes to the summary document
18 September – Sites and Experiments will send to J.Shiers the name of their representative in the CCRC Coordination team.
3. NDGF Tier1 (Slides) - O.Smirnova
O.Smirnova reported the status and plans of the Nordic Data Grid Facility (NDGF), which is a WLCG Tier-1 site.
At the moment, in terms of sites, NDGF:
- Involves the 7 largest Nordic academic HPC centres, plus a handful of university centres
- Is connected to CERN directly with a GÉANT 10 Gbit fibre
- Has an inter-Nordic shared 10 Gbit network from NORDUnet, and a dedicated 10 Gbit LAN covering all Tier-1 nodes next year
The status of other services is the following (see more details on the slides):
CE - They support the CE (slide 3) on several OS installations (RHEL3, RHEL4, Ubuntu 6.06, RedHat 9, SLC4, CentOS 4, FC1) and support several CE flavours.
WN - The Worker Nodes (slide 4) typically run on the same OS as the CE (RHEL3, RHEL4, Ubuntu 6.06, RedHat 9, SLC4, CentOS 4, FC1) and are managed by local batch systems (not via a single entry point), without any grid middleware installed. Normally the WNs do not have inbound connectivity, and outbound connectivity is sometimes limited to certain IP ranges by the system administrators.
J.Gordon asked whether the NGDF sites submit their accounting data separately.
O.Smirnova replied that all resources are aggregated (manually) in a single “NDGF Tier-1” centre in APEL.
SE - Distributed over a wide area, running dCache 1.7-ndgf (a special version extended for distributed storage in NDGF). It uses GridFTP v2, and the pools run on all the above-mentioned systems (RHEL, CentOS, etc.). It is monitored by the same SAM tests as any other dCache SE.
The SRM is accessed via a single NDGF Tier1 storage entry point: “srm.ndgf.org”.
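A single entry point means that experiment tooling only has to compose and decompose SURLs against one endpoint. As an illustrative sketch (the `srm.ndgf.org` host is from the text; the port and path below are made-up examples), splitting a SURL into its components can be done with standard URL parsing:

```python
from urllib.parse import urlparse

def split_surl(surl):
    """Split an SRM URL (SURL) into (host, port, path)."""
    parsed = urlparse(surl)
    if parsed.scheme != "srm":
        raise ValueError("not an SRM URL: " + surl)
    return parsed.hostname, parsed.port, parsed.path

# Example SURL against the NDGF entry point (port and path are made-up).
host, port, path = split_surl("srm://srm.ndgf.org:8443/pnfs/ndgf.org/data/atlas/file1")
print(host, port, path)  # -> srm.ndgf.org 8443 /pnfs/ndgf.org/data/atlas/file1
```

Behind that single host, the distributed dCache instance then dispatches to the actual pool nodes, which is invisible to the client.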
VO Boxes - One VOBox for ATLAS that deals with data management only. 7 individual VOBoxes for ALICE (at each ALICE CE) plus one central that deals with data movement (xrootd-dCache). Not monitored by VO SAM tests currently.
FTS - Standard gLite FTS 2.0, stand-alone installation, depends on gLite, needs SLC3. But needs to be patched with NDGF GridFTPv2 patches. Not monitored by SAM currently
RLS - Central stand-alone installation running on Ubuntu. It is currently used instead of LFC because LFC runs only on SLC, and using LFC would have meant dramatic changes to all the clients. Not monitored by SAM currently.
sBDII - Stand-alone standard gLite BDII. Not used by NDGF-T1 services, but needed by external gLite services (monitoring, FTS). Monitored by a standard SAM test
I.Bird noted that all Tier-1 should have all the SAM tests for all the services. The SAM tests are mandatory in order to correctly monitor the level of service provided to the Experiments.
O.Smirnova agreed that the SAM tests should all be implemented, but for now the resources are all assigned to the installation and setup of the services. They will complete the SAM tests later. The review panel that is checking the tests for OSG and NDGF will report on which SAM tests are missing.
3D - Currently no RAC setup; this has been shown to be sufficient so far. Upgrade to a 5-node RAC next year.
Internal Monitoring - Every listed service is monitored by Nagios with custom tests. An Operator-on-Duty (OoD) service is established; it currently covers office hours and will soon be extended to include off-hours, on-call. All together, ~20 persons are involved in daily operations.
Tier-2 Relations – Currently they work mostly with Ljubljana, but other Nordic Tier-2 sites are being formed.
O.Smirnova concluded by reporting their latest achievement: on September 6, the “extended” Tier-1 processed 975 wall-time days of successful ATLAS production jobs (involving all the internal data movement). An average of 900 wall-time days per day for ATLAS alone has been the recent norm.
J.Gordon asked whether some of the sites that are in EGEE publish their data separately, in addition to publishing it for the NDGF Tier-1 site. Is this double counting?
O.Smirnova replied that the data is currently reported manually, and sites that are also in EGEE are counted separately in EGEE because they report different (non-WLCG) resources.
4. Job Priorities Working Group (Slides) – D.Liko
D.Liko reported on the status and plans of the Job Priorities working group.
4.1 Short-term Solution
The current work (for the short-term plan) is based on the model discussed last year:
- Minimal solution, not scalable to many groups
- As simple as possible; for example, wildcard notations for VOMS are not allowed
There is a note by S.Campana and A.Sciabà that contains all the details (see Word document).
The work required is in principle done; it only remains to:
1. Verify the configuration script (yaim)
2. Verify the general Information Provider
3. Test it on the certification testbed
4. Afterwards move to preproduction service
Due to a few minor issues, the implementation could not yet be tested on the certification testbed.
All experts have returned from holiday, and it should now be possible to complete certification and testing.
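The short-term scheme described above maps a small, fixed set of VOMS FQANs onto batch shares by exact match, with no wildcard support. A minimal sketch of such a mapping (the FQANs and share percentages below are made-up examples, not a site's real configuration):

```python
# Exact-match mapping from VOMS FQANs to batch-system shares.
# FQANs and share percentages are illustrative, not a real site config.
FQAN_TO_SHARE = {
    "/atlas/Role=production": 70,
    "/atlas/Role=software": 5,
    "/atlas": 25,
}

def share_for(fqan):
    """Return the batch share for an FQAN; exact match only (no wildcards)."""
    try:
        return FQAN_TO_SHARE[fqan]
    except KeyError:
        # Deliberately no pattern matching: the short-term solution
        # keeps the mapping as simple as possible.
        raise ValueError("unknown FQAN (wildcards not supported): " + fqan)

print(share_for("/atlas/Role=production"))  # -> 70
```

Keeping the mapping to exact matches is what makes the short-term solution simple but not scalable to many groups, since every new group or role needs an explicit entry.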
4.2 Medium-Term Solution
The TCG has taken up the issue, revisiting the usage and role of VOMS in the system. This issue goes beyond the Job Priority WG, as it also includes, for example, Data Management issues. It is clear that any development that goes beyond the short-term solution has to take into account the conclusions of this work.
The JP working group could organize some “brainstorming” sessions in September on:
- Transmission of parameters to the batch system, depending on the new CE
- Avoiding the use of Unix groups to transmit the role to the batch system
- Other ideas
I.Bird asked when the tests will start on the test bed.
D.Liko replied that this is imminent and is followed up weekly.
I.Bird asked that a report on the progress be presented to the MB soon, in order to make sure that there is progress.
18 September - Next week D.Liko will report a short update about the start of the tests.
J.Templon asked to be informed of any brainstorming sessions on the topic; he was unaware of this decision.
I.Bird replied that the TCG had only agreed that some discussion and brainstorming will be considered. But this work will go beyond the JP working group and a list of participants will be prepared in a week.
J.Gordon asked about the function “transmission of parameters to the batch system”, in relation to the new CE: is this function present in the Cream-CE? And in the LCG CE?
I.Bird replied that the passing of the parameters is a function of BLAH, which is used by the gLite CE and the Cream-CE. There are no resources to add it to the LCG CE at the moment.
J.Templon asked about the status of the document by Campana and Sciabà, because some feedback he had sent does not appear in the document.
D.Liko replied that he will make sure that the feedback is included and the latest version will be circulated to the MB mailing list.
21 September - D.Liko sends to the MB mailing list an updated version of the JP document, including the latest feedback.
6. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.