LCG Management Board
Tuesday 23 October 2007 16:00-17:00 – Phone Meeting
(Version 1 - 25.10.2007)
A.Aimar (notes), N.Brook, T.Cass, D.Collados, L.Dell’Agnello, T.Doyle, I.Fisk, J.Gordon, C.Grandi, F.Hernandez, E.Laure, H.Marten, B.Panzer, H.Renshall, L.Robertson (chair), O.Smirnova, J.Templon
Mailing List Archive:
Next Meeting: Tuesday 30 October 2007 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous meeting were approved.
1.2 High Level Milestones to Update (HLM Dashboard - 22.10.2007)
Sites should send their updates to the High Level Milestones to A.Aimar by Friday, 26 October 2007, because they are needed for the QR that will be prepared at the end of October.
1.3 Sites Names (Sites Names)
The MB had agreed that each Site would send a unique name by which it is to be identified.
See section 1.2 of the previous week's minutes. Sites should send that information to the MB list.
- SARA-NIKHEF >>> NL-T1
- CC-IN2P3 >>> FR-CCIN2P3
- FZK >>> DE-KIT
- INFN >>> IT-INFN-CNAF
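For scripts or reports that still carry the old identifiers, the renaming listed above could be applied with a small lookup table. This is only a sketch: the function name `canonical_site_name` is hypothetical, and only the four renames recorded in these minutes are included.

```python
# Old-to-new site names, exactly as listed in these minutes.
SITE_RENAMES = {
    "SARA-NIKHEF": "NL-T1",
    "CC-IN2P3": "FR-CCIN2P3",
    "FZK": "DE-KIT",
    "INFN": "IT-INFN-CNAF",
}


def canonical_site_name(name: str) -> str:
    """Return the new unique name for a site.

    Names with no recorded rename are returned unchanged (hypothetical helper).
    """
    cleaned = name.strip()
    return SITE_RENAMES.get(cleaned, cleaned)


if __name__ == "__main__":
    for old in ("FZK", "ASGC"):
        print(old, "->", canonical_site_name(old))
```

Sites that have not sent a new name simply pass through unchanged, so the table can be extended as further names reach the MB list.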
1.4 2007Q3 QR Reports Preparation (Document)
Simplified QR reports starting with 2007Q3 (Oct 2007) are being prepared.
The document will be prepared collecting material from Sites, Experiments and Projects and will include:
- WLCG High Level Milestones (A.Aimar)
- LCG Services (J.Shiers)
- Grid Deployment Board (J.Gordon)
- Report on all issues (in red) in the HL Milestones, Sites Reliability, Accounting and old incomplete QR milestones.
- A.Aimar will ask for information about each point that needs clarification.
- Sites should respond with text ready for the QR report.
2. FR-CCIN2P3 (ex CC-IN2P3)
4. DE-KIT (ex FZK)
5. IT-INFN-CNAF (ex INFN)
9. NL-T1 (ex SARA-NIKHEF)
11. US ATLAS
12. US CMS
Areas and Projects:
- QR reports as before
1. Applications Area
3. Distributed Databases
4. SRM Storage Services
- Presentations at the MB on activity in the quarter and plans for the next.
- Summarized by A.Aimar in MB minutes and for QR report
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
I.Bird could not be present at the meeting but he had told A.Aimar that the system is currently installed on the PPS and is being tested and certified.
I.Bird will send more information in an email to the MB list.
· 16 October 2007 - Sites should send the pointers to their documents about 24x7 and VO Boxes to A.Aimar. A.Aimar will prepare a protected web area for confidential documents of the LCG Management Board.
Done. The MB LCG Private Web Area (link) is available and the access is limited to the members of the MB mailing list. Sites can upload their documents. In case of problems contact A.Aimar.
Not done. The only Sites that have sent their acquisition plans are: ASGC, BNL, DE-KIT and FR-CCIN2P3. The others should send them to H.Renshall.
· 21 October 2007 - D.Barberis agreed to clarify with the Reviewers the kind of presentations and demos that they are expecting from the Experiments at the Comprehensive Review.
Ongoing. D.Barberis started the discussions with the Reviewers and with the other Computing Coordinators. He will send a summary via email in the next few days.
Below are the main issues and progress of the past week, as reported by H.Renshall.
dCache 1.8 patch 21 - The release of dCache 1.8 patch level 21 was installed at the Tier-1 Sites testing dCache, apart from NDGF. Unfortunately it exposed some configuration problems and new bugs, so essentially all dCache Sites, except NDGF, were unavailable for experiment tests in the past week. The developers have now released patches 22 and 23.
dCache 1.8 patch 23 - Patch 23 has been tested at DESY by doing a fresh install and was found to fix the patch 21 problems, but it exposed a new issue related to gridftp. This was reported to the developers and followed up with them. The gridftp issue turned out to be due to a configuration problem, which has been solved. Therefore, dCache 1.8-23 has now been released to all Sites for installation.
CERN Oracle problem - Yesterday there was a CERN Oracle problem that is not yet understood but was cured by restarting the database service. The problem has been escalated to Oracle support.
CASTOR and STORM at CNAF - At CNAF, CASTOR became unavailable yesterday and the reasons are not yet clear. CASTOR at CNAF has hardly been used by the Experiments because of its high instability/unavailability. STORM has been working very well.
L.Dell’Agnello added that it could be a problem with the configuration of the CASTOR stager. They will reconfigure CASTOR on the production instance. In addition, next week they will have the F2F CASTOR meeting at CNAF. The CASTOR developers should be at CNAF and will be able to help in solving the configuration problems.
SRM 2.2 Test Instance at RAL - RAL intends to make its SRM v2.2 test instance available for LHCb tests. The instance will be tested today or tomorrow by F.Donno to make sure it is adequately configured for LHCb.
ATLAS Transfers - ATLAS has been exercising transfers but without using space tokens. ATLAS will now pause testing till after their M5 detector cosmics run ends, on 5 November.
CASTOR with SRM 2.2 at CERN - CASTOR at CERN has not yet been tried as a source with SRM 2.2, but this is now planned for next week. LHCb are ready for this now but need external Sites to receive the data.
J.Templon added that installing SRM 2.2 at NL-T1 caused the SAM tests that use the information system to fail, because it was not configured properly. Therefore, in October, NL-T1 will probably be below the reliability targets, even though all systems were working correctly.
4. CCRC Update (Minutes)
There was a planning meeting on October 22nd. The agenda and the attached documents are available (link).
LHCb Targets Presented - LHCb (N.Brook) presented their CCRC'08 targets table. The Feb. to May data volume difference is because the May run has two extra weeks. The rates are the same, but since it runs twice as long there is twice the amount of data. The CCRC LHCb data can then be scratched.
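The scaling argument above can be checked with a line of arithmetic. In the sketch below the rate is a placeholder (the minutes give no figure), and the run lengths of two and four weeks are inferred from "two extra weeks" and "twice as long".

```python
def data_volume(rate_per_week: float, weeks: float) -> float:
    """Volume accumulated at a constant rate over a run of the given length."""
    return rate_per_week * weeks


RATE = 1.0                  # placeholder units per week; not a figure from the minutes
FEB_WEEKS = 2               # inferred: May is "twice as long" with "two extra weeks"
MAY_WEEKS = FEB_WEEKS + 2   # the two extra weeks of the May run

feb = data_volume(RATE, FEB_WEEKS)
may = data_volume(RATE, MAY_WEEKS)

# At equal rates the ratio depends only on the run lengths.
print(may / feb)
```

Because the rate cancels in the ratio, the factor of two holds whatever the actual LHCb rates are.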
ALICE and ATLAS Targets - ATLAS and ALICE have yet to submit their targets.
- ALICE (L.Betev) will do so by the end of this week. ALICE emphasised they will be using their normal production resources as required for p-p running.
- The ATLAS representative did not attend the meeting as he was involved in FDR planning during the current ATLAS week, so input is expected soon.
CMS Update - CMS (M.Kasemann) will use mixed mock and cosmics data. CMS knows how to separate these data at the SE level so that mock data can be deleted later.
Other Points Discussed
- F.Donno will present next week the storage resource requirements being gathered from the Experiments.
- H.Renshall pointed out a clarification email from ALICE stating that ALICE does not require the SLC4 WMS and can run with SLC3.
- PIC questioned LFC deployment for LHCb. Although 3D streams are being replicated to PIC, there is no LFC front-end deployed. They asked if they should deploy the current version or the next version where UID/GID is also replicated from CERN. N.Brook will check but first answer is that PIC should wait.
- H.Renshall reviewed the delivery schedule of the explicit software requirements. One unknown is the schedule for the fully functional glexec (see the later talk by J.Gordon). R.Trompert pointed out that NL-T1 has now announced Nov 19 as the date for its dCache 1.8 migration.
- A.Aimar presented his first draft of the CCRC08 milestones and he is now awaiting reactions from Sites and Experiments.
- Next week ATLAS and CMS will be visiting ASGC.
5. NDGF SAM Reliability Tests (Slides)
D.Collados presented the work that is being done to add to SAM tests that check the services available at NDGF. The goal is to develop availability tests at NDGF that are equivalent to those executed at the other WLCG Sites.
The information about the SAM tests for OSG and NDGF is at this page:
Below, in gray, are relevant extracts of the web page linked above.
For SE and BDII the tests are exactly the same. For the CE not all site tests have been implemented.
The extract below shows the equivalence with the standard LCG tests.
The tests without an equivalence are:
J.Templon and J.Gordon noted that if these tests are important for the other Sites then they should also be for NDGF and therefore be verified.
The MB agreed that:
- CE-sft-brokerinfo - no broker at NDGF. No need for an NDGF test.
- CE-sft-caver - this makes sure the latest CA rpms are available. This should be implemented and critical.
- CE-sft-csh – runs a C shell script and verifies that it executes. This should be implemented and critical.
- CE-sft-lcg-rm – checks the replica manager’s functions (copying files, moving files, etc.). Equivalent tests for ARC should be implemented and be critical; this covers CE-sft-lcg-rm-gfal, CE-sft-lcg-rm-cr, CE-sft-lcg-rm-cp, CE-sft-lcg-rm-rep and CE-sft-lcg-rm-del.
- CE-sft-softver – checks the version of gLite installed on the worker node. No need for an NDGF test.
- CE-host-cert-valid – explicitly checks the certificate of the host, as an invalid certificate would cause all tests to fail. This should be implemented.
L.Robertson proposed that D.Collados, Mattias Ellert and J.Templon agree on how the NDGF SAM tests above (or equivalent) are going to be implemented, and inform the MB members.
O.Smirnova asked where one can find a checklist of what each VO is checking, on which site, and which tests are executed.
D.Collados replied that the critical tests for the VOs are set by each VO in the FCR database. Which services are available at each site is in the BDII. He will send more information about it to O.Smirnova.
J.Gordon proposed to discuss the usage of pilot jobs and the implementation that can be achieved using “glexec”.
6.1 Overview of the Issue
gLExec has been discussed many times but Pilot Jobs are the more general issue. Three of the four LHC VOs want them (and other VOs too), but there are still many issues around trust and security.
Pilot jobs which download multiple payloads for the same user, the owner of the job, are not a big issue apart from the need for proper clean-up between jobs.
Multi-User Pilot Jobs, however, do present problems. Current policies say jobs should run under the identity of their owners. Multi-User Pilot Jobs break this policy unless they can change the identity of the job when they download a new payload. gLExec is one solution to this need.
A Grid Multi-User Pilot Jobs Policy document exists as draft 0.3 (https://edms.cern.ch/document/855383/1). This will go out to wider consultation soon. We hope that it will constrain the actions of the VOs sufficiently to reassure the Sites.
It contains the clause:
The details of the pilot job framework are as important as, or even more important than, gLExec. We do need the Experiments to document the details, not just provide access to the code.
The security concerns related to not switching identities are even bigger. The proxy of the pilot job owner is not protected from the owners of the user payload. A site which does not switch identities is putting the whole Grid at risk. The MB should “encourage” Sites to run gLExec in setuid mode.
If user credentials are transferred onto the WN, then this has to be done securely. This should really be done by proper delegation and with the user proxy being limited. We should require that gLExec is run in at least the Local authz/logging mode, and not allow the "do nothing" mode, as there will then be no check that a user has not been blacklisted. Pilot jobs should not run at Sites which do not run gLExec. “Setuid” mode is contentious but arguably the most secure mode.
Sites are reluctant to run setuid code on the WN. They are gradually coming round to logging mode but this is less secure.
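The mode requirement discussed above can be written down as a simple rule. The sketch below is not gLExec configuration: the enum members are hypothetical labels for the modes named in the discussion, ordered from least to most protective as described there.

```python
from enum import IntEnum


class GlexecMode(IntEnum):
    """Hypothetical labels for the site set-ups mentioned in the discussion."""
    NOT_INSTALLED = 0         # site does not run gLExec at all
    DO_NOTHING = 1            # no check that a user has been blacklisted
    LOCAL_AUTHZ_LOGGING = 2   # authorisation and logging, no identity switch
    SETUID = 3                # identity switch; contentious but arguably most secure


def pilot_jobs_allowed(mode: GlexecMode) -> bool:
    """Apply the proposed minimum: at least the local authz/logging mode."""
    return mode >= GlexecMode.LOCAL_AUTHZ_LOGGING


if __name__ == "__main__":
    for mode in GlexecMode:
        print(f"{mode.name:20s} pilot jobs allowed: {pilot_jobs_allowed(mode)}")
```

Raising the threshold to `GlexecMode.SETUID` would express the stricter position that setuid mode should be mandatory.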
J.Gordon suggested a strong statement from the MB requiring support for pilot jobs at Sites supporting the WLCG VOs.
F.Hernandez asked whether all batch systems used at the Tier-1 Sites support the change of UID of a job.
J.Gordon replied that this should be verified, as should how well glexec works on all batch systems.
C.Grandi added that it also depends on the configuration of the batch system; e.g. the PBS configuration may or may not allow this change of UID.
J.Templon added that in some cases forked processes cannot be handled/killed properly by all batch systems if they change UID.
J.Templon proposed that there be only one framework for the support of pilot jobs, i.e. glexec, if it is shown to be adequate.
C.Grandi replied that there are already some VOs and Sites that are using the Condor Glide-in solution.
J.Templon then noted that in this case the Condor Glide-in solution could be the common solution.
6.2 Next Steps
- The Grid Multi-User Pilot Jobs Policy needs to be finalised and approved.
- VOs need to publish their pilot job architectures for review by Sites.
- gLExec-on-WN needs to be certified and included in the list of rpms for the WN.
- YAIM needs to configure gLExec and LCAS/LCMAPS to understand and authorise gLExec.
- gLExec status needs publishing in RunTimeEnvironment
- LCAS/LCMAPS server certified, configured and released.
Once we have the first 4 steps complete, VOs can start running generic pilot jobs at selected Sites.
6.3 MB Statements Suggested
From slide 10:
- Sites that wish to support the LHC Experiments need to make a (time-limited?) commitment to support pilot job execution.
- gLExec will be a mandatory part of the experiment pilot job frameworks. Non-compliance will then become a MoU issue between Sites, their funding bodies and the Experiments.
- While Sites may have issues with running gLExec in setuid mode, the MB believes that there are security problems with running in non-setuid mode and so setuid should be mandatory.
- The MB requires Experiments to publish a description of the distributed parts of their pilot job frameworks for review.
- The MB should ask EGEE TCG to prioritise gLExec-related deployment (how high?).
If this proposal is too draconian the acceptance of pilot jobs should be reviewed and other solutions found.
N.Brook commented that it is now too late to change the strategy. The usage of pilot jobs has been de facto accepted at the GDB for more than one year; only the security details were sometimes discussed. LHCb made their security framework available to the Security team 18 months ago, without receiving any complaint about it.
I.Fisk answered that it has also been known for more than one year that the Sites are uncomfortable with using pilot jobs.
L.Robertson proposed that:
- gLExec should be used
- setuid should be used to change the user credentials
I.Fisk noted that the function required is that the site be able to keep track of who is running a job. There may be safer solutions than “setuid”. Setuid on worker nodes is not well received by all system administrators. F.Hernandez supported this statement.
L.Robertson asked what those other solutions are.
I.Fisk replied that setuid would allow changing the UID of a job several times, or forking other processes with several users. Other, more limited, solutions should be possible.
J.Gordon added that a job should not be able to fork many jobs on a WN, otherwise it would use more than one core per job on the node. The forked processes could use the other cores of the host. This behaviour should not be allowed.
L.Robertson expressed his disagreement about changing strategy now. The technical solutions chosen (known for more than one year) should be investigated and improved, rather than restarting with the design of new ones. If the current solution is reasonably secure it should be accepted before changing strategy again. Better solutions could be studied in the future.
I.Fisk requested that a careful review of the code be done; this might help acceptance at the Sites.
E.Laure said that the review is being started in EGEE and noted that glexec is part of the OSG release already.
I.Fisk clarified that glexec has not been used in the CMS instance at FNAL.
I.Fisk will ask R.Pordes for information about glexec with other OSG VOs and sites, and whether OSG has already reviewed the code of glexec from the security point of view.
J.Templon asked what would happen if a site's batch system cannot run glexec properly.
L.Robertson replied that either a workaround is found, or only jobs that do not need pilot jobs can use the site until a solution is found for that site.
N.Brook asked that a time limit be set for the reviews and verification steps; an approval is needed within a given amount of time, without returning to the issue after that.
Following the proposal of L.Robertson the MB agreed that the next steps to follow will be:
- The code of glexec should be thoroughly reviewed and validated by the Security group. The EGEE review team will be sufficient.
- The Experiments should publish the description and code of their frameworks and have it reviewed by the Security group and by any site wishing to do so.
- The usage of glexec with all batch systems at the Sites should be fully tested. One must make sure that the batch system can cope with the change of UID as done by glexec (BQS, SGE, PBS, PBS pro, Condor, LSF, etc).
If the steps and verifications above are successful, the Sites will be officially requested to accept pilot jobs and install the current implementation, which is based on gLExec.
J.Gordon and A.Aimar will send a proposal to the MB for an agreement on the acceptance of pilot jobs and glexec.
7. GDB Summary - J.Gordon
Postponed to next week.
2. Summary of New Actions
The full Action List, current and past items, will be on this wiki page before the next MB meeting.