LCG Management Board

Date/Time

Tuesday 2 September 2008, 16:00-17:00 - Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39169

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 10.9.2008)

Participants

A.Aimar (notes), D.Barberis, M.Barroso, I.Bird(chair), T.Cass, Ph.Charpentier, M.Ernst, X.Espinal, I. Fisk, S.Foffano, F.Giacomini, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, O.Keeble, U.Marconi, P.Mato, M.Lamanna, A.Pace, R.Pordes, Di Qing, M.Schulz, J.Shiers, R.Tafirout, J.Templon 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 9 September 2008 16:00-18:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

1.1      Minutes of Previous Meeting 

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions)

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm it.

About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.

M.Schulz noted that SCAS is still running at half the rate expected (10 Hz) and therefore is not ready yet.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       DONE. A document describing the shares wanted by ATLAS

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.

 

ATLAS will report on the status of the tests at the F2F next week.

  • 2 Sept 2008 - B.Panzer will distribute the address of the wiki page with the proposal and comments received about End User Analysis.

Done after the meeting. The wiki page is here: https://twiki.cern.ch/twiki/bin/view/Main/EndUserAnalysisScenario

  • 19 Aug 2008 - New service related milestones should be introduced for VOMS and GridView.

To be discussed at the MB in the future.

  • M.Schulz should present an updated list of SAM tests, for instance testing SRM2 and not SRM1.
  • J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.

These actions should be discussed when both M.Schulz and J.Shiers are present.

 

3.   LCG Operations Weekly Report (Slides) - J.Shiers

J.Shiers presented a summary of the status and progress of LCG Operations. This report covers the last two weeks.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      General

Participation in the daily meetings has remained rather constant over the summer. BNL, PIC and RAL are the most regular remote Tier-1 participants, along with GRIF (T2), NIKHEF and IN2P3. Not all Tier-1 sites participate.

 

The meetings have proven to be a valuable way of ensuring information flow and the follow-up / dispatching of problems. They will be even more essential in the coming weeks, but need to adapt to the realities of data taking and processing.

 

Based on the length of the daily minutes and the number of issues alone, the activity is greater than during the formal runs of CCRC’08. How will this change with real data taking? (x2? x10?)

 

The request to move the meeting time slot does not seem possible: later times would clash with existing meetings on at least 3 days, and an earlier slot is not practical for the North American sites.

3.2      Service Issues

There are still some gaps in the handling of routine service issues (interventions, priority follow-up, etc.):

-       Treatment of online DBs – use of IT status board like all other services

-       Coordination of DB service issues for external sites running CASTOR; the same version should be used everywhere

-       Some missing monitoring / alarms – e.g. network switches

-       Triggering emergency follow-up is possible – use it when needed!

 

Upgrades / interventions are still continuing at a very high rate. It is clear that we cannot ‘freeze’ services, but are all of the planned / wished-for interventions really needed?

 

Post-mortems are now regularly produced, and their level of detail and timeliness is now good.

 

Service coverage while staff take their remaining vacation is still an issue.

3.3      Sites Issues

CERN Router Problem - From 10:00 on Friday 22 August a switch in the CERN CC caused an overload that put a router into a bad state. This led to a degradation of the CASTOR service over the weekend, which meant that the first events (LHCb) were inaccessible.

 

CASTOR instance at RAL for ATLAS – bulk insert constraint violation – re-visit of cached cursor syndrome? Still being followed up.

 

CASTOR issues at CNAF – the ALARM procedure was used but follow-up was delayed; the ticket arrived only the day after, due to a bug in the ROC ticket system.

 

Oracle version for CASTOR - How / where is this coordinated? (T1 DBAs participate in 3D calls).

 

CERN-BNL OPN issues. This ‘link’ seems to have had more problems than others for quite some time (months).

 

CERN DB issues – online DBs play a critical role, as can be seen from daily service reports.

3.4      Services and Experiments

This ‘view’ tells a similar story; for time reasons there is no summary of the details from the last two weeks. Also, it seems that the daily minutes are now rather widely and regularly read (and they are linked from the agenda page of today’s MB).

 

What level of detail is required in these regular updates?

3.5      Conclusions

Had we been taking data this summer it would have been somewhat painful, to say the least. But we will be taking data – from pp collisions in the LHC – in a matter of weeks from now.

 

And very soon we have to start preparing for next year (many upgrades are in the pipeline!), while we have not yet finished the previous round (e.g. FTS is still not on SL4).

 

But this will be fundamentally different from the past, as reprocessing and first results (!) will also be very high on the agenda.

 

I.Bird noted that in the future sites should avoid going completely down for hours or days (plus draining the job queues, restarting them, etc.) and should instead perform as many interventions as possible in a transparent way, using roll-over techniques. Some sites manage to never go down; therefore all planned operations must be scheduled and organized across sites.

After-hours support should also be clarified at the different sites.

 

4.   Issues with lcgadmin (sgm) Pool Accounts

 

Ph.Charpentier summarized the email he had distributed to the MB.

The whole mail thread is available (login required) here: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

 

The lcgadmin role (a.k.a. sgm) is used by LHCb for managing the local software repository. The requirements are the following:

-       these users have to be able to manage the SW (i.e. install and remove software that is available as tar balls created by the VO release manager)

-       the SW repository (a.k.a. VO_LHCB_SW_DIR) has to be accessible (read/execute) to all users in the VO, but no write permission should be granted to any user but sgm in order to avoid accidental modification of the repository.

 

The implication of using pool accounts is that a UNIX group separate from the main VO group has to be created to which all the sgm pool accounts belong. The SW repository must thus have permissions rwxrwxr-x. 

 

Limitations, however, are due to the combined "features" of UNIX permissions and the tar utility:

-       tar can only set permissions that are more restrictive than those set inside the tar file. Usually files inside a tar file do not have group-write or other-read permission.

-       only the owner of a file can change its permissions. Even if a file has group-write permission, a member of the group cannot modify its permissions.

 

The result is that the job that installs the software has to set permissions recursively on the files it has just installed, depending on whether the UNIX account it runs as is a pool account or not (assuming a pool account name ends with digits). If all goes well, it usually works, provided the correct permissions and umask are set by the site admins...
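To illustrate the two-step logic described above, here is a minimal Python sketch. It is not the actual LHCb installation job; the helper name, the digit-based pool-account heuristic and the permission bits are assumptions for illustration only.

import os
import re
import tarfile

def is_pool_account(user: str) -> bool:
    # Heuristic mentioned in the minutes: pool account names end in digits.
    return bool(re.search(r"\d+$", user))

def install_release(tarball: str, sw_dir: str) -> None:
    # Step 1: unpack the release tarball into the shared software area.
    with tarfile.open(tarball) as tar:
        tar.extractall(sw_dir)

    # Step 2: when running as a pool account, recursively open up permissions
    # so that other sgm pool accounts (same group) can manage the files and
    # all VO users can read them. tar alone cannot do this, since it only
    # makes the archived permissions more restrictive.
    if is_pool_account(os.environ.get("USER", "")):
        for root, dirs, files in os.walk(sw_dir):
            for name in dirs:
                path = os.path.join(root, name)
                os.chmod(path, os.stat(path).st_mode | 0o075)  # group rwx, other rx
            for name in files:
                path = os.path.join(root, name)
                os.chmod(path, os.stat(path).st_mode | 0o064)  # group rw, other r
    # A failure anywhere between step 1 and the end of step 2 leaves files
    # owned by this pool account without group-write permission: the stuck
    # state described above, which no other pool account can repair.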

 

However, because it is not possible to create the files and set the permissions in an atomic operation, anything may happen (and does happen) during the installation. If the job fails for any external reason between these two operations (sometimes untar fails due to transfer corruption of the tarfiles, jobs fail due to a network problem accessing the SW repository, a WN failure...), the repository is left in a bad state with improper permissions.

 

If a subsequent job attempts to either set the permissions or remove these files, but is not running as the same UNIX user, it fails, because only the owner of a file can change its permissions. The result is that the site cannot be used and we have to issue a GGUS ticket to the site in order to get the repository cleaned up, and then reinstall everything from scratch. Our SW manager is currently struggling with more than 20 open GGUS tickets about problems with the SW repository caused by pool accounts (there are no such problems with static accounts).

 

An alternative would be to create the tarfiles with permission 777: this would allow the umask to set the proper permissions when untarring. But we are very reluctant to do this, as users untarring them without care could create real security holes (more severe than having world-read permission).
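For completeness, a small sketch of the arithmetic behind this 777 alternative, assuming GNU tar's default behaviour of applying the user's umask when extracting without --preserve-permissions; the umask values are illustrative.

archived_mode = 0o777                      # permission stored in the tarball
site_umask = 0o002                         # hypothetical umask for sgm accounts
effective_mode = archived_mode & ~site_umask   # 0o775 -> rwxrwxr-x, as desired
# ...but with a careless umask such as 0o000 the files stay 0o777
# (world-writable), which is the security hole LHCb is reluctant to risk.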

 

Another possibility is to use a single DN for lcgadmin, but this is just "using static accounts" at our level. We do not want to artificially borrow someone else's credentials, as software management is done by more than one person.

 

For this reason we would like to revert to a static account for the lcgadmin role as a baseline at all sites. We are ready to take the risk of compromising the whole LHCb VO in case of misbehaviour of one of our few SW managers, as these people are trusted at our level. In any case, if one of these persons wanted to behave maliciously, they could do much more damage by corrupting our software repository or introducing malicious executables. We have to trust them.

 

Could this issue be discussed in the near future and settled once and for all? Of course, if smart people can provide us with a safe solution that escaped our developers, we are more than happy to try it out.

 

 

He highlighted the fact that it is not possible to install the files and set the permissions in an atomic operation. Anything may happen (and often does happen) during the installations. If the job fails for any external reason between these two operations (sometimes untar fails due to transfer corruption of the tarfiles, jobs fail due to a network problem accessing the SW repository, a WN failure...), the repository is left in a bad state with improper permissions and the VO can no longer intervene.

 

LHCb’s request is to move to a single static account for installation as the solution for all sites: a single user account, instead of a single DN mapped differently at different sites.

 

F.Giacomini proposed to perform the installation in a temporary location and then move the whole directory tree with a single atomic operation. This solution will be investigated.
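A minimal sketch of this "stage, then publish atomically" idea, assuming the staging directory is created on the same filesystem as the final software area (os.rename is only atomic in that case); the function and parameter names are hypothetical.

import os
import shutil
import tarfile
import tempfile

def atomic_install(tarball: str, sw_dir: str, release: str) -> None:
    # Stage the release into a temporary directory created inside the
    # software area itself, so that the final rename stays on one
    # filesystem and is therefore a single atomic operation.
    staging = tempfile.mkdtemp(prefix=".staging-" + release + "-", dir=sw_dir)
    try:
        with tarfile.open(tarball) as tar:
            tar.extractall(staging)
        # Fix permissions while the tree is still private to this job.
        os.chmod(staging, 0o775)  # the release directory itself
        for root, dirs, files in os.walk(staging):
            for name in dirs:
                p = os.path.join(root, name)
                os.chmod(p, os.stat(p).st_mode | 0o075)  # group rwx, other rx
            for name in files:
                p = os.path.join(root, name)
                os.chmod(p, os.stat(p).st_mode | 0o064)  # group rw, other r
        # Publish the fully prepared release in one atomic step.
        os.rename(staging, os.path.join(sw_dir, release))
    except Exception:
        # Any failure before the rename leaves the shared area untouched;
        # the partially built staging tree is simply discarded.
        shutil.rmtree(staging, ignore_errors=True)
        raise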

 

J.Templon noted that NL-T1 opposes having many users mapped to, or using, a single user account. It is a general security issue that the NL-T1 security team does not accept.

The proposal of performing the installation via a normal job submission is acceptable because it keeps traceability.

 

Ph.Charpentier noted that a certificate or account for the service is needed because the same user will not always be available for the operation. Any misbehaviour can happen with installations via an sgm account or via a normal job submission. It is the VO that covers this security issue, and the role is used by 4-5 people for job submissions. The solution proposed by F.Giacomini seems promising and will be investigated; still, more than one person must be able to do the installations.

 

I.Fisk noted that even if the solution is an installation portal (as proposed), the user must still be authorized by the VO, and it will be more difficult to identify which user was mapped to the sgm account. CMS will use a single UID to install its software in the US until another solution is found.

 

J.Templon explained that the portal is a single machine from which the installation jobs can be launched so that, for instance, all logs are on that single host.

 

D.Barberis noted that ATLAS will not change its way of installing software in the coming months; the current method works and has been in use for two years. In the same way that sites and VOs trust the software developers and install and use the different components, sites should trust that the VOs monitor the security of the installations. Nobody is currently checking that the software and data files do not contain material that should not be distributed or that is insecure.

 

I.Bird proposed that, since this is currently an issue between NL-T1 and the LHCb installations, it should be clarified between them without changing the general installation methods.

 

Ph.Charpentier noted that if NL-T1 responds to LHCb’s requests for support within a couple of hours, then this is acceptable for LHCb.

J.Templon stated that if the software area is used improperly or has problems, it should not result in the site being considered "down", because it is the VO's fault.

 

5.   Client Software Distribution at Sites (Link) - O.Keeble

 

O.Keeble distributed a proposal on how to install software updates at the Sites. https://twiki.cern.ch/twiki/bin/view/EGEE/ClientDistributionProposal

 

The idea is to use existing mechanisms to install software on the Worker Nodes and solve some of the current distribution problems (installation at sites can take months of waiting, and multiple versions are currently not installed). One could instead quickly distribute the software and, if needed, roll it back, and also keep multiple versions of the middleware available to the VOs.

 

J.Gordon, F.Hernandez and J.Templon pointed out that some sites do many tests themselves before upgrading, because they also serve VOs other than the LCG ones and because there are different varieties of WNs (hardware and software) at their sites.

 

M.Schulz noted that VOs now ship some of the middleware, in the versions they need, together with their own software. But in case of problems the site does not know exactly which versions were used where when the problems occurred.

 

J.Templon requested that the current installation and distribution mechanism remain available in addition to any new method. NL-T1 needs to run tests and certify the middleware for other VOs. The versions should be available both via the automatic mechanism and in the traditional formats.

 

J.Gordon also requested that bug fixes be provided both for the latest versions distributed and for the versions distributed and installed at the sites with the current methods.

 

R.Pordes asked whether OSG is involved in this activity or whether it concerns only the EGEE sites. If it involves OSG, A.Roy should be contacted.

O.Keeble replied that currently this applies to the gLite sites only. OSG could use the same strategy for multiple versions if they want. He will contact A.Roy.

 

A.Heiss asked that failures of these automatic installations not be automatically attributed to the site, nor to the VO.

 

F.Hernandez suggested that the specific versions needed by a VO be installed in the VO's own area; sites do not really need to deploy many versions.

M.Schulz agreed that the VO-centric solution is also possible, and in fact VOs already install their own versions. Multiple installations of the same version, one per VO, are not a big problem.

 

D.Barberis noted that services/servers also need to be upgraded to match the versions used by the experiments (e.g. ATLAS software using a given version of the dCache client needs the installed dCache services to support it).

 

I.Bird concluded that a proposal should be discussed with the Experiments at the Architect Forum in the Application Area. O.Keeble will report on it.

 

6.   Proposal of new WLCG web (Link)

I.Bird reported that only a positive comment was received, and the new WLCG web site will be moved into production during the week.

 

If there are issues, please email lcg.office@cern.ch.

 

 

7.   AOB
 

 

Next week the MB will review the HL Milestones (Link); please send information to A.Aimar.

 

A.Heiss noted that 24x7 support is now operational at FZK.

 

 

 

8.    Summary of New Actions

 

 

No new actions.