LCG Management Board

Date/Time

Tuesday 19 August 2008, 16:00-17:00 - Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=33708

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 26.8.2008)

Participants

A.Aimar (notes), D.Barberis, O.Barring, L.Betev, I.Bird (chair), Ph.Charpentier, M.Ernst, I.Fisk, S.Foffano, F.Giacomini, J.Gordon, A.Heiss, M.Kasemann, D.Kelsey, U.Marconi, M.Lamanna, G.Merino, A.Pace, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 2 September 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

1.1      Minutes of Previous Meeting 

Regarding the minutes of the previous meeting, M.Ernst noted that in Section 3.2:

-       The trial change of the BNL site name was reverted, but it had already propagated without BNL knowing.

-       Currently sites have to refresh their FTS cache manually; it would be desirable for this to be automated.

 

The minutes of the previous MB meeting were approved.

1.2      SAM Dashboards (ALICE, ATLAS, LHCb)

M.Lamanna sent the links for the ALICE, ATLAS and LHCb implementations of the SAM Dashboard.

 

2.   Action List Review (List of actions)

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

Ongoing. It is installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should announce it.

J.Templon reported that the SCAS server seems to be ready and “certifiable” within a week. The client is still incomplete.
The server cannot yet reach the number of connections required, and the client is not fast enough to issue a large number of connection requests (unless it is deployed on many more hosts).

M.Schulz added that a reasonable rate for a site like CERN is 10 requests/sec; currently it reaches about 4-5 requests/sec.

  • 19 Aug 2008 - New service related milestones should be introduced for VOMS and GridView.

To be discussed at the MB.

  • M.Schulz should present an updated list of SAM tests, for instance testing SRM2 and not SRM1.
  • J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.

The actions above should be discussed when both M.Schulz and J.Shiers are present.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       A document describing the shares wanted by ATLAS should be provided.

-       Selected sites should deploy it and someone should follow it up.

-       Someone from the Operations team must be nominated to follow these deployments end-to-end.

 

D.Barberis mailed a status summary (by S.Campana and M.Lamanna) before the meeting.

J.Templon stated that NL-T1 is also willing to participate in these tests.

 


ATLAS is testing the Job Priority deployment in Italy at sites running PBS/Torque (Milano and Napoli). At the moment the system has been fully deployed at Milano. Two people from INFN-ATLAS (Napoli), Alessandra Doria and Elisa Musto, are in charge of actually running the tests.

ATLAS has defined a protocol to validate the system, starting with basic functional tests and moving on to a general verification of the behaviour of the system under load (where the share of analysis jobs over production jobs should evolve to the desired value, starting from a system running an arbitrary mixture of jobs). This is the same series of tests used in Certification and in the PPS by the EIS people.

All tests have been completed successfully in Milano. The results of the last test (clearly the most complex) are still being discussed. More information is expected within a month's time, also from the Napoli setup.

Further steps:
1) For complete deployment in INFN, the support for LSF needs to be understood.
2) The Job Priority system should be deployed somewhere outside Italy (INFN does some extra packaging on top of SA3). Sites with sustained load from analysis activity via WMS should be considered. Glasgow is a natural candidate (discussion with Graeme Stewart ongoing).

The ATLAS shares have been communicated in the past and have not changed. Tier-1 centres (assuming some analysis takes place in their infrastructure) should have an 80-20 split (Production-Analysis); Tier-2 centres should have a 50-50 split (Production-Analysis).
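As an illustration only, the following minimal Python sketch (a hypothetical fair-share loop, not the gLite job-priority implementation) shows how always dispatching the job class furthest below its target share makes an arbitrary initial mixture converge to the desired Production-Analysis split:

from collections import Counter

# Hypothetical target shares: 0.80/0.20 for a Tier-1, 0.50/0.50 for a Tier-2.
TARGETS = {"production": 0.80, "analysis": 0.20}

def next_class(dispatched: Counter) -> str:
    # Pick the job class currently furthest below its target share.
    total = sum(dispatched.values()) or 1
    return min(TARGETS, key=lambda c: dispatched[c] / total - TARGETS[c])

# Start from an arbitrary, analysis-dominated mixture and let the shares evolve.
dispatched = Counter({"production": 5, "analysis": 45})
for _ in range(1000):
    dispatched[next_class(dispatched)] += 1

total = sum(dispatched.values())
print({c: round(n / total, 2) for c, n in dispatched.items()})  # tends towards 0.8/0.2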

 

 

The MB agreed that this action will be verified again at the next MB meeting, in two weeks' time.

 

2.1      New Actions Proposed

J.Templon proposed a new action about converting requirements and pledges to the new CPU unit proposed by the HEP Benchmarking group.

J.Gordon replied that the issue was going to be discussed at the GDB.

 

Ph.Charpentier added that the list of packages and the upgrade procedures for the Worker Nodes should be discussed beforehand and then agreed at the GDB.

A mail from O.Keeble contained the link to the proposed procedures: https://twiki.cern.ch/twiki/bin/view/EGEE/ClientDistributionProposal

M.Barroso is collecting information from the site managers.

 

3.   LCG Operations Weekly Report (Slides) - H.Renshall

H.Renshall presented a summary of status and progress of the LCG Operations. This report covers the last two weeks.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

 

Not all sites participate in the meetings. There is systematic remote participation by BNL, RAL and PIC (and increasingly NL-T1). The other sites are encouraged to participate more as the holiday season ends and the LHC start-up approaches.

3.1      Sites Reports

CERN

Following the migration of the VOMS registration service to new hardware on 4 August, new registrations were not synchronised properly until late on 5 August due to a configuration error (the synchronisation was pointing to a test instance). For two days no registrations were recorded, but only one dteam user was affected.

 

The CASTOR upgrades to level 2.1.7-14 were completed for all the LHC experiments.

 

Installation of the Oracle July critical patch update has been completed in most production databases with only minor problems. The upgrade was transparent to the end users.

 

RAL

Many separate problems and activities have led to periods of non-participation in Experiment testing (migration from dCache to CASTOR, upgrade to CASTOR 2.1.7). There were some delays due to experts being on vacation (SRM), and several disk-full problems (stager LSF and a backend database).

 

J.Gordon explained that the “disk full problem” was due to the ATLAS instance of CASTOR generating an unsustainable number of queries and a huge, unforeseen growth of the log files. It seems to be due to some CASTOR retries and has been reported to the developers.

 

BNL

The long standing problem of unsuccessful automatic failover of the primary network links from BNL to TRIUMF and PIC to the secondary route over the OPN is thought to be understood and resolved. CERN will participate in some further testing.

 

On 14 August user analysis jobs doing excessive metadata lookups were found to be causing a PNFS slowdown, leading to SRM put/get timeouts and hence low transfer efficiencies. This could be a potential problem at many dCache sites.

 

PIC

On Friday 8 August PIC reported overload problems on their PBS master node, with time-out errors affecting the stability of their batch system. The node had been replaced on the Tuesday, so they decided to revert to the previous master node (done on Thursday evening), but this showed the same behaviour. They suspect issues in the configuration of the CE that forwards jobs to PBS and scheduled a 2-hour downtime on the following Monday morning to investigate.

 

FZK

There was a problem with LFC replication to GridKa for LHCb from 31 July until 6 August. Replication stopped shortly after the upgrade to Oracle 10.2.0.4: propagation from CERN to GridKa would end with an error (a TNS “connection dropped” error). At the same time the GridKa DBAs reported several software crashes on cluster node 1, and since 6 August cluster node 2 has been used as the destination. Further investigations are being performed by the GridKa DBAs to see whether the network/communication issues with node 1 are still present.

 

General

LFC 1.6.11 - which fixes the periodic LFC crashes - has passed through the PPS and will be released to production today or tomorrow.

A partial fix to the VDT limitation of 10 proxy delegations has been received; it makes WMS proxy renewal usable again by LHCb and should help ALICE, but it is not the final solution.

 

Ph.Charpentier asked whether the VDT patch has been deployed.

H.Renshall will report on the deployment of the patch outside the meeting.

 

F.Giacomini reported that the proxy mix-up problem has been understood but not yet fixed.

Ph.Charpentier noted that this is a real issue because users cannot submit jobs using different roles.

3.2      Experiments Reports

ALICE

Nothing special to report.

 

LHCb

LHCb has published to EGEE the requirement for sites to support multi-user pilot jobs – basically to support a new role=pilot in the YAIM configuration file.

 

Several CERN VOBoxes suffered occasional (every few days) kernel-panic crashes due to a known kernel issue. This is now fixed with this week’s kernel upgrades.

 

An issue has arisen (ongoing at NL-T1) over the use of pool accounts for the software group management (sgm) function. These require the VO software directory to be group writable. If the sgm pool accounts are in a different Unix group from the other LHCb accounts, the directories would also have to be world readable. This is currently affecting some SAM tests.
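A minimal Python sketch of the Unix permission modes involved (illustrative only; the actual directory layout and group arrangement at the sites are assumptions):

import stat

# If the sgm writers and the ordinary LHCb accounts share one Unix group,
# user and group read/write/search on the VO software directory suffice:
SAME_GROUP_MODE = stat.S_IRWXU | stat.S_IRWXG                                 # drwxrwx--- (0o770)

# If the sgm pool accounts are in a different group, the ordinary accounts can
# only reach the area via the "other" bits, so the directory must also be
# world readable and searchable:
SPLIT_GROUP_MODE = stat.S_IRWXU | stat.S_IRWXG | stat.S_IROTH | stat.S_IXOTH  # drwxrwxr-x (0o775)

print(oct(SAME_GROUP_MODE), oct(SPLIT_GROUP_MODE))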

 

J.Templon noted that the solution implies world-readable directories, about which many site administrators have concerns. Some VOs could have the same concern about having their applications, configuration and data files visible world-wide.

Ph.Charpentier noted that the issue exists at many sites and is not a problem for LHCb. He reminded the MB that, while NL-T1 has added some scripts to work around the issue, it should instead be fixed in the general pool-account solution.

 

J.Templon proposed that this issue be discussed at the next GDB meeting.

 

CMS

During the last two weeks CMS continued the pattern of a 1.5-day global cosmics run from Wednesday to Thursday.

 

A failure of the network switch to building 613 at 17.00 on 12 August stopped the AFS-cron service, which stopped some SAM test submissions and the flow of monitoring information into SLS. It was not fixed until 10.00 the next day.

 

CMS has been preparing for its CRUZET4 cosmics run, which in fact is not at ZEro Tesla but with the magnet on. The run started on 18 August and will last a week. A new CERN elog instance was requested for this, which exposed a dependency of this service on a single staff member currently on holiday.

 

CMS is preparing 21-hour computing support shifts coordinated across CERN and FNAL; the remaining 3-hour period is to be covered by ASGC.

 

ATLAS

Over the weekend of 9/10 August ATLAS ran one 12-hour cosmics run with only one terminating luminosity block. This splits into 16 streams, of which 5-6 were particularly big, resulting in unusually high (but successfully handled) rates to the receiving sites.

 

The AFS-cron service failure at 17.00 on 12 August stopped functional tests until 10.00 the next day, due to an unexpected dependency.

 

During the weekend of 16/17 August ATLAS changed the embedded cosmics data-type name in the raw data files to magnet-on without warning the sites, resulting in some data going into the wrong directories.

ATLAS is now performing 12-hour throughput tests (at full nominal rate) each Thursday from 10.00 to 22.00, with typically a few hours needed to ramp up to and down from the full rates. Outside this activity ATLAS is running functional tests at 10% of the nominal rate, plus cosmics runs at weekends and as/when scheduled by the detector teams.

 

ATLAS will start 24x7 computing support shifts at the end of August.

 

4.   Approval of security policy documents (More Information) - D.Kelsey

 

D.Kelsey, representing the JSPG, asked for the approval of the four documents below:

 

1. Virtual Organisation Operations Policy (See Version 1.6) https://edms.cern.ch/document/853968

 

2. Grid Security Traceability and Logging Policy (See Version 1.9) https://edms.cern.ch/document/428037  

 

3. Approval of Certification Authorities (See Version 2.8) https://edms.cern.ch/document/428038  

 

4. Policy on Grid Multi-User Pilot Jobs (See Version 0.6) https://edms.cern.ch/document/855383  

 

 

I.Bird noted that the only comments received were from OSG and he asked R.Pordes to present them to the MB.

 

R.Pordes explained that the current practices in OSG are consistent with the documents above, except for the traceability and logging policies.

For the other documents, the OSG expects to be consistent with the JSPG proposals soon.

 

Concerning identity switching (for multi-user pilot jobs), R.Pordes added that, since ATLAS and CMS approved it at the MB, once they request it from the OSG facilities representatives the sites will have to implement it.

D.Kelsey noted that the document does not require sites to run multi-user pilot jobs; it only specifies that, once VOs and sites agree on it, the only way to implement it is the one specified in document no. 4 above.

 

I.Fisk added that the policies are also correct for OSG; when implemented, they would follow the agreed guidelines. However, OSG cannot implement the traceability and logging requirements immediately.

D.Kelsey confirmed that compliance does not have to be immediate, but must be planned.

 

Decision:

The MB approved the proposed policies. Sites not fully complying should provide a schedule for when they will achieve compliance.

 

5.   End User Analysis - B.Panzer

 

B.Panzer reported that the working group on End User Analysis is reviewing the document he had distributed, which includes the scenarios, the architecture and the boundary conditions at CERN.

 

About half of the comments have been received and are being discussed. They are on a wiki page but will not be distributed until he is sure they represent the views of the Experiments (in particular the comments from CMS).

 

M.Kasemann confirmed that the wiki page can be distributed to the MB. There were no comments from the other Experiments.

 

New Action:

B.Panzer will distribute the address of the wiki page with the proposal and the comments received about End User Analysis.

 

I.Bird proposed that the working group report at the F2F meeting on 9 September 2008.

 

6.   LHC Grid Fest (Slides, Document) - I.Bird

6.1      LHC Grid Fest

I.Bird presented the upcoming LHC Grid Fest event. The document above is the schedule of the event.

This event is outside the MB scope but MB members could be involved.

 

The Slides show the logo of the LHC Grid Fest (slide 1).

6.2      New WLCG Logo

This is also the only opportunity to change the WLCG logo, because it will then be included in all brochures and handouts given to the press and media.

 

Slide 2 contains several proposals for a new WLCG logo.

I.Bird asked the MB's opinion on whether to change the logo or keep the current one permanently.

 

Decision:

The MB briefly discussed the implications of changing the logo and decided NOT to change the current WLCG logo.

7.   AOB
 

7.1      GDB for Tier-2 Sites - J.Gordon

J.Gordon reported that many Tier-2 sites are asking for advice about which hardware to purchase, but this may be outside the GDB mandate, or could become a specific topic in the future. There are, however, obvious topics that differ in substance or scale for the Tier-2 sites compared with the Tier-1 sites, on which much of the discussion at the MB and GDB has focussed:

-       Communication to many more sites/people

-       User support for the bulk of users, not the experts.

-       Middleware - which versions and deployment methods

-       Monitoring to improve reliability: this has been covered a lot recently, but perhaps an example from a Tier-2 site with remote monitoring and alarming could be presented.

 

Experiments could present their model for the usage and configuration of their Tier-2 sites and discuss possible issues.

-       Do the experiments interfere with each other at sites which support more than one, either in middleware, support paths or middleware versions?

 

D.Barberis noted that 10 September is not a good date because of the beam start-up and the Grid Fest.

I.Bird and J.Gordon proposed to move the GDB meeting about Tier-2 sites to October.

 

8.   Summary of New Actions

 

 

B.Panzer will distribute the address of the wiki page with the proposal and comments received about End User Analysis.

 

Converting the Experiments' requirements and the Sites' pledges to the new CPU units.

 

Agree on the software distribution and update procedures at the Tier-1 sites.