SLC5 Pilot Service: Description and Status.


  • Start Date: 1 Nov 2008
  • End Date: 15 Mar 2009
  • Description: Pilot Service of gLite SLC5 @ CERN-PROD
  • Coordinator: Ulrich Schwickerath
  • Contact e-mail: na
  • Status: Closed

Description

Since 6/10/2008 users have been able to evaluate SLC5 interactive and batch services at CERN. The entry point is an alias called lx64slc5. Please note that this service is still experimental, and that there are some known problems, which are listed on this page.

Feedback should be sent via the helpdesk to the service managers of the batch and interactive services. Before sending feedback, please check the list of known problems at https://twiki.cern.ch/twiki/bin/view/FIOgroup/Lx64SLC5Issues

Important changes

  • Enhanced security: several measures to improve the security of the systems have been applied, and more are still being implemented. This may affect the look-and-feel of the service.
  • SELinux is now enabled by default.
  • User accounts are now served from an LDAP server. Most of the 20000+ AFS accounts are no longer present in /etc/passwd. This will affect utilities that use standard methods to look up password-file entries.
  • pine has been replaced by a new product called alpine.
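
The LDAP change matters most for tools that parse /etc/passwd themselves: they will no longer find most accounts, whereas a lookup through the C library can be answered by whichever sources the node's name service is configured with. The following is a minimal sketch of that difference, not a description of the actual set-up. It assumes the name service on the node includes LDAP, and the helper names `lookup_account` and `listed_in_passwd_file` are hypothetical.

```python
import pwd


def lookup_account(name):
    """Return the passwd entry for name via the C library, or None if unknown.

    pwd.getpwnam() goes through the system's name service, so it can see
    accounts that are not listed in the local /etc/passwd file.
    """
    try:
        return pwd.getpwnam(name)
    except KeyError:
        return None


def listed_in_passwd_file(name, path="/etc/passwd"):
    """Return True only if name has a line in the given passwd file itself."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # The account name is the first colon-separated field.
            if line.split(":", 1)[0] == name:
                return True
    return False
```

On a node where an account comes only from LDAP, `lookup_account("someuser")` would be expected to return an entry while `listed_in_passwd_file("someuser")` returns False; scripts written in the second style are the ones likely to need changes.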

Overall Planning

  • Use cases:
SLC5 resources will gradually replace SLC4 resources. It is planned to install new worker nodes arriving at CERN with SLC5.

  • Objective and Metrics:
The objectives of this pilot are:
    • installation and verification of an SLC5 based testbed
    • test the production readiness of this architecture
    • offer a platform to the experiments for certification of their software

  • Timelines:
The production release of SLC5 marks the end of this pilot. This is foreseen for the end of January 2009, when all new CPU resources will be installed with SLC5 only.

  • Detailed planning:
    slc5pilotplan.gif

  1. By 12 December: upgrade to the new version of the WNs.
  2. Week of 15-19 December: tests by the experiments, with two days (Tuesday and Wednesday) reserved for LHCb.
  3. Christmas break: draining of two CEs from the production set.
  4. January: ramp-up of the resources with new nodes (to be delivered in January).
  5. End of January: formal opening of the production SLC5 subcluster (end of the pilot; ce110 is released for other tests).

Technical documentation

Installation Documentation

Installation of these services follows the procedures used by CERN-PROD, that is, the nodes are Quattor-managed.

A second release of the combined x86/x86_64 SL5 WN is now available. The repository is temporary and mirrored by Quattor, so it may become unavailable in the future.

Yum repo:

[glite-WN]
name=gLite 3.2 WN
baseurl=http://grid-deployment.web.cern.ch/grid-deployment/glite/test/R3.2/glite-WN/sl5/x86_64/
enabled=1
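
Since yum .repo files use INI syntax, the stanza above can be sanity-checked with Python's standard `configparser` before it is placed under /etc/yum.repos.d/ (the usual location for yum repository definitions). This is only a sketch: the `check_repo` helper is hypothetical, and the checks it performs are an assumption about what is worth verifying.

```python
import configparser

REPO_STANZA = """\
[glite-WN]
name=gLite 3.2 WN
baseurl=http://grid-deployment.web.cern.ch/grid-deployment/glite/test/R3.2/glite-WN/sl5/x86_64/
enabled=1
"""


def check_repo(text, repo_id):
    """Parse a yum .repo stanza and return its baseurl if the repo is enabled."""
    # Interpolation is disabled so that '%' characters in URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(repo_id):
        raise ValueError(f"no [{repo_id}] section found")
    section = parser[repo_id]
    # yum treats a repository as enabled when the option is absent.
    if section.get("enabled", "1") != "1":
        raise ValueError(f"repository {repo_id} is disabled")
    return section["baseurl"]


print(check_repo(REPO_STANZA, "glite-WN"))
```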

Pilot Layout

At the start of the pilot, two batch nodes and two lxplus nodes are available. The batch nodes offer 18 job slots in total. It is planned to increase the available resources significantly in the coming weeks and months.

ce110 is configured to submit jobs to the SLC5 batch resources.

Update 2008-Nov-11: the set-up is now based on a beta version of the 64-bit SLC5 WN.

Update 2008-Dec-01: at the request of CMS, CE110 will be published in prod-bdii in "Production" status. This was proposed in the Grid operations meeting on 1/12/2008, and no objections were raised. The change becomes active on the morning of 2/12/2008. Users should match the wanted OS version in their jobs to avoid problems.
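
Matching on the OS version happens at submission time, but a job can also guard itself once it lands on a worker node. The following is a minimal sketch of such a guard and is not part of the pilot set-up. It assumes the node reports its release in /etc/redhat-release in a form like "Scientific Linux CERN SLC release 5.3 (Boron)"; the function names are hypothetical.

```python
import re


def major_release(release_text):
    """Extract the major version number from a redhat-release style string."""
    match = re.search(r"release\s+(\d+)", release_text)
    if match is None:
        raise ValueError(f"unrecognised release string: {release_text!r}")
    return int(match.group(1))


def node_matches(wanted_major, path="/etc/redhat-release"):
    """Return True if the node's release file reports the wanted major version."""
    with open(path, encoding="utf-8") as handle:
        return major_release(handle.read()) == wanted_major
```

A job wrapper could call `node_matches(5)` and exit early with a clear message, rather than fail later on missing or incompatible libraries.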

Update 2008-Dec-19: CE110 is back in Preproduction status after having been reserved for two days for LHCb tests, during which it was published in Production status. CE128 and CE129 have been put into draining mode; after the Christmas break, it is planned to convert them into SLC5 submitting nodes in production mode. The number of worker nodes has been increased, and the new resources will hopefully be made available for the Christmas break. In total this corresponds to 522 job slots.
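
As a consistency check on the slot counts, the figures on this page can be compared per node. This sketch assumes that the 522 job slots refer to the 58 worker nodes mentioned under History, which the page does not state explicitly; the helper name is hypothetical.

```python
def slots_per_node(job_slots, nodes):
    """Return the number of job slots per node, requiring an even split."""
    per_node, remainder = divmod(job_slots, nodes)
    if remainder:
        raise ValueError("job slots do not divide evenly across nodes")
    return per_node


print(slots_per_node(18, 2))    # pilot start: prints 9
print(slots_per_node(522, 58))  # after the December increase: prints 9
```

Under that assumption both configurations work out to the same number of slots per node.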

Update 2009-Feb: Currently, public lxbatch has 984 ksi2k (in wmperf) in 64-bit SLC5 and 7693 in 64-bit SLC4, so SLC5 accounts for some 12% of the current public capacity. Note that new nodes will be installed with 64-bit SLC5 only, as soon as they become available.
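
The "some 12%" can be checked from the two figures. The sketch below assumes that the public capacity consists only of these two pools, which the update does not state; under that assumption the SLC5 share of the total comes out slightly below 12%, and the SLC5 capacity relative to SLC4 slightly above.

```python
SLC5_KSI2K = 984
SLC4_KSI2K = 7693


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of the combined SLC4 + SLC5 capacity that is SLC5.
share_of_total = percent(SLC5_KSI2K, SLC5_KSI2K + SLC4_KSI2K)
# SLC5 capacity expressed relative to the SLC4 capacity alone.
relative_to_slc4 = percent(SLC5_KSI2K, SLC4_KSI2K)

print(share_of_total, relative_to_slc4)  # prints 11.3 12.8
```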

Tasks and actions:

Actions are tracked via TASK:7981, available from the PPS task tracker.

Results

Feedback from the experiments

Alice is running production jobs on the node and the first feedback is very positive.

CMS has run a small production with ProdAgent (55 events, 11 jobs) without finding any problem.

LHCb tested AFS token grabbing, which is working. They are investigating an issue they have seen with DIRAC3. Binaries can run once some additional libraries are installed, and the VO has created a tarball containing those missing libraries. The VO is ready to install the libraries with the application, but the release team is considering the option of distributing them with the WNs.

At a recent Architects Forum, though, the need for a rapid migration of worker nodes to SL5 was reiterated: "Every experiment is interested on a transition to SLC5/64bit of the Grid resources as soon and as short as possible." The full minutes are available at http://lcgapp.cern.ch/project/mgmt/AFMinutes20090219.html .

Comments and issues from operations

History

11-Nov-08: Upgrade to a newer version of WNs

11-Dec-08: existing nodes were upgraded with the new WN software, based on Java 1.6 and VDT1.10

12-Dec-08: the tentative end date of the pilot was agreed to be the end of January

17-Dec-08: two days of testing were reserved for LHCb as agreed, and the CE ce110 is now back to publishing 'Preproduction' state. Alice can start to use the CE again

17-Dec-08: there are now 58 nodes behind this CE. They are expected to go into production before Christmas

19-Dec-08: production release of 58 dual quad-core SLC5 WNs in total

06-Jan-09: migration of ce128 and ce129 to SLC5 submission. The nodes are kept closed and in scheduled downtime. The state change to production is planned for 19-Jan-09, when the scheduled downtime ends.

13-Jan-09: BUG:45852 opened for LD_LIBRARY_PATH incorrectly set

Topic attachments
slc5pilotplan.gif (r1, 10.3 K, 2008-12-12 17:37, AntonioRetico): Detailed planning
Topic revision: r17 - 2009-03-04 - AntonioRetico
 