ATLAS Edinburgh ECDF Project Tasks

Middleware Service List

| Host | Type | Service and plan |
| mw05 | CE | primary lcg-CE Decommission once lcg-CE is not supported, or after 2 cream CEs are in production |
| ce1 | CE | cream CE Real host, provision on "mon" host |
| ce2 | CE | secondary lcg-CE Decommission once 2 cream CEs are in production |
| ce3 | CE | cream CE virtual instance, provision before 31/12/10 DONE |
| ce4 | CE | arc CE for scotgrid tests, provision once mon2 decommissioned |
| srm | SE | DPM SE update service to run on SL5 pivot on a new server |
| srm (pivot) | SE | DPM SE Real host, install SL5 DPM and switch DNS |
| se2 | SE | DPM SE test SE, decommission |
| se3 | SE | Storm SE storm SE |
| pool1,2 | STOR | pool servers phase handover of pool1 and pool2 storage |
| pool3-5 | STOR | pool servers bring online 10,11/10 |
| info | BDII | site BDII continue service |
| mon1 | MON | glite-APEL new SL5 accounting, provision before 31/12/10 DONE |
| mon2 | MON | APEL accounting decommission once mon1 running |
| ui | UI | local UI service provision after se2 decommission |
| squid | PROXY | proxy service provision after se2 decommission |
| argus | AUTH | authN server for multi-user pilot jobs no provision plan yet |
| vosw1-3 | VOSW | VO software NFS server, CVMFS test no provision plan yet |

Service Deployment

| slot | host | existing | phase 1 (until 2/11) | phase 2 (from 2/11) |
| 1 | mw04 | CE2, SE2 (1) | CE2, UI(1), PROXY(1) | CE2, UI(1), PROXY(1) |
| 2 | mw05 | "mw05" | "mw05" | failover |
| 4 | ui | CE3 | CE3, MON1 | CE3, MON1 |
| 5 | srm | "srm" | "srm" | SE1, AUTH |
| 6 | mon | empty | CE1 | CE1, SE2 |
| 7 | - | empty | srm (pivot) | VOSW2 |
| 8 | - | empty | empty | VOSW1, VOSW3 |
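
The deployment plan can also be held as data so that questions such as "which slot hosts CE1 in phase 1?" or "which slots are still empty?" can be answered by a script. The following is only a sketch: the names DEPLOYMENT, slots_for and empty_slots are illustrative, the "(1)" annotations from the table are omitted, and "failover" is treated as an occupied role rather than a service.

```python
# Sketch of the Service Deployment table as a lookup structure.
# Names and layout are illustrative assumptions, not an existing tool.

PHASES = ("existing", "phase 1", "phase 2")

# slot -> (host, existing, phase 1, phase 2); "-" marks a slot with no named host.
DEPLOYMENT = {
    1: ("mw04", ["CE2", "SE2"], ["CE2", "UI", "PROXY"], ["CE2", "UI", "PROXY"]),
    2: ("mw05", ["mw05"], ["mw05"], ["failover"]),
    4: ("ui", ["CE3"], ["CE3", "MON1"], ["CE3", "MON1"]),
    5: ("srm", ["srm"], ["srm"], ["SE1", "AUTH"]),
    6: ("mon", [], ["CE1"], ["CE1", "SE2"]),
    7: ("-", [], ["srm (pivot)"], ["VOSW2"]),
    8: ("-", [], [], ["VOSW1", "VOSW3"]),
}


def services(slot, phase):
    """Return the list of services planned for a slot in the given phase."""
    return DEPLOYMENT[slot][1 + PHASES.index(phase)]


def slots_for(service, phase):
    """Return the slot numbers that host `service` in the given phase."""
    return sorted(s for s in DEPLOYMENT if service in services(s, phase))


def empty_slots(phase):
    """Return the slot numbers with nothing deployed in the given phase."""
    return sorted(s for s in DEPLOYMENT if not services(s, phase))
```

For example, slots_for("CE1", "phase 1") gives the slot currently labelled "mon", and empty_slots("phase 2") shows whether any slot is left unused once phase 2 is complete.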

Virtualisation Requirements

Server estate

  • 8 hypervisors (with 1 failover) will be enough to host all middleware services required over the next six months
  • Replacement servers are provided upon host failure (with a pre-defined replacement window).

Server/Service Access

  • Administrator access and console control of each hypervisor.
  • Administrator access on all virtual instances.
  • Identify any "opt-out" areas where admin access cannot be given outside of the systems team.
  • Ability to power cycle hypervisors and all virtual instances outside of OS.

Hypervisor features

  • Power management of running instances.
  • Ability to create new instances.
  • Ability to snapshot existing instances.
  • Restore image from backup/snapshots.
  • Live migration (optional).
  • Live tuning of memory and CPU resources (optional).
  • Export monitoring data to nagios and ganglia.
  • Network accessible from grid-admin and eddie.
  • Has access to a snapshot/backup repository.
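
As an illustration of the Nagios export requirement, the sketch below summarises instance power states using the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the states are obtained depends on the hypervisor chosen, so the input dictionary and the state name "running" are assumptions made for the example.

```python
# Sketch of a Nagios-style summary of instance states on one hypervisor.
# The input format and the "running" state name are assumptions.
from typing import Dict, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_instances(states: Dict[str, str]) -> Tuple[int, str]:
    """Map {instance name: power state} to a (return code, status line) pair."""
    if not states:
        return UNKNOWN, "UNKNOWN - no instances reported"
    down = sorted(name for name, state in states.items() if state != "running")
    if down:
        return CRITICAL, "CRITICAL - %d/%d instances not running: %s" % (
            len(down), len(states), ", ".join(down))
    return OK, "OK - %d instances running" % len(states)
```

A plugin wrapper would print the status line and exit with the return code, which is how Nagios reads the result of a check.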

Instance features

  • Access to GPFS via NFS mounts (work (rw), applications (ro)).
  • All instances network accessible from grid-admin and eddie.
  • Network accessible from each of the virtual instances
  • Keep storage of instance lightweight, but enough space for logs.
  • Log backup/rotation to dedicated backup area (optional).
  • Instances accessible from ppe network (optional).

Backup and Restore

  • Ability to make on-demand backup for each service.
  • Ability to make scheduled backup for each service.
  • Retain backup files in central repository (up to a certain date/space limitation).
  • Restore will be run only on host failure / outage > 12 hours.
  • Rapid restore of core services on failover hypervisor
  • No issues with host/IP conflicts or firewall upon instance restore
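
The retention and restore rules above could be expressed roughly as follows. This is a sketch under stated assumptions: the actual date and space limits are not given in these notes, so max_age_days and max_total_gb are placeholder parameters, and the restore rule is read as "host failure, or an outage longer than 12 hours".

```python
# Sketch of the backup retention and restore-trigger rules.
# Backup, prune and restore_needed are illustrative names; limits are placeholders.
from dataclasses import dataclass
from datetime import datetime, timedelta

RESTORE_THRESHOLD_HOURS = 12  # taken from the requirement list


@dataclass
class Backup:
    service: str
    taken: datetime
    size_gb: float


def restore_needed(host_failed, outage_hours):
    """Restore only on host failure or an outage longer than the threshold."""
    return host_failed or outage_hours > RESTORE_THRESHOLD_HOURS


def prune(backups, now, max_age_days, max_total_gb):
    """Return the backups to keep, newest first.

    Backups older than max_age_days are dropped; of the rest, the newest
    are kept until adding the next one would exceed max_total_gb.
    """
    cutoff = now - timedelta(days=max_age_days)
    recent = sorted((b for b in backups if b.taken >= cutoff),
                    key=lambda b: b.taken, reverse=True)
    kept, total = [], 0.0
    for b in recent:
        if total + b.size_gb > max_total_gb:
            break
        kept.append(b)
        total += b.size_gb
    return kept
```

Keeping the newest backups first means a space limit removes the oldest images, which matches the intent of retaining files only up to a certain date or size.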

Future tasks

  • Deploy and test ECDF VM solution
  • Dedicated NFS (or CVMFS) host/service
  • Consolidate site monitoring
  • Configuration management service
  • Change management (practice for reversing all changes)
  • Intervention logbook
  • Documentation!
  • Multi-user pilot jobs
  • ArcCE testing and production
  • ATLAS analysis readiness
  • Job efficiency studies
  • Production squid service

ECDF/GridPP research projects

  • CVMFS performance testing
  • Advertise GPU queue on the grid
  • Whole node Worker Nodes for multicore studies.
  • High memory server testing


| Process | Responsible |
| Host Certs | Steve |
| Security updates | Steve + someone per node for kernels |
| lcg-CA | Andy/Steve/Orlando? |
| Availability Monitoring | Wahid/Andy |
| ATLAS / experiment monitor | Wahid/Andy |
| GGUS ticket response | Wahid/Andy |

Open questions

  • Are other virt services (mwvm01, mwvm02, Galaxy) still being run on mw04?
  • Are admin03, mw01, mw02, mw03, grid2 available for gridpp use in the near future?

-- AndrewWashbrook - 17-Nov-2010

Topic revision: r7 - 2011-01-06 - AndrewWashbrook