---+!! Lightweight sites to end of 2016

This is an archived page, reflecting the status at the end of 2016.

%TOC%

---++ Introduction

One of the goals of WLCG Operations Coordination activities is to help simplify what the majority of the sites, i.e. the smaller ones, need to do to be able to contribute resources in a useful manner, i.e. with %GREEN% _large benefits compared to efforts invested_. %BLACK%

Classic grid sites may profit from %GREEN% _simpler mechanisms_ %BLACK% to deploy and manage services. Moreover, we may be able to get rid of some service types in the end. New sites may rather want to go into one of the %GREEN% _cloud directions_ %BLACK% that we will collect and document. There may be different options also %GREEN% _depending on the experiment(s)_ %BLACK% that the site supports.

There is no one-size-fits-all solution. We will rather have a %GREEN% _matrix of possible approaches_, %BLACK% allowing any site to check which ones could work in its situation, and then pick the best.

---++ Lightweight Sites Profile

The [[LightweightSitesProfile][Lightweight Sites Profile]] collects interfaces we need to retain or replicate.

---++ May 2016 GDB session summary

[[http://indico.cern.ch/event/394782/][agenda]]

---+++ Boundaries

   * The session was about small sites
      * T0 and T1 are not directly targeted
      * But may profit from common simplifications
   * Build on earlier discussions and ideas
      * [[http://indico.cern.ch/event/345619/contributions/814488/][Optimisation of operational costs]], presented in Okinawa
      * [[http://indico.cern.ch/event/345619/contributions/814496/][Resource provisioning]], ditto
      * Several sessions in the [[http://indico.cern.ch/event/433164/timetable/][WLCG workshop in Lisbon]]
   * Storage and data access are mostly handled elsewhere
      * Various demonstrator projects, e.g. presented in the [[http://indico.cern.ch/event/467565/][April MB]]
      * Federations, caches, Ceph, ...
   * Here we are more concerned with services needed to enable computing at a site
      * CE
      * Batch system
      * Cloud setups
      * !AuthZ system
      * Info system
      * Accounting
      * CVMFS Squid
      * Monitoring
   * Storage service deployment may also profit from generic simplifications pursued here

---+++ Storage

   * ATLAS have 75% of their storage provided by just 30 sites
      * the remaining 25% are provided by a long tail of small sites
      * storage is difficult to manage and requires significant support effort
      * small sites typically cannot afford the desired level of local support
   * Big vs. small sites
   * "Real" storage vs. caches?
   * Caches
      * Inconsistencies are "normal" and should be resolved automatically
      * Internal per site vs. externally visible?
      * !ARC CE has a cache by design
   * Object stores?
      * _Ceph_ is on the rise
      * More common, less grid-specific
      * More defensible to spend effort on

---+++ T2 vs. T3 sites

   * T3: a site that did not sign the WLCG !MoU
      * T3 sites typically are dedicated to a single experiment
      * Can take advantage of shortcuts
      * Can be pure !AliEn / DIRAC / ... sites
      * E.g. !AliEn integration with !OpenStack (Bergen Univ. Coll.)
      * ...
   * T2 sites have rules that apply
      * Accounting into the EGI / OSG / WLCG repository
      * Availability / Reliability targets
      * EGI: presence in the info system, at least for the Ops VO
      * Security regulations
         * Mandatory OS and MW updates and upgrades
         * Isolation & traceability
         * Security tests and challenges

---+++ T2 simplifications

   * %GREEN% *Reduce* %BLACK% the catalog of required services, where possible
   * %GREEN% *Replace* %BLACK% classic, complex services with a new, simpler portfolio
      * And more %GREEN% *common* %BLACK% technologies, less grid-specific
   * %GREEN% *Simplify* %BLACK% deployment, maintenance and operation of services
   * Some sites could offer a partial portfolio, e.g. just "worker nodes"
      * Need to get their contributions properly recognized

---++++ Selected items from Mikalai's talk

[[http://indico.cern.ch/event/394782/contributions/2154318/attachments/1270703/1882858/WLCG_GDB_May-2016_Kutouski_joining_resources_simplification.pdf][talk]]

   * Deploy via easy installers
      * But: need to make such tools, support many options, ...
   * Grid services: make use of VM/CT/... images
   * Cloud services: e.g. use !OpenStack Fuel
   * Small sites may provide just "worker nodes"
      * Drive them directly from "WLCG"
      * Or integrate them first per NGI/region/...
      * Compatibility?
   * Outsource complex management to remote experts

---+++ CE and batch systems

   * CE flavors
      * !ARC
      * !CREAM
         * Said to be the most complex
         * Maybe needed for other VOs supported by the site
      * HTCondor
         * On the rise
   * Batch systems
      * More variety, possibly with complex configurations
      * May not be easy to change at a site (other customers)
      * The majority of sites still use PBS/Torque
         * Still the default in EGI
      * HTCondor on the rise
         * Default in OSG
   * Reductions in this phase space (fewer CE and batch system combinations) would help

---+++ Accounting

   * Can it be simplified for classic grid sites?
   * A site's APEL host assembles records from the CE and batch system
      * APEL would benefit from fewer CE and batch system flavors to support
   * !ARC CE publishes directly into the central APEL service
      * HTCondor CE?
   * Transition toward cloud systems ought to help

---+++ Cloud systems

   * Tap directly into local cloud deployments at sites
      * But !OpenStack etc. may not be so easy for them either
   * Paradigm shift: batch slots are replaced with VM instances
      * Need to ensure proper accounting
   * Experiments can handle this today, but:
      * would like to see less variety, fewer interfaces to deal with
      * do not want to become sysadmins of the acquired resources
   * Other supported VOs may be unable to use such resources
      * Sites may still need to run their classic setups instead
      * Or even in parallel

---++++ Selected items from Andrew's talk

[[http://indico.cern.ch/event/394782/contributions/2154317/attachments/1271020/1884096/20160510-mcnab-vac-etc.pdf][talk]]

   * Instantiate ready-made VMs at sites
      * Ready to take on work for a supported experiment
   * Vac: autonomous hypervisors
      * Easy installer: Vac-in-a-Box
   * Vcycle: instantiate via the !OpenStack API etc. (see the sketch below)
      * Can be done remotely
   * VMs shut themselves down when there is no work for their VO
      * Allow fair shares
   * In production in the UK for ATLAS, LHCb, !GridPP
      * CMS in progress
---+++ Authorization systems

   * Argus is now supported under the aegis of Indigo-DataCloud
      * Version 1.7 is close to official release
         * Important improvements
      * CERN has upgraded already
   * GUMS has been a cornerstone of OSG sites for many years

---+++ Information systems

   * The [[https://wlcg-ops.web.cern.ch/wlcg-information-system-evolution][Information System TF]] explores simplifications for WLCG
   * BDII services may no longer be needed for WLCG
      * Still required for other VOs and EGI ops monitoring
   * Aiming for less work to support WLCG VOs

---+++ Configuration

   * Slow but steady move towards Puppet
      * YAIM still being used for many services at many sites
      * Some prefer Ansible, CFEngine, Chef, Quattor, Salt, ...
         * But they presumably know what they are doing!
   * Small sites also need help there
      * Shared modules?
         * Not a lot of reuse evidence yet?
      * DPM comes with a self-contained mini Puppet setup
         * An idea for other services?

---+++ Simplified deployment

   * Can we provide VM images that require little local configuration?
      * The easiest test case may well be the Squid for CVMFS
      * Complex services would benefit more
         * But is the idea realistic for them?
   * Would containers be better?
      * Could even work at sites without a (compatible) cloud infrastructure
   * Ready-made (HW +) SW solutions?
      * Deployed in a DMZ like perfSONAR
      * Remotely operated by experts

---++++ Selected items from US-ATLAS talk: Ubiquitous Edge Platform

[[http://indico.cern.ch/event/394782/contributions/2154320/attachments/1271438/1884305/UbiCI_-_GDB.pdf][talk]]

   * Cyber-infrastructure "substrate" within trusted zones
      * Remotely deploy services on top
   * Science DMZ edge container platform HW spec
   * Some security and network questions not yet fully resolved

---++++ Selected items from US-CMS talk: Tier-3 in a box

[[http://indico.cern.ch/event/394782/contributions/2154321/attachments/1271116/1883641/T3inaboxGDBApril10th2016.pdf][talk]]

   * Host to submit to local and remote resources
   * Plus CVMFS
   * Plus Xrootd cache & server
      * Both connected to the global federation
   * In use at 5 Univ. of California campuses

---+++ Better documentation

   * Online video tutorials?
   * Or rather improve the textual documentation?
      * Screenshots in some places might be sufficient
   * Cost-benefit analysis?

---+++ Monitoring

   * Sites ought to have local fabric monitoring
      * Nagios, Ganglia, ...
   * SAM test failures ought to raise local alerts
      * And be understandable, well-documented
      * Supported for Nagios and used by some sites
   * Experiment-specific monitoring could support alert subscriptions (see the sketch below)
      * RSS, Atom, e-mail
      * !MonALISA does that
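To illustrate the alert-subscription idea, here is a minimal sketch that polls a hypothetical RSS/Atom alert feed with the Python =feedparser= module; the feed URL is a placeholder, not an actual experiment endpoint.

<verbatim>
#!/usr/bin/env python
"""Minimal sketch of subscribing to monitoring alerts via RSS/Atom.

The feed URL is a placeholder; any experiment dashboard exposing an
RSS or Atom feed of alerts for a given site could be used instead.
"""
import time
import feedparser

FEED_URL = 'https://monitoring.example.org/alerts/MY-SITE.rss'  # placeholder
seen = set()

while True:
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Use the entry id (or link) to avoid reporting the same alert twice.
        key = entry.get('id', entry.get('link'))
        if key not in seen:
            seen.add(key)
            print(entry.get('title', 'untitled alert'), '->', entry.get('link', ''))
    time.sleep(300)  # poll every 5 minutes
</verbatim>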
---+++ Experiment approaches and views

   * Natural for dedicated sites
   * Try to identify / encourage commonalities?

---++++ Selected items from Alessandra's talk

[[http://indico.cern.ch/event/394782/contributions/2154316/attachments/1271350/1884260/20160511_GDB_lightweight_v2.pdf][talk]]

   * Experiments want less variety to deal with
      * But still be able to run "anywhere"
   * Some MW choices are better fits to the needs of a particular experiment
   * Developments in experiments may help
      * Example: the ATLAS Event Service allows reduced !QoS requirements on sites

---++ Summary by Aug 2016

   * %GREEN% *Reduce* %BLACK% the set of services required per site
      * Some services might no longer be needed at all
      * Small sites could provide a partial portfolio
         * And still get proper recognition
   * %GREEN% *Replace* %BLACK% complex grid-specific services with simpler and/or common technologies
      * Local cloud setups can be used directly
      * Complex aspects can be handled by remote experts
         * Site admins may not need to acquire grid expertise
   * %GREEN% *Simplify* %BLACK% deployment of whatever remains
      * Easy installers for grid / cloud services
      * Science DMZ: only the HW must be managed locally
   * Experiment computing model %GREEN% *evolution* %BLACK% can help
      * Expect lower !QoS from most of the sites
   * We aim for a matrix of %GREEN% *recipes* %BLACK%
      * Some may prevail, while others fade away

---++ Followup in Ops Coordination

In Oct and Nov 2016 a *questionnaire* was filled out by a representative set of EGI sites. The results were [[http://indico.cern.ch/event/540424/contributions/2194899/subcontributions/212150/attachments/1381450/2100252/LW-sites-161201-v11.pdf][presented]] in the Dec Ops Coordination [[http://indico.cern.ch/event/540424/][meeting]]. The largest benefits are expected to come from %BLUE% *shared repositories* %BLACK% for:

   * !OpenStack images
   * Docker containers (see the sketch below)
   * Puppet modules
      * One already exists: https://github.com/HEP-puppet
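To illustrate how a shared container repository could simplify deployment (e.g. for the CVMFS Squid case under "Simplified deployment" above), here is a minimal sketch using the Python =docker= SDK to pull and start a ready-made image; the registry, image name and settings are placeholders, not an endorsed WLCG image.

<verbatim>
#!/usr/bin/env python
"""Minimal sketch: run a ready-made service container from a shared registry.

Assumes the Docker daemon is running and the `docker` Python SDK is
installed; the image name and port numbers are placeholders only.
"""
import docker

IMAGE = 'registry.example.org/wlcg/cvmfs-squid:latest'  # placeholder image

client = docker.from_env()

# Pull the image from the shared registry and start it with the Squid
# proxy port exposed; configuration would normally be mounted or injected.
client.images.pull(IMAGE)
container = client.containers.run(
    IMAGE,
    name='cvmfs-squid',
    detach=True,
    ports={'3128/tcp': 3128},        # standard Squid proxy port
    restart_policy={'Name': 'always'},
)
print('Started container', container.short_id)
</verbatim>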