(2007-10-08, HarryRenshall)
---+ ScFourServiceTechnicalFactors

This chapter provides additional data on the components, capacities and constraints involved in delivering the Service Challenge 4 (SC4) services. While this is independent of the service level requirements, the technical implementation of a product may influence whether or how a particular service level is achieved for that product.

%TOC%

---++ SC4 High Level Architecture and Flows

SC4 consists of a set of services organised into 5 major groups. The architectures of <a href="https://edms.cern.ch/file/498079//LCG-mw.pdf">LCG-2</a> and <a href="https://edms.cern.ch/file/476451/1.0/architecture.pdf">EGEE</a> can be used as references.

| *Grouping* | *Products Covered* | *Description* | *Flows* |
| WMS | RB<br>CE<br>GRPK | Workload Management System | WmsFlows |
| DMS | SE<br>FTS<br>LFC | Data Management System | |
| IS | EGEE.BDII | Information System | |
| AAS | PX<br>VOMS | Authentication and Authorisation Services | |
| MS | MONB<br>GRVW<br>SFT | Monitoring System | |

---++ Product Components

Each product consists of a set of software implementing the function, plus dependent middleware. The dependent middleware must be running for the service to function. This table defines the components for each product as they are to be installed at CERN. Networking and Linux are assumed. Other sites' implementations may differ depending on local skills and policies.
| *Grouping* | *Product* | *Notes* | *Implementation* | *Database* | *Web Server* | *LDAP* | *!GridFTP* |
| WMS | RB | RbNotes | RbWlcg | !MySQL | | GRIS:2135 | Yes:2811 |
| WMS | CE | CeNotes | CeWlcg | !MySQL (empty) | | GRIS | Yes:2811 |
| WMS | GRPK | | | | | | |
| DMS | SE | | | | | | Yes:2811 |
| DMS | FTS | FtsNotes | FtsWlcg | Oracle | Tomcat5 | GRIS:2135 | |
| DMS | LFC | LfcNotes | <a href="http://agenda.cern.ch/askArchive.php?base=agenda&categ=a057817&id=a057817s1t4/moreinfo">Memorandum</a> LfcWlcg | Oracle | | | |
| IS | EGEE.BDII | BdiiNotes | BdiiWlcg | | | GIIS<br>EGEE.BDII | |
| AAS | PX | PxNotes | PxWlcg | | | Yes:2135 | |
| AAS | VOMS | VomsNotes | VomsWlcg | Oracle | Apache<br>Tomcat | | |
| MS | RGMA | RgmaNotes | | !MySQL | | | |
| MS | SFT | | | !MySQL<br>Oracle | | | |
| MS | GRVW | | | | Apache | | |

---++ Service Capacity Data

The product capacity indicates the current requirements for each product. For the number of machines in each category, see <a href="#Detailed_Configuration">Detailed Configuration</a>.

|*Service*|*Memory (GB)*|*File Storage (GB)*|*Oracle (GB)*|*Criticality*|
|RBP | 2 |__9000__ | |C |
|PXP | _2_ |_40_ | |C |
|BDIIP | _2_ |_80_ | |C |
|BDIIL | _2_ |_80_ | |H |
|BDIIE | _2_ |_80_ | |C |
|CEP | _2_ |_80_ | |C |
|RGMAP | 4 | _160_ | |M |
|MONBP | 4 | | |M |
|GRVWP | | | |M |
|SFTP | 4 | _160_ | |M |
|GRPKP | | | |M |
|VOMSP | _4_ | _80_ | |C |
|LFCP-ALICE| 4 | |1000 |H |
|LFCP-ATLAS| 4 | |1000 |H |
|LFCP-CMS | 4 | |1000 |H |
|LFCP-LHCB| 4 | |1000 |C |
|FTSP | 4 | |1000 |C |
|CTRGP | | | |C |

where
   * File storage includes local files and !MySQL data (stored in local databases).

%X% No test services have been discussed. These are not within the scope of SC4 but will be required for the complete production service.

---++ Product Availability Data

Many grid services record state information as part of their operations.
A failure of a component, or a failover within a high availability configuration, may lead to a loss of state data. This section covers which state data must be kept in shared storage if a high availability configuration is selected to reach the required service levels.

|*Product*|*HA Approach*|*Impact of Downtime*|*File State Data*|*Database State Data*|
|RB | [[RbNotes#High_Availability][Filesystem Takeover]]| | | |
|PX | [[PxNotes#High_Availability][Application Replication]] |Long-running jobs cannot renew their proxy and fail. Users cannot create new proxies. FTS transfers are suspended.| | |
|EGEE.BDII | [[BdiiNotes#High_Availability][Multiple independent instances with DNS round robin]]|No automatic failover to external BDIIs if the CERN site is down. Some sites have their own BDIIs. State (4MB) is kept in memory and on disk.| | |
|CE | [[CeNotes#High_Availability][Filesystem Takeover]]|New jobs cannot be submitted to run at the site. Status of jobs running at the site will not be reported.| | |
|RGMA | | | | |
|MONB | |Monitoring data is permanently lost.| | |
|GRVW | | | | |
|SFT | |Site status cannot be monitored. New or fixed sites cannot join. Broken sites will not be detected.| | |
|GRPK | |Job output cannot be viewed by users.| | |
|VOMS |[[VomsWlcgHa][Master/Slave]] with IP address takeover |VOMS permissions are allocated with a lifetime of 24 hours. 90 minutes before expiration, a renewal is attempted. Therefore, after 90 minutes of downtime, 5% of jobs will fail every hour.| | |
|LFC |DNS Round Robin| | | File Catalog |
|FTS |DNS Round Robin|Loss of a single machine does not affect the service. No file transfers initiated by the site are performed if the entire service is down.| | |
|CTRG |DNS Round Robin|Loss of a single machine does not affect the service. No access to Castor if the entire service is down.|None|None|

where
   * Impact of downtime defines the result of the machine not being available (for example during a reboot or a repairable hardware problem).
   * Stateful (the File State Data and Database State Data columns) defines whether the server running the product requires state information which may be lost if the data storage devices fail.

The High Availability solution portfolio is described in ScFourHighAvailabilityPortfolio.

---++ Product Network Data

Each product has its own network requirements regarding
   * Network capacity (High > 100 Mbit/s, Medium > 10 Mbit/s, Low < 10 Mbit/s).
   * External network accessibility (outgoing means low ports are protected by the firewall; incoming means low ports are accessible by all services).
   * Aliases supported (can the service be identified using a network alias which is different from the hostname of the machine). OK means aliases are supported. LB means that aliases are supported and load balanced (i.e. a list of machines can be given). NO means aliases are not supported.
   * Ports In provides the list of ports > 1024 for which connectivity is required.

|*Product*|*Capacity*|*Accessibility*|*Aliases*| *Ports In* |
|RB | M | Incoming | | |
|PX | L | Incoming | | |
|EGEE.BDII | L | Outgoing | LB | 2170 |
|CE | L | Incoming | | |
|RGMA | M | Incoming | | |
|MONB | M | Incoming | | |
|ARCH | M | Incoming | | |
|GRVW | M | Incoming | | |
|SFT | M | Incoming | | |
|GRPK | M | Outgoing | | |
|VOMS | M | Outgoing | | |
|LFC | H | Incoming | | |
|FTS | H | Incoming | | |
|CTRG | H | Outgoing | | |

---++ Detailed Configuration

#UidGid
---+++ Server process accounts registered names and uid/gid values

The following server account names and uid and gid values have been reserved in the CERN central account registration database (CRA) to prevent them being used by other users or groups. The values installed on a server are not taken from CRA but from a SINDES database managed by the FIO group.
Neither the service group name nor the associations between the uids and gids are stored in CRA.

|*Service account*|*uid*|*gid*|*Service group name*|*CRA group name*|*Primary or secondary group*|
|edguser |17680 |2747 |edguser |g01 |Primary|
|edguser |17680 |2761 |infosys |g15 |Secondary|
|edginfo |17695 |2748 |edginfo |g02 |Primary|
|edginfo |17695 |2761 |infosys |g15 |Secondary|
|rgma |17696 |2749 |rgma |g03 |Primary|
|rgma |17696 |2761 |infosys |g15 |Secondary|
|dpmmgr |17697 |2750 |dpmmgr |g04 |Primary|
|lfcmgr |17700 |2751 |lfcmgr |g05 |Primary|
|ceuser |17719 |2752 |ceuser |g06 |Primary|
|condor |17728 |2753 |condor |g07 |Primary|
|wmsuser |17856 |2754 |wmsgroup |g08 |Primary|
|hacluser |11774 |2755 |haclient |g09 |Primary|
|gridview |15257 |2756 |gridview |g10 |Primary|
|glite |21086 |2757 |glite |g11 |Primary|

Here are the /etc/passwd lines:

<verbatim>
edguser:x:17680:2747::/home/edguser:/bin/bash
edginfo:x:17695:2748::/home/edginfo:/bin/bash
rgma:x:17696:2749:RGMA user:/opt/edg/etc/rgma:/bin/bash
dpmmgr:x:17697:2750:DPM manager:/home/dpmmgr:/bin/bash
lfcmgr:x:17700:2751:LFC manager:/home/lfcmgr:/bin/bash
ceuser:x:17719:2752::/home/ceuser:/bin/bash
condor:x:17728:2753::/home/condor:/bin/bash
wmsuser:x:17856:2754::/home/wmsuser:/bin/bash
hacluser:x:11774:2755::/home/hacluser:/bin/bash
gridview:x:15257:2756::/home/gridview:/bin/bash
glite:x:21086:2757::/home/glite:/bin/bash
</verbatim>

And the lines in /etc/group:

<verbatim>
edguser:x:2747:
edginfo:x:2748:
rgma:x:2749:
dpmmgr:x:2750:
lfcmgr:x:2751:
ceuser:x:2752:
condor:x:2753:
wmsgroup:x:2754:
haclient:x:2755:
gridview:x:2756:
glite:x:2757:
infosys:x:2761:rgma,edginfo,edguser
</verbatim>

#SamAccounts
---+++ Reserved usernames for specific (local) services, as reserved in CRA:

|*Service account*|*uid*|*gid*|*Owner*|
|samops |23550 | 1028 | Judit Novak |
|samdteam |23551 | 1028 | Judit Novak |
|samatlas |23552 | 1028 | Piotr Nyczyk |
|samcms |23554 | 1028 | Andrea Sciaba |
|samalice |23763 | 1028 | Patricia Mendez |
|dirac |25133 | 1470 | Joel Closier |
|jabber |25134 | 1470 | Joel Closier |
|tomcat | none | 1028 | Production Grid-Service|
|mysql | none | 1028 | Production Grid-Service|
|atlsrv |28475| 1028 (local 1307) | Production Grid-Service|

---+++ Critical and High Services

|*Service*|*Masters*|*Passive*|*Clones*|*Spares*|*FCports*|*Comment*|
|RB-ALICE |1|0|0|0|1|Spare shared with RBP-PROD|
|RB-ATLAS |1|0|0|0|1|Spare shared with RBP-PROD|
|RB-CMS |1|0|0|0|1|Spare shared with RBP-PROD|
|RB-LHCB |1|0|0|0|1|Spare shared with RBP-PROD|
|RB-PROD |1|0|0|1|2| |
|PX |2|0|2| |2|Replicated|
|BDIIL |1|1|0|0| |LCG EGEE.BDII|
|BDIIP |1|1|0|0| |PROD EGEE.BDII (CERN Site)|
|BDIIE |1|1|0|0| |Experiment EGEE.BDII|
|CE |1|1|0|0|2| |
|VOMS |2|1|0|0|0| |
|FTS |7|0|0|2| |Spare shared between VOs|
|LFC-LHCB |2|0|0|0| |Spare shared between VOs|
|LFC-ALICE|1|0|0|0| |Spare shared between VOs|
|LFC-ATLAS|1|0|0|0| |Spare shared between VOs|
|LFC-CMS |1|0|0|0| |Spare shared between VOs|
|LFC-SHARED |1|0|0|0| |Shared server for other VOs|
|LFC-PROD |1|0|0|0| |Backup LFC server for all|
|GRVW |1|1|0|0| |Grid View|

Thus
   * %CALC{"$SUM(R2:C2..R$ROW(0):C2)"}% spaces for master mid range servers in the LCG area
   * %CALC{"$SUM(R2:C3..R$ROW(0):C5)"}% spaces for backup/slaves in the LCG area
   * %CALC{"$SUM(R2:C6..R$ROW(0):C6)"}% fibre channel ports required

The full list of machines is therefore

|*Machine*|*Service*|*CDB Cluster*|*Purpose*|*Area*|*Config*|*Comment*|
|bdii001|BDIIL|gridbdii|LCG EGEE.BDII Master|UPS|Basic Midrange Server|In prod. To be logically moved to LCG|
|bdii002|BDIIL|gridbdii|LCG EGEE.BDII Backup|UPS|Basic Midrange Server|In prod. To be logically moved to LCG|
|bdii101|BDIIL|gridbdii|LCG EGEE.BDII Master| |Basic Midrange Server|Switch1. Add to load balancing then stop bdii001. Priority 1|
|bdii102|BDIIL|gridbdii|LCG EGEE.BDII Backup| |Basic Midrange Server|Switch2. Add to load balancing then stop bdii002. Priority 1|
|bdii103|BDIIP|gridbdii|Site EGEE.BDII Master| |Basic Midrange Server|Switch2. Priority 1|
|bdii104|BDIIP|gridbdii|Site EGEE.BDII Backup| |Basic Midrange Server|Switch1. Priority 1|
|bdii105|BDIIE|gridbdii|Experiment EGEE.BDII Master| |Basic Midrange Server|Switch1. Priority 1|
|bdii106|BDIIE|gridbdii|Experiment EGEE.BDII Master| |Basic Midrange Server|Switch2. Priority 1|
|ce101|CEP|gridce|Production CE Master|NFC|Basic Midrange Server|Switch2. Leave unused ce001 in UPS area for now. Priority 2|
|ce102|CEP|gridce|Production CE Backup|NFC|Basic Midrange Server|Switch1. Priority 2|
|fts101|FTSP|gridfts|production FTS Transfer Agent Master| |Large Memory Midrange Server|Switch1. Priority 4|
|fts102|FTSP|gridfts|production FTS Transfer Agent Hot Spare| |Large Memory Midrange Server|Switch2. Priority 4|
|fts103|FTSP|gridfts|production FTS Web Server Master| |Large Memory Midrange Server|Switch1. lb name prod-ftsws. Priority 4|
|fts104|FTSP|gridfts|production FTS Web Server Master| |Large Memory Midrange Server|Switch2. lb name prod-ftsws. Priority 4|
|fts105|FTSP|gridfts|production FTS Alice agent| |Basic Midrange Server|Switch1. alias prod-ftsvo-alice. Priority 4|
|fts106|FTSP|gridfts|production FTS Atlas agent| |Basic Midrange Server|Switch2. alias prod-ftsvo-atlas. Priority 4|
|fts107|FTSP|gridfts|production FTS CMS agent| |Basic Midrange Server|Switch1. alias prod-ftsvo-cms. Priority 4|
|fts108|FTSP|gridfts|production FTS LHCB agent| |Basic Midrange Server|Switch2. alias prod-ftsvo-lhcb. Priority 4|
|fts109|FTSP|gridfts|production experiment agent Hot Spare| |Basic Midrange Server|Switch1. Priority 4|
|grvw001|GRVWP|gridgrvw|production GRIDVIEW Web server| |Basic Midrange Server|Switch1. Priority 5|
|grvw002|GRVWP|gridgrvw|production GRIDVIEW data mining server| |Basic Midrange Server|Switch2. Priority 5|
|lfc101|LFC-LHCB|gridlfc|production LHCb LFC| |Basic Midrange Server|Switch1. alias prod-lfc-lhcb. Priority 7|
|lfc102|LFC-LHCB|gridlfc|production LHCb LFC Backup| |Basic Midrange Server|Switch2. Priority 7|
|lfc103|LFC-ALICE|gridlfc|production Alice LFC| |Basic Midrange Server|Switch2. alias prod-lfc-alice. Priority 7|
|lfc104|LFC-ATLAS|gridlfc|production Atlas LFC| |Basic Midrange Server|Switch2. alias prod-lfc-atlas. Priority 7|
|lfc105|LFC-CMS|gridlfc|production CMS LFC| |Basic Midrange Server|Switch2. alias prod-lfc-cms. Priority 7|
|lfc106|LFCP|gridlfc|production shared LFC| |Basic Midrange Server|Switch1. alias prod-lfc-shared. Priority 7|
|lfc107|LFCP|gridlfc|production LFC backup| |Basic Midrange Server|Switch1. Priority 7|
|rb101|RB-ALICE|gridrb|RB for Alice|NFC|Extra disk Midrange Server|Switch1. Priority 8|
|rb102|RB-ATLAS|gridrb|RB for Atlas|NFC|Extra disk Midrange Server|Switch1. Priority 8|
|rb103|RB-CMS|gridrb|RB for CMS|NFC|Extra disk Midrange Server|Switch1. Priority 8|
|rb104|RB-LHCB|gridrb|RB for LHCB|NFC|Extra disk Midrange Server|Switch1. Priority 8|
|rb105|RB-PROD|gridrb|RB for other VOs|NFC|Extra disk Midrange Server|Switch1. Priority 8|
|rb106|RB-PROD|gridrb|RB spare|NFC|Extra disk Midrange Server|Switch2. Priority 8|
|px101|PXP|gridpx|Production !MyProxy Master|NFC|Basic Midrange Server|Switch2. Priority 3|
|px102|PXP|gridpx|Production !MyProxy Slave|NFC|Basic Midrange Server|Switch1. Priority 3|
|px103|PXP|gridpx|Production !MyProxy Master for FTS| |Basic Midrange Server|Switch2. Priority 3|
|px104|PXP|gridpx|Production !MyProxy Slave for FTS| |Basic Midrange Server|Switch1. Priority 3|
|voms101|VOMSP|gridvoms|Production VOMS Master|NFC|Large Memory Midrange Server|Switch1. Priority 6|
|voms102|VOMSP|gridvoms|Production VOMS Slave|NFC|Large Memory Midrange Server|Switch2. Priority 6|
|voms103|VOMSP|gridvoms|Production VOMS ldap publisher|NFC|Basic Midrange Server|Switch2. Priority 6|

   * A Basic Midrange Server has 2GB memory and 160GB internal mirrored disk. A larger configuration would also be acceptable.
   * A Large Memory Midrange Server has the same configuration as a Basic Midrange Server but with 4GB memory.
   * An Extra disk Midrange Server has the same configuration as a Basic Midrange Server but with two extra 250GB disks run mirrored.
   * The Resource Brokers should have extra disks. Once the SAN infrastructure is in place, the plan is to replace the first servers with reclaimed tape servers, which have extra memory and an HBA installed and will not need the extra disks.
   * UPS means that the machine is in the diesel-backed critical area.
   * NFC means that the machine needs to be near a fibre channel switch in the LCG network area.
   * Priority 1 is highest. Items up to priority 5 should be completed in 2005.

---++ Service Class Criteria

|*Attribute*|*Class U*|*Class L*|*Class M*|*Class H*|*Class C*|
| *Facilities* ||||||
|Controlled physical access| | |Badge|Badge|Badge|
|Power into Data Centre| | | |Redundant|Redundant|
| *Physical* ||||||
|Power connection on UPS<br>If HA, only 1 machine required on UPS| | | |Yes|Yes|
|Machine in rack | | |Yes|Yes|Yes|
| *Hardware* ||||||
|Redundant power supply in PC | | | |Yes|Yes|
|Internal system disks mirrored| | |Yes|Yes|Yes|
|Console remotely accessible | |Yes|Yes|Yes|Yes|
| *Storage* ||||||
|Minimum RAID Levels for data | |5|5|5 |5 |
|Redundant Controllers / Paths | | | |Yes|Yes|
| *Backup* ||||||
|Off-site copies of backup data| | | | | |
|Yearly backup/restore test | | | | | |
| *Networking* ||||||
|Redundant network cards| | | | | |
| *Monitoring* ||||||
| Status command for each component | | |Yes|Yes|Yes|
| Automatic Event reported to console if component down | | |Yes|Yes|Yes|
| *Configuration* ||||||
| Automatic configuration from database/xml | | | |Yes|Yes|
| *High Availability* ||||||
| Standby Levels | | |Cold|Warm|Hot|
| Procedures for failover | | |Administrator|Operator|Automatic|

---++ Product Evaluation

In order to assess which technical factors may cause problems in delivering the requested quality of service, the
ScFourTechnicalQuestionnaire has been written. These questions allow an assessment of the readiness of the application and infrastructure to provide the requested service level. The current servers involved in delivering the service are defined at PreSc4ServersInfo.

---++ Issues

The following items have been raised as part of the evaluation of the technical solutions.

%EDITTABLE{ header="|*Nr*|Description|*Status*|*Open Date*|*Who*|*Log*|" format="| row, -1 | text, 35, init | select, 1, open, inprogress, closed | date | text, 12 | text, 20 |"}%
|*Nr*|*Description*|*Status*|*Open Date*|*Who*|*Log*|
| 1 | Service definition for !MySQL | inprogress | 2005/09/15 | Bernd | IssueMySQLService |
| 2 | RB disk space estimates are very large | inprogress | 2005/09/15 | Maarten | IssueRbDiskSpace |

---++ Assumptions

To accelerate the definition of the services, the fabric team has made some assumptions. This section documents them.

%EDITTABLE{ header="|*Nr*|Description|*Status*|*Open Date*|*Who*|*Log*|" format="| row, -1 | text, 35, init | select, 1, open, inprogress, closed | date | text, 12 | text, 20 |"}%
|*Nr*|*Description*|*Status*|*Open Date*|*Who*|*Log*|
| 1 | EGEE.BDII is outgoing connectivity only | closed | 2005/09/22 | Tim | Port 2170 required which is covered by outgoing connectivity |
| 2 | CE has !MySQL installed and an empty database. | closed | 2005/09/22 | Tim | Database created by install |
| 3 | CEmon is not included in SC4 | closed | 2005/09/26 | Maarten | To be reviewed in Dec 2005 |
| 4 | myproxy does not require external connectivity for low ports | open | 2005/09/26 | Tim | Need to identify contact |
| 5 | myproxy data is all stored in /var/myproxy | closed | 2005/09/26 | Tim | Review replication procedure with NCSA developers. |

-- Main.TimBell - 05 Sep 2005
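The /etc/passwd and /etc/group lines listed under Detailed Configuration are maintained by hand, so a small consistency check may be useful before they are installed. The following is a minimal Python sketch and not part of any CERN tooling: the function names =parse_passwd=, =parse_group= and =check= are hypothetical, and the sample data is only a subset of the lines above. It assumes the standard 7-field passwd and 4-field group layouts, and reports primary gids that have no group entry and group members that have no passwd entry.

```python
from collections import namedtuple

# Hypothetical record type for one passwd line (password field dropped).
PasswdEntry = namedtuple("PasswdEntry", "name uid gid gecos home shell")


def parse_passwd(text):
    """Parse passwd-style lines; each must have exactly 7 colon-separated fields."""
    entries = []
    for line in text.strip().splitlines():
        fields = line.strip().split(":")
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        name, _pw, uid, gid, gecos, home, shell = fields
        entries.append(PasswdEntry(name, int(uid), int(gid), gecos, home, shell))
    return entries


def parse_group(text):
    """Parse group-style lines into {gid: (group_name, [members])}."""
    groups = {}
    for line in text.strip().splitlines():
        name, _pw, gid, members = line.strip().split(":")
        groups[int(gid)] = (name, [m for m in members.split(",") if m])
    return groups


def check(passwd_text, group_text):
    """Return a list of human-readable problems; empty when consistent."""
    entries = parse_passwd(passwd_text)
    groups = parse_group(group_text)
    known_users = {e.name for e in entries}
    problems = []
    for e in entries:
        if e.gid not in groups:
            problems.append(f"{e.name}: primary gid {e.gid} missing from group file")
    for _gid, (gname, members) in groups.items():
        for member in members:
            if member not in known_users:
                problems.append(f"{gname}: member {member} has no passwd entry")
    return problems


# Sample subset of the lines shown under Detailed Configuration.
PASSWD = """
edguser:x:17680:2747::/home/edguser:/bin/bash
edginfo:x:17695:2748::/home/edginfo:/bin/bash
rgma:x:17696:2749:RGMA user:/opt/edg/etc/rgma:/bin/bash
"""
GROUP = """
edguser:x:2747:
edginfo:x:2748:
rgma:x:2749:
infosys:x:2761:rgma,edginfo,edguser
"""

if __name__ == "__main__":
    for problem in check(PASSWD, GROUP):
        print(problem)
```

A passwd line with a missing field, such as one without the (possibly empty) comment field, makes =parse_passwd= raise =ValueError= rather than silently shifting the home directory into the wrong column.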