WLCG Operations effort review and study

This twiki gathers preliminary material to be used as input for discussion at the September 2014 WLCG MB. It analyses the main operational costs, in terms of effort, for sites and experiments, and tries to understand where that effort could be reduced and how.

General Feedback

ATLAS

  • More common projects shared between experiments would reduce the effort each experiment spends dealing with the same types of problems. This would also simplify life for sites, which may otherwise end up with four different requests to address the same issue.
  • Mature sites in the same country could be grouped together so that experiments interact with a single entity representing all of them instead of dealing with several sites individually. This is probably not an option for new sites or for countries with little grid experience, but it is something that experienced countries or federations could consider.

Computing

CE

ATLAS

  • Parameter passing in the CREAM CE, with proper handling in the batch system, would be very useful to avoid configuring several queues. Sites would need to configure fewer queues, and experiments would have less work tracking and dealing with them. It would be good to understand the status of this feature in CREAM, in the OSG CE and in the new HTCondor CE (the ARC CE already supports it), and also which batch systems do not handle parameter passing properly.
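
As a rough illustration of the parameter-passing idea above, the sketch below shows jobs with different resource needs going through one shared queue, with the requirements travelling with each job instead of being encoded in a dedicated queue. All names here (JobRequirements, to_batch_request, the "grid" queue and the dictionary keys) are hypothetical; they are not options of CREAM, ARC, HTCondor or any particular batch system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequirements:
    """Per-job resource request; field names are illustrative only."""
    cores: int = 1
    memory_mb: int = 2000
    walltime_min: int = 1440


def to_batch_request(req, queue="grid"):
    """Translate a job's own requirements into a generic batch request.

    The returned keys are hypothetical placeholders for whatever the CE
    would forward to the local batch system.
    """
    if req.cores < 1 or req.memory_mb < 1 or req.walltime_min < 1:
        raise ValueError("all requirements must be positive")
    return {
        "queue": queue,  # one shared queue instead of one per job type
        "cores": req.cores,
        "memory_mb": req.memory_mb,
        "walltime_min": req.walltime_min,
    }


# Three job types that would otherwise each need a dedicated queue.
jobs = {
    "single_core": JobRequirements(),
    "multi_core": JobRequirements(cores=8, memory_mb=16000),
    "high_memory": JobRequirements(memory_mb=6000),
}
requests = {name: to_batch_request(req) for name, req in jobs.items()}
queues_used = {r["queue"] for r in requests.values()}
print(sorted(queues_used))
```

With this approach the site maintains a single queue definition, and the batch system has to honour the per-job values, which is why proper handling on the batch side matters as much as support in the CE.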

CMS

  • Main effort is on queue configuration.

Sys admins

  • CREAM CEs are now stable and do not require much work from the sys admins' point of view.
  • Queue configuration for ATLAS was not easy at the beginning.

Batch

CMS

  • The existing framework interacts with the CE, although local users can submit directly to the batch system instead of going through the CMS framework. This requires batch system knowledge on the CMS side to support users, and coordination with PES to manage the CMS LSF shares.

Sys admins

  • Torque/MAUI is easy for basic configuration, but it has many configuration variables that are not well documented. Lack of support makes sites move to something else.
  • LSF very stable.

Cloud

ATLAS

  • Cloud resources have a cost. The fair share mechanism provided by the batch system is needed: with clouds, it is the experiment that has to manage things that are normally done by the batch sys admin. ATLAS is currently trying both production and analysis use cases, but configuration is done manually and no dynamic allocation of resources is currently possible.

CMS

  • Used for production jobs at T0. The CMS framework supports submission to OpenStack. It is not clear how to do resource provisioning, which is what the batch system does. Maybe priorities and shares would need to be defined in the experiment framework. Right now VMs are allocated statically.
  • Not foreseen for analysis use case for the time being.

Sys admins

  • Cloud installations are common in many sites, but are used as test installations, not for production activities
    • Lack of fair share mechanisms
    • Not all VOs ready to use cloud resources
  • The lack of a common recipe makes cloud configuration difficult for sites, since it requires a lot of manpower.
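
To make the fair-share concern above more concrete, here is a minimal sketch of the kind of bookkeeping a batch system normally does and that would have to live elsewhere (for example in the experiment framework) on a cloud without such a mechanism. The priority formula (target fraction minus recently used fraction) is a simplified assumption, not the algorithm of any specific batch system, and the share and usage numbers are made up for illustration.

```python
def fair_share_priorities(target_shares, recent_usage):
    """Return one priority per VO: target fraction minus used fraction.

    A positive value means the VO has recently used less than its share.
    This is a simplified, assumed formula for illustration only.
    """
    total_share = sum(target_shares.values())
    if total_share <= 0:
        raise ValueError("target shares must sum to a positive value")
    total_usage = sum(recent_usage.get(vo, 0.0) for vo in target_shares)
    priorities = {}
    for vo, share in target_shares.items():
        target = share / total_share
        if total_usage > 0:
            used = recent_usage.get(vo, 0.0) / total_usage
        else:
            used = 0.0  # nothing has run yet, so only the targets count
        priorities[vo] = target - used
    return priorities


def next_vo(target_shares, recent_usage):
    """Pick the VO that is furthest below its target share."""
    priorities = fair_share_priorities(target_shares, recent_usage)
    # sorted() makes the choice deterministic if two VOs are tied.
    return max(sorted(priorities), key=priorities.get)


# Made-up numbers: shares in percent, usage in core-hours.
shares = {"atlas": 50, "cms": 30, "lhcb": 20}
usage = {"atlas": 700, "cms": 250, "lhcb": 50}
priorities = fair_share_priorities(shares, usage)
chosen = next_vo(shares, usage)
print(chosen)
```

With static allocation of VMs none of this rebalancing happens, which is one way to read the feedback that priorities and shares might need to be defined in the experiment framework.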

Testing: HammerCloud

Sys admins

  • Very useful to sites, especially with emails sent to sys-admins

Data Management

Storage Systems (dCache, CASTOR/EOS, StoRM, DPM)

ATLAS

  • Being able to deal with several protocols is good, since each protocol suits different use cases (SRM for tape, HTTP for logging, etc.). It would be good to consolidate on a few of them that cover the use cases needed by the experiments.

CMS

  • Tape-disk separation already means less effort, because tape systems are only used as an archival system.
  • CMS doesn't rely on any specific product. From time to time there are issues with individual storage solutions (in the past, problems with protocols and CMS-specific libraries). One single solution would make life easier for experiments but would be less flexible for sites. For instance, EOS makes it easier to manage user quotas.
  • It would be good to move towards one single protocol, like xroot, for both remote and local access.
  • Federations would allow opportunistic resources or T3s with no storage to read data remotely.
  • There is no easy way to know how much data is used by users. CMS is currently developing its own data monitoring solution.

Sys admins

  • dCache is stable and doesn't incur major operational costs. Major upgrades sometimes cause problems that require site-specific tuning. The dCache community is very helpful in any case.
  • Native xrootd also works pretty well for ALICE. However, the site has no control over the data stored, and there is no good documentation on migrating data from one server to another when, e.g., old HW has to be decommissioned.
  • DPM working without problems.

Data Transfers (FTS)

ATLAS

  • Good example of a service that has been simplified: fewer service instances, migration from Oracle to MySQL.

Installing Experiment SW

CVMFS

CMS

  • Good example of a service that has simplified SW installation for the experiments.

Sys admins

  • Sys admins also happy with this system.

Security

ARGUS

CMS

  • Quite transparent for the experiment. Central banning service not used by CMS. In OSG, GUMS is used. Mapping could still be done using static gridmap files. Some CMS jobs have failed recently due to ARGUS.

Sys admins

  • Service quite stable now.

VOMS

MyProxy

Information Systems and accounting

BDII

APEL

Frontier

CMS

  • Fine as long as the Conditions DB is in Oracle. It uses standard technology, but currently runs an old version of Squid that needs to be updated.

Networking

ATLAS

  • Networking problems are difficult to diagnose. A procedure describing the steps to follow when diagnosing a network problem would be very useful. This has already been requested from Operations in the past. Availability of network monitoring data would also be very useful.
  • perfSONAR is far from being a reliable tool for understanding the status of the network at a site. It would be good to get this right and finish the work started by the perfSONAR TF.

Sys admins

  • Not easy to use for network monitoring.

Site Survey questions

| Question | Answers | Notes |
| Site organisation | | |
| What is the name of your site (it will remain confidential)? | site name | |
| What type of tier is your site? | 0, 1, 2 | |
| How many LHC VOs does your site support? | n | |
| How many non-LHC VOs does your site support? | n | |
| How much effort is spent in service operations and other activities? | FTE | |
| • Batch system | | |
| • Worker nodes | | |
| • Storage system | | |
| • Networking | | |
| • Computing Elements | | |
| • perfSONAR | | |
| • Local monitoring | | |
| • squid servers | | |
| • Argus | | |
| • Information system | | |
| • VO boxes | | |
| • Other Grid services (please specify) | | |
| • Providing support via tickets | | |
| • Experiment contacts | | |
| • WLCG meetings | | |
| • Active participation in WLCG task forces, working groups, etc. | | |
| • Testing new technologies | | |
| • Other WLCG-related tasks (please specify) | | |
| Service upgrades and changes | | |
| Do you think that the frequency of middleware releases is manageable for your site? | Not at all / Barely / Usually / Quite / Perfectly | |
| Are you satisfied with the support (including documentation, step-by-step instructions, etc.) you get from WLCG during service upgrades/changes? | Not at all / Slightly / Moderately / Quite / Extremely | |
| Is it easy to find the right documentation and repositories when you search for them? | Not at all / Slightly / Moderately / Quite / Extremely | |
| In which repositories is it most important to find the RPMs to install or upgrade a service (select at most two)? | EPEL / EMI / UMD / WLCG | |
| How difficult is it to perform standard upgrades from standard repositories? | Not at all / Slightly / Moderately / Quite / Extremely | |
| Do you have any comments or suggestions on how to improve service upgrade operations? | free text | |
| Communication | | |
| How important is it that requests originating from experiments are communicated via WLCG operations rather than by the experiments themselves? | Not at all / Slightly / Moderately / Quite / Extremely | |
| Do you think that communication between the site and WLCG operations is effective? | Not at all / Slightly / Moderately / Quite / Extremely | |
| What could be done to improve the communication between the site and WLCG operations? | free text | |
| Do you think that sharing of information across WLCG sites is effective? | Not at all / Slightly / Moderately / Quite / Extremely | |
| How would you improve the sharing of information across WLCG sites? | free text | |
| What are, or would be, your preferred channels to communicate with other sites (at most three choices)? | | |
| • Meetings | | |
| • Mailing lists | | |
| • Wiki or other web pages | | |
| • Web forums | | |
| • Other | | |
| If possible, provide examples for the selected answers | free text | |
| Does your site regularly follow the fortnightly WLCG operations coordination meeting? | Never / Rarely / Usually / Often / Always | |
| Does your site regularly read the minutes of the WLCG operations coordination meeting? | Never / Rarely / Usually / Often / Always | |
| What changes do you think would make the meeting more effective and interesting for you as a site? | free text | |
| Do you think that, overall, WLCG operations Task Forces and Working Groups are useful for your site? | Not at all / Slightly / Moderately / Quite / Extremely | |
| If your site is not involved in a TF or WG, please indicate the main reason(s) | free text | |
| Are you satisfied with GGUS as the official user support tool? | Not at all / Slightly / Moderately / Quite / Extremely | |
| What improvements would you like to see in GGUS? | free text | |
| When WLCG expects a certain action from a site (service upgrades, reconfiguration, etc.), what channels do you want to be used, in order of importance? | | |
| • WLCG broadcasts | [1, 2, 3] | |
| • GGUS tickets | [1, 2, 3] | |
| • Operations meetings | [1, 2, 3] | |
| Monitoring | | |
| Do you think that the results of the SAM tests are complete enough to assess the level of functionality of your site? | Not at all / Slightly / Moderately / Quite / Extremely | |
| Do you think, as a site, that the SAM tests are reliable in telling if something is working properly (e.g. negligible fractions of false positives or negatives)? | Not at all / Slightly / Moderately / Quite / Extremely | |
| Do you find SAM tests easy to understand and well documented? | Not at all / Slightly / Moderately / Quite / Extremely | |
| Is the output of a failed SAM test complete enough to understand the cause of a site problem? | Never / Rarely / Usually / Often / Always | |
| How do you usually find out that a SAM test is failing at your site? | | |
| • By receiving a ticket | | |
| • By periodically checking the SAM web page | | |
| • From an alarm from your local monitoring system, interfaced to SAM | | |
| • From WLCG availability/reliability reports | | |
| • Other (please specify) | | |
| What improvements to the SAM monitoring would your site like to be implemented? | free text | |
| Overall, how useful do you consider these types of site monitoring? | Not at all / Slightly / Moderately / Quite / Extremely | |
| • SAM | | |
| • HammerCloud | | |
| • Real production and analysis jobs | | |
| • Data transfer metrics | | |
| • Network monitoring | | |
| • Other (please specify) | | |
| Please describe below any ideas you may have to improve the site monitoring | | |
Grid service administration
Please rate how easy it is to perform the following operations in the administration of service X. 1 = very hard / 2 = somewhat hard / 3 = normal / 4 = rather easy / 5 = extremely easy
| Service | Accessing adequate documentation | First deployment | Service upgrades (including security patches) | Reconfigurations | Troubleshooting and fixing problems | Getting support from the developers |
| Batch system | | | | | | |
| Worker nodes | | | | | | |
| Storage system | | | | | | |
| Networking | | | | | | |
| Computing Elements | | | | | | |
| perfSONAR | | | | | | |
| Local monitoring | | | | | | |
| squid servers | | | | | | |
| Argus | | | | | | |
| Information system | | | | | | |
| VO boxes | | | | | | |

Site Survey Results

The results are collected in this page.

-- MariaALANDESPRADILLO - 22 Jul 2014

Topic revision: r18 - 2015-02-25 - AndreaSciaba
 