Administration Tools Requested by SA1

This is the detailed list of administration tools that the sites would like. It is being created directly as a consequence of item 14a in the SA1 list of issues for the TCG.

If you have YAIM-specific requests, please use YAIMsysadmins.

Num. ADDED BY DESCRIPTION
1 UFRJ/ROC CERN For me, the main issues from the TCG list are points 4, 13, 14A and 14B. Concerning 14A: a software pager to receive notifications from SAM; a tool to manage the SE (disk pools, etc.); a tool to manage the LRMS (Torque in our case), allowing insertion and removal of hosts, queue configuration and scheduling configuration; automated service checklists for independent services, to be run from cron jobs, mailing the site admin if any service is unavailable.
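The last request above, an automated service checklist run from cron, can be sketched in a few lines of Python. The service map, ports and admin address below are hypothetical placeholders, not a site standard, and the notification step assumes a local MTA on port 25.

```python
# Minimal cron-driven service checklist (a sketch; service names, ports
# and the admin address are hypothetical placeholders).
import smtplib
import socket
from email.message import EmailMessage

SERVICES = {                      # name -> (host, port) to probe
    "BDII":    ("localhost", 2170),
    "GridFTP": ("localhost", 2811),
}

def is_up(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def failed_services(services):
    """Names of the services whose port does not answer."""
    return [name for name, (h, p) in services.items() if not is_up(h, p)]

def notify(admin, down):
    """Mail the site admin the list of unavailable services
    (assumes an MTA listening on localhost:25)."""
    msg = EmailMessage()
    msg["Subject"] = "Service check FAILED: " + ", ".join(down)
    msg["From"] = "servicecheck@" + socket.getfqdn()
    msg["To"] = admin
    msg.set_content("Unavailable services: " + ", ".join(down))
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
```

A cron job would call `failed_services(SERVICES)` and invoke `notify()` only when the returned list is non-empty.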
2 TRIUMF/ROC CERN As part of our 24x7 implementation, the TRIUMF Tier-1 site already has various tools in place, including a software pager that uses combined information from the SAM database and various Nagios sensors to alert site admins by email and also by dial-out/SMS. We don't think it is safe to use "automatic" and generic tools to deal with the storage system, especially within a dCache configuration, for example (every site has a very specific configuration). On the other hand, cleanup scripts would be welcome. One important use case is massive deletion of old datasets on all SEs along with the catalog entries. It is probably preferable that these tools are developed and implemented by the experiments and not be too generic. We will send more specific suggestions at a later stage. Regards, Reda
3 SFU/ROC CERN BDII entries management; FTS administration; simple queue administration (something easier than the Torque tools)
4 ROC SEE start/stop/status/restart for all MW services in /opt/edg/etc/init.d or /opt/glite/etc/init.d; start/stop/status/restart for all node types in /etc/init.d - this is the most important point, and those scripts must work (currently, e.g., the WMS/LB cannot be stopped successfully without manually killing left-over processes after gLite is stopped!)
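A wrapper along these lines could drive every init script found under the middleware directories. The directory list and the four actions come from the request above; everything else is an illustrative sketch.

```python
# Sketch of a node-wide start/stop/status/restart wrapper for all
# middleware init scripts (directories taken from the request above).
import os
import subprocess

SERVICE_DIRS = ["/opt/edg/etc/init.d", "/opt/glite/etc/init.d"]

def list_services(dirs=SERVICE_DIRS):
    """Paths of all executable init scripts in the given directories."""
    scripts = []
    for d in dirs:
        if not os.path.isdir(d):
            continue
        for name in sorted(os.listdir(d)):
            path = os.path.join(d, name)
            if os.path.isfile(path) and os.access(path, os.X_OK):
                scripts.append(path)
    return scripts

def control(action, dirs=SERVICE_DIRS):
    """Run '<script> <action>' for each service; return the exit codes,
    so a caller can spot services that failed to stop cleanly."""
    if action not in ("start", "stop", "status", "restart"):
        raise ValueError("unsupported action: " + action)
    return {s: subprocess.call([s, action]) for s in list_services(dirs)}
```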
5 ROC SEE automatic firewall configuration based on node types (including collocated nodes on the same machine!) and MW services
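One way to generate such a firewall configuration is from a node-type-to-ports map that gets merged over all node types collocated on the machine. The port assignments below are hypothetical placeholders, not the real middleware port list.

```python
# Sketch: derive iptables rules from the node types hosted on a machine.
# The node-type -> port map is an illustrative placeholder only.
NODE_PORTS = {
    "CE":   [2119, 2170],   # placeholder values
    "SE":   [2811, 8443],   # placeholder values
    "BDII": [2170],
}

def firewall_rules(node_types, port_map=NODE_PORTS):
    """One ACCEPT rule per port, merged over all collocated node types
    (ports shared by several node types appear only once)."""
    ports = sorted({p for t in node_types for p in port_map[t]})
    return ["iptables -A INPUT -p tcp --dport %d -j ACCEPT" % p
            for p in ports]
```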
6 ROC SEE add/remove/change/suspend/unsuspend a VO; suspend/unsuspend a user by DN (from a certain VO or globally); suspend/unsuspend a CA
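For the per-DN suspension part, a tool could simply comment out the matching grid-mapfile entries. The sketch below works on the file's lines, assuming the usual `"DN" account` format; the suspension marker is a hypothetical convention.

```python
# Sketch: suspend/unsuspend a user by DN in a grid-mapfile
# (lines of the form  "DN" account).
SUSPEND_TAG = "#SUSPENDED "          # hypothetical marker

def suspend_dn(lines, dn):
    """Comment out every mapfile entry for the given DN."""
    quoted = '"%s"' % dn
    return [SUSPEND_TAG + l if l.strip().startswith(quoted) else l
            for l in lines]

def unsuspend_dn(lines, dn):
    """Undo suspend_dn for the given DN."""
    quoted = SUSPEND_TAG + '"%s"' % dn
    return [l[len(SUSPEND_TAG):] if l.startswith(quoted) else l
            for l in lines]
```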
7 ROC SEE script that would allow fast change of core and site-wide services configured to be used on nodes (e.g. change of top-level BDII used on a certain node, or change of MON used on all nodes on a site), without the need to completely reconfigure the node
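Such a script could boil down to rewriting one variable in the node's configuration (e.g. a `site-info.def`-style `BDII_HOST=...` line) and then restarting only the affected service. The variable names in the example are illustrative.

```python
# Sketch: switch one configuration variable (e.g. the top-level BDII or
# the MON host used by a node) without a full node reconfiguration.
import re

def set_variable(config_text, name, value):
    """Replace every NAME=... line with NAME="value"; append the
    variable if it was not present at all."""
    pattern = re.compile(r"^%s=.*$" % re.escape(name), re.MULTILINE)
    replacement = '%s="%s"' % (name, value)
    new_text, n = pattern.subn(replacement, config_text)
    if n == 0:
        new_text = new_text.rstrip("\n") + "\n" + replacement + "\n"
    return new_text
```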
8 BEIJING-LCG2/ROC CERN In my daily work, I usually browse the web interface to check the dCache SE service, for instance http://{dcache_admin_hostname}:2288/, and use the Maui tools to add/remove/close/drain queued jobs.
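That manual page check is easy to automate; the sketch below probes the dCache admin web interface on port 2288 (as in the entry above) and reports whether it answered.

```python
# Sketch: check whether the dCache admin web interface answers,
# e.g. http://<dcache_admin_hostname>:2288/ as in the entry above.
import urllib.request

def dcache_web_ok(host, port=2288, timeout=5.0):
    """True if an HTTP GET to the admin interface returns 200."""
    url = "http://%s:%d/" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode() == 200
    except OSError:
        return False
```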
9 ROC CERN Easy altering of GlueCEStatus and GlueSEStatus.
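If the site publishes from a static LDIF file, this could be a one-line rewrite of the status attributes. The sketch below assumes the status values live on `GlueCEStateStatus`/`GlueSEStatus` lines of that file.

```python
# Sketch: flip the published CE/SE status (e.g. to "Draining" before a
# scheduled downtime) in a static-information LDIF file.
import re

_STATUS_ATTRS = r"^(GlueCEStateStatus|GlueSEStatus): \S+"

def set_glue_status(ldif_text, status):
    """Rewrite every CE/SE status attribute to the given value."""
    return re.sub(_STATUS_ATTRS, r"\1: " + status, ldif_text,
                  flags=re.MULTILINE)
```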
10 ROC IT WMS/LB
- tool to allow a simple way to open/close submission (for draining jobs) while maintaining the possibility to get status and output;
- tool to allow a simple way to add/ban VOs/users/VOMS groups;
- tool to remotely check the service status and the job load of the WMS/LB in a graphical (possibly web-based) way;
- define garbage-collection criteria for the WMS and LB;
- tools for cleaning sandboxes, old Condor files and old logs;
- tools for cleaning the LB database;
- monitoring of the information supermarket status;
- tools to easily extract the logging info and status of user jobs;
SE/LFC
- tool to check the consistency between the entries registered in a catalog and the files on the SEs;
- tool for bulk removal of files and entries on storage and catalog if an SE dies or if there are consistency problems;
- define garbage-collection criteria for files on an SE not registered in the catalog;
Evergreen issues
- common and standard logging system on all grid nodes/services;
- clean upgrade/downgrade of the grid release/updates on grid nodes;
We know that tools or scripts for doing at least some of the things listed above already exist, but we would like to have all of them clearly bundled in a single, well-maintained suite, possibly with a single access point.
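The catalog/SE consistency check requested in the SE/LFC list above reduces, once both file listings are extracted, to a set comparison. The sketch below takes the two listings as inputs and leaves the extraction itself (catalog dump, SE namespace listing) to site-specific tooling.

```python
# Sketch: catalog vs. SE consistency report. 'catalog_entries' and
# 'se_files' are path/SURL listings obtained by site-specific means
# (e.g. a catalog dump and an SE namespace listing).
def consistency_report(catalog_entries, se_files):
    """Split inconsistencies into dark data (on the SE but not in the
    catalog, candidates for garbage collection) and lost files (in the
    catalog but gone from the SE, candidates for entry removal)."""
    catalog, se = set(catalog_entries), set(se_files)
    return {
        "dark": sorted(se - catalog),
        "lost": sorted(catalog - se),
    }
```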

-- Main.thackray - 11 Jul 2007

Topic revision: r10 - 2007-10-12 - AlessandraForti