New WLCG IS Roadmap

Please note that this twiki is now obsolete. Please refer to this twiki for up-to-date information.

Roadmap

The following roadmap will be presented and discussed at the IS TF meeting on the 8th of January.

  • Implement the new WLCG IS based on AGIS. The new IS will consume GLUE2 attributes, but only the subset needed by WLCG. This subset will be clearly documented and well defined so that all sys admins and MW developers understand what is required.
  • The WLCG IS will query several sources of information. Where a BDII is needed, it will query the resource BDII directly. In this way, WLCG will stop relying on top and site BDIIs.
  • New services do not need to deploy a BDII in order to provide resource information. They could provide the needed GLUE2 subset through other means (e.g. OSG will provide GLUE2 JSON).
  • REBUS will remain as a WLCG project office tool to collect MoU sites, their pledges and accounting. Installed capacity information will be removed from REBUS and included in the new IS.

Definitions

In order to improve the information published in the new WLCG IS, clear definitions are needed. The following definitions have been proposed by the Information System Task Force. Feedback from site admins is being collected:

Proposal sent to LCG-ROLLOUT:

  • GLUE2ExecutionEnvironmentLogicalCPUs: the number of processors in one Execution Environment instance which may be allocated to jobs. Typically the number of processors seen by the operating system on one Worker Node (that is the number of "processor :" lines in /proc/cpuinfo on Linux), but potentially set to more or less than this for performance reasons.

  • GLUE2BenchmarkValue: the average HS06 benchmark when a benchmark instance is run for each processor which may be allocated to jobs. Typically the number of processors which may be allocated corresponds to the number seen by the operating system on the worker node (that is the number of "processor :" lines in /proc/cpuinfo on Linux), but potentially set to more or less than this for performance reasons.
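
The "processor :" counting rule used in both definitions above can be sketched in Python as follows. This is only an illustration, not part of the proposal: the function names and the sample text are hypothetical, and a site may still publish a higher or lower value for performance reasons.

```python
def count_logical_cpus(cpuinfo_text: str) -> int:
    """Count the 'processor :' lines in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        # Only lines whose key is exactly "processor" are counted, so
        # entries such as "cpu cores" or "model name" are ignored.
        if sep and key.strip() == "processor":
            count += 1
    return count


def read_logical_cpus(path: str = "/proc/cpuinfo") -> int:
    """Apply the same count to the file on a Linux worker node."""
    with open(path, encoding="utf-8") as handle:
        return count_logical_cpus(handle.read())


# Hypothetical excerpt of /proc/cpuinfo for a two-processor node.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "cpu cores\t: 2\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Example CPU\n"
    "cpu cores\t: 2\n"
)

if __name__ == "__main__":
    print(count_logical_cpus(SAMPLE))
```

On a Linux worker node, read_logical_cpus() would give the "typical" value mentioned in the definition.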

Another proposal by Andrew after some discussion:

  • GLUE2ExecutionEnvironmentLogicalCPUs: the number of single-process benchmark instances run when benchmarking the Execution Environment, corresponding to the number of processors which may be allocated to jobs. Typically this is the number of processors seen by the operating system on one Worker Node (that is the number of "processor :" lines in /proc/cpuinfo on Linux), but potentially set to more or less than this for performance reasons. This value corresponds to the total number of processors which may be reported to APEL by jobs running in parallel in this Execution Environment, found by adding the values of the "Processors" keys in all of their accounting records.
  • GLUE2BenchmarkValue: the average benchmark when a single-process benchmark instance is run for each processor which may be allocated to jobs. Typically the number of processors which may be allocated corresponds to the number seen by the operating system on the worker node (that is the number of "processor :" lines in /proc/cpuinfo on Linux), but potentially set to more or less than this for performance reasons. This should be equal to the benchmark ServiceLevel in the APEL accounting record of a single-processor job, where the APEL "Processors" key will have the value 1.

Another proposal by Brian:

  • GLUE2ExecutionEnvironment: the hardware environment allocated by a single resource request.
  • GLUE2ExecutionEnvironmentLogicalCPUs: the number of single-process benchmark instances run when benchmarking the Execution Environment.
  • GLUE2BenchmarkValue: the average benchmark result when $(GLUE2ExecutionEnvironmentLogicalCPUs) single-threaded benchmark instances are run in the execution environment in parallel.
  • GLUE2ExecutionEnvironmentTotalInstances: the aggregate benchmark result of the computing resource divided by $(GLUE2BenchmarkValue).
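
The arithmetic in this last proposal can be worked through with a short Python sketch. All numbers below are invented for illustration, the function names are hypothetical, and the quotient is taken exactly as the proposal words it.

```python
from statistics import mean


def benchmark_value(instance_scores):
    """Average result of the single-threaded instances run in parallel."""
    if not instance_scores:
        raise ValueError("at least one benchmark instance is required")
    return mean(instance_scores)


def total_instances(aggregate_benchmark, per_instance_value):
    """Aggregate benchmark of the resource divided by GLUE2BenchmarkValue."""
    return aggregate_benchmark / per_instance_value


# Hypothetical results of four parallel single-threaded benchmark instances.
scores = [10.0, 9.0, 11.0, 10.0]

logical_cpus = len(scores)             # GLUE2ExecutionEnvironmentLogicalCPUs
value = benchmark_value(scores)        # GLUE2BenchmarkValue
total = total_instances(800.0, value)  # 800.0 is an invented aggregate result

if __name__ == "__main__":
    print(logical_cpus, value, total)
```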

Information Providers

The information providers publish information in an automatic way. How the relevant attributes are calculated by each IP should be aligned with the proposed definitions. The table below collects some information on how services are currently publishing some of these attributes:

| Service | Installed Capacities |
| dCache | Includes all pools (storage nodes) known to the system, whether or not they are enabled. |
| DPM | Includes the pools/disks configured by the site admin. If the admin disables a pool or disk, the installed capacity is updated via the information provider. |
| StoRM | Reports only the available space information from the underlying filesystem; there is no concept of resources in downtime visible to StoRM. |
| CASTOR/EOS (CERN) | Information can be retrieved with or without broken disks, depending on the query performed in CASTOR/EOS. |
| CREAM | Installed capacity is set statically at configuration time in the GlueSubClusterUniqueID/GlueSubClusterPhysicalCPUs, GlueHostArchitectureSMPSize and GlueSubClusterLogicalCPUs attributes, while available capacity is set dynamically in GlueCEUniqueID/GlueCeStateFreeCPUs, GlueCeStateFreeJobSlots and GlueCeInfoTotalCPUs. For example, the Torque infoprovider executes "pbsnodes -a" and counts all CPUs not in down, offline or unknown status to set GlueCeInfoTotalCPUs. |
| HTCondorCE | ? |
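
The Torque counting rule described in the CREAM row can be sketched in Python as follows. This is not the actual infoprovider code: the sample text imitates the usual layout of "pbsnodes -a" output (a node name line followed by indented "key = value" lines), and the node names, function name and the use of the np attribute as the CPU count are assumptions made for illustration.

```python
EXCLUDED_STATES = {"down", "offline", "unknown"}


def count_usable_cpus(pbsnodes_output: str) -> int:
    """Sum np over nodes whose state has no down/offline/unknown flag."""
    nodes = []
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = {}
            nodes.append(current)
            continue
        if current is None:
            continue
        key, sep, value = line.partition("=")
        if sep:
            current[key.strip()] = value.strip()

    total = 0
    for attrs in nodes:
        # A state can carry several comma-separated flags, e.g. "down,offline".
        states = {s.strip() for s in attrs.get("state", "unknown").split(",")}
        if states & EXCLUDED_STATES:
            continue
        total += int(attrs.get("np", "0"))
    return total


# Hypothetical "pbsnodes -a" excerpt with three worker nodes.
SAMPLE = """\
wn001
     state = free
     np = 8

wn002
     state = down,offline
     np = 8

wn003
     state = job-exclusive
     np = 16
"""

if __name__ == "__main__":
    print(count_usable_cpus(SAMPLE))
```

In this sample wn002 is skipped, so only the CPUs of wn001 and wn003 contribute to the value that would be published as GlueCeInfoTotalCPUs.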

Information providers are described in detail under the IS providers twiki.

Site specific recipes

The information that is manually configured by site admins should also be aligned with the proposed definitions. The table below collects some information on how sys admins are currently configuring their sites:

| Sites | Service | Scripts | Notes |
| CERN-PROD | HTCondorCE | htcondorce-cern | Only 4 values are not generated dynamically by calling out to the HTCondor Pool Collector and the Compute Element Scheduler. These are HTCONDORCE_VONames = atlas, cms, lhcb, dteam, alice, ilc (shortened for brevity), HTCONDORCE_SiteName = CERN-PROD, HTCONDORCE_HEPSPEC_INFO = 8.97-HEP-SPEC06 and HTCONDORCE_CORES = 16. All our HTCondor worker nodes expose a hepspec fact. The HEPSPEC value on the CEs above is obtained by querying all these facts and averaging them. |
|  | CREAM CE | UpdateStaticInfo | Parses the LSF configuration file to extract capacities. |

GLUE 1 - GLUE 2 translator

In order to start consuming GLUE 2 information, a translator from GLUE 1 to GLUE 2 for the attributes published in the REBUS installed capacities is available in the table below:

| GLUE 1 Attribute | Definition | GLUE 2 Attribute | Definition |
| GlueHostProcessorOtherDescription: Benchmark |  | GLUE2BenchmarkValue | Benchmark value. |
| GlueSubClusterLogicalCPUs | The effective number of CPUs in the subcluster, including the effect of hyperthreading and the effects of virtualisation due to the queueing system. | GLUE2ExecutionEnvironmentLogicalCPUs | The number of logical CPUs in one Execution Environment instance, i.e. typically the number of cores per Worker Node. |
|  |  | GLUE2ExecutionEnvironmentTotalInstances | The total number of execution environment instances. This should reflect the total installed capacity, i.e. including resources which are temporarily unavailable. |
| GlueSESizeTotal |  | GLUE2StorageServiceCapacityTotalSize | The total amount of storage of the defined type. It is the sum of free, used and reserved space. |
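
A consumer could encode the attribute pairs from this table as a simple lookup, as in the Python sketch below. This is an illustration only: the names GLUE1_TO_GLUE2 and translate are hypothetical, values are copied unchanged, and it assumes the Benchmark item has already been extracted from GlueHostProcessorOtherDescription. GLUE2ExecutionEnvironmentTotalInstances has no GLUE 1 counterpart in the table, so it does not appear in the mapping.

```python
# Attribute name pairs taken from the translation table.
GLUE1_TO_GLUE2 = {
    "GlueHostProcessorOtherDescription": "GLUE2BenchmarkValue",
    "GlueSubClusterLogicalCPUs": "GLUE2ExecutionEnvironmentLogicalCPUs",
    "GlueSESizeTotal": "GLUE2StorageServiceCapacityTotalSize",
}


def translate(glue1_record):
    """Rename known GLUE 1 attributes; return (translated, untranslated)."""
    translated = {}
    untranslated = {}
    for name, value in glue1_record.items():
        target = GLUE1_TO_GLUE2.get(name)
        if target is None:
            # Attributes outside the table are kept apart, not dropped.
            untranslated[name] = value
        else:
            translated[target] = value
    return translated, untranslated


if __name__ == "__main__":
    # Hypothetical record; the values are invented for illustration.
    record = {"GlueSubClusterLogicalCPUs": 16, "GlueSESizeTotal": 1000}
    print(translate(record))
```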

-- MariaALANDESPRADILLO - 2015-11-24

Topic revision: r9 - 2016-09-09 - MariaALANDESPRADILLO
 