YAIM cluster configuration: phase 2

Introduction

This wiki page describes the steps needed to test the new Cluster node type, which contains the configuration of the GlueCluster and GlueSubCluster entities. In phase 2 the Cluster node is on a different host from the lcg CE.

Configuration changes

  • yaim lcg ce:
    • The changes in the lcg CE that allow for the Cluster node come into effect when the variable LCGCE_CLUSTER_MODE=yes is set, selecting the so-called "cluster mode". If the cluster mode is not selected, the lcg CE configuration and operation remain the same; it is assumed that there is no Cluster node, either on the CE itself or on another machine. Conversely, if set up in cluster mode, a Cluster node will need to be installed.
    • The following functions are no longer called as part of the lcg CE configuration when used in cluster mode:
      • config_gip_software_plugin
      • config_vo_tag_dir
      • config_info_service_rtepublish
    • config_gip_ce no longer performs the GlueCluster and GlueSubCluster configuration.
  • yaim cluster:
    • For software tag publication, sgm users need to be able to gridftp to the Cluster node and therefore the following functions are needed:
      • config_sysconfig_globus
      • config_sysconfig_edg
      • config_crl
      • config_host_certs
      • config_users
      • config_edgusers
      • config_mkgridmap
      • config_vomsmap
      • config_globus_gridftp
      • config_lcas_lcmaps_gt4
      • config_vomsdir
      • config_add_pool_env
    • The Cluster node type needs to be published as a service in the information system. The following functions are needed to do this:
      • config_bdii_only
      • config_gip_only
      • config_info_service_rtepublish
      • config_gip_service_release
    • The following functions, which create the necessary directories for SW tags and publish them in the information system, are now run on the Cluster node:
      • config_vo_tag_dir
      • config_gip_software_plugin
    • The following functions now configure the GlueCluster and GlueSubCluster entities:
      • config_cluster
      • config_subcluster
      • config_ldif_cluster
  • yaim torque server:
    • it hasn't been changed. While the lcg CE has been changed to use new variables, the torque server still uses the old ones, so new and old variables will have to coexist at this point.
  • yaim torque utils:
    • it hasn't been changed; it is not affected by the new cluster configuration.
  • yaim clients (affecting the WN):
    • A new function, config_wn_info, has been added.

Installation changes

Created using Steve's feedback in the Test Rig Log.

  • glite-CLUSTER
    • In order to allow sgm users to gridftp to the Cluster host, the following packages are needed:
      • vdt_globus_data_server
      • vdt_globus_essentials
      • edg-mkgridmap
      • edg-mkgridmap-conf
      • lcg-expiregridmapdir
      • glite-initscript-globus-gridftp
      • glite-security-lcas-lcmaps-gt4-interface
      • glite-security-lcas
    • These packages are not in the original list provided by Steve, but we had to add them to resolve missing dependencies or to make the gridftp server work properly:
      • glite-security-lcmaps
      • glite-security-voms-api-c
      • glite-security-voms-api-cpp
      • glite-security-lcas-interface
      • glite-security-lcas-plugins-basic
      • glite-security-lcas-plugins-voms
      • glite-security-lcmaps-plugins-basic
      • glite-security-lcmaps-plugins-voms
      • gridsite-shared
    • In order to run a resource BDII:
      • bdii
      • glite-info-generic
      • glite-info-templates
      • glue-schema
    • In order to have the necessary information providers and plugins:
      • glite-info-provider-service
      • glite-info-provider-release
      • lcg-info-dynamic-software
      • lcg-info-provider-software
    • And finally, some packages for yaim:
      • glite-yaim-core (a special version created to fix some minor bugs detected while configuring the cluster node type).
      • glite-version
      • And also, glite-yaim-cluster, a new package to configure the Cluster node type.

  • lcg-CE
    • A new version of the lcg-ce yaim package has been created (v5.1.0-1 and above) which contains the configuration changes to operate in either non-cluster or cluster mode. In cluster mode the new Cluster node type is required, either on the CE host or on another machine.

Installation instructions

Clean installation

In order to test the new cluster configuration, you should install the following metapackages:

Host 1
--------
lcg-CE
glite-TORQUE_server
glite-TORQUE_utils

Host 2
--------
glite-CLUSTER

Host 3
-------
glite-WN
glite-TORQUE_client

On all three hosts you need to install the lcg-CA. On Host 1 and Host 2, make sure you copy the host certificates of the machine into /etc/grid-security.
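
A minimal sketch of that step (assuming the usual hostcert.pem/hostkey.pem file names and permissions):

cp hostcert.pem hostkey.pem /etc/grid-security/
chmod 644 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem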

For all the machines:

  • If you use a special CA for test user certificates, remember to install the CA files as well.
  • If you use a VOMS server different from the production ones, remember to install the VOMS server certificate.

On Host 1, you should install the metapackages from the Production repository as usual. You will also need to either configure a site BDII yourself or get an existing site BDII to publish your CE and Cluster services.

Then you should upgrade the glite-yaim-lcg-ce rpm by running:

rpm -U /afs/cern.ch/project/gd/www/yaim/testing/cluster-testing/glite-yaim-lcg-ce-5.0.0-1.noarch.rpm

On Host 2, you should install the cert-glite-CLUSTER metapackage. To install this metapackage, use the following repo file:

[cert-glite-CLUSTER]
name=gLite 3.1 cert-glite-CLUSTER service
baseurl=http://grid-deployment.web.cern.ch/grid-deployment/yaim/testing/cluster-testing/cluster_repo
enabled=1

This repo file can be downloaded from cert-glite-CLUSTER.repo.
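
For example, a sketch assuming the standard yum repository layout:

cp cert-glite-CLUSTER.repo /etc/yum.repos.d/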

And then run:

yum install cert-glite-CLUSTER

On Host 3, you should install a WN and the Torque clients from the Production repository as usual.

Now follow the configuration instructions.

Upgrade

In Phase 2, you should only test the upgrade on the lcg CE host and on the WN host, since glite-CLUSTER is new.

Host 1
--------
lcg-CE
glite-TORQUE_server
glite-TORQUE_utils

Host 3
-------
glite-WN
glite-TORQUE_client

On Host 1, you should install the metapackages from the Production repository as usual. You can also install a glite-BDII if you want to run a site BDII.

Run yaim to configure your services:

./yaim -c -s site-info.def -n lcg-CE (-n BDII_site) -n TORQUE_server -n TORQUE_utils

Then you should upgrade the glite-yaim-lcg-ce rpm by running:

rpm -U /afs/cern.ch/project/gd/www/yaim/testing/cluster-testing/glite-yaim-lcg-ce-5.0.0-1.noarch.rpm

On Host 3, you should install a WN and the Torque clients from the Production repository as usual.

Run yaim to configure the client:

./yaim -c -s site-info.def -n WN -n TORQUE_client

Now follow the configuration instructions.

Configuration instructions

Since there is a set of new variables, you will need to update your usual site-info.def:

lcg CE variables for the cluster mode

To enable the cluster mode, the following variable and value must be set:

LCGCE_CLUSTER_MODE=yes

Mandatory variables for the lcg CE in cluster mode: You'll find them under /opt/glite/yaim/examples/services/lcg-ce:

The new variable names follow this syntax:

  • In general, in variables based on hostnames, queues or VOViews, the characters '.' and '-' should be transformed into '_'
  • <host-name>: identifier that corresponds to the CE hostname in lower case. Example: ctb-generic-1.cern.ch -> ctb_generic_1_cern_ch
  • <queue-name>: identifier that corresponds to the queue in upper case. Example: dteam -> DTEAM
  • <voview-name>: identifier that corresponds to the VOView id in upper case. '/' and '=' should also be transformed into '_'. Example: /dteam/Role=admin -> DTEAM_ROLE_ADMIN
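
Putting these rules together, a full variable name is formed by concatenating the transformed identifiers. A hypothetical example for CE host ctb-generic-1.cern.ch, queue dteam and VOView /dteam/Role=admin:

CE_HOST_ctb_generic_1_cern_ch_QUEUE_DTEAM_VOVIEW_DTEAM_ROLE_ADMIN_CE_StateWaitingJobs=666666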

| Variable Name | Description | Value type | Version |
| CE_HOST_<host-name>_CLUSTER_UniqueID | UniqueID of the cluster the CE belongs to | string | glite-yaim-lcg-ce 4.0.5-1 |
| CE_InfoApplicationDir | Prefix of the experiment software directory in a site. This variable has been renamed in the new infosys configuration; the old variable name was VO_SW_DIR. This parameter can be defined per CE, queue, site or VOView. See /opt/glite/yaim/examples/services/lcg-ce for examples. | string | glite-yaim-lcg-ce 4.0.5-1 |
| CE_CAPABILITY | A space separated list; each item will be published as a GlueCECapability attribute. It must include a CPUScalingReferenceSI00 value and may also need to include Share values. It can be defined per CE, queue or site. See /opt/glite/yaim/examples/services/lcg-ce for an example of a queue specific setting. An example site wide value is also set in site-info.def; this should be edited, or commented out and alternative value(s) set in services/lcg-ce. | string | glite-yaim-lcg-ce 5.0.3-1 |
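
A minimal sketch of these settings in site-info.def (the hostname and values are hypothetical):

# the cluster this CE belongs to; must match the Cluster node configuration
CE_HOST_ctb_generic_1_cern_ch_CLUSTER_UniqueID=mycluster
# prefix of the experiment software directory (old VO_SW_DIR)
CE_InfoApplicationDir=/opt/exp_soft
# each item is published as a GlueCECapability attribute
CE_CAPABILITY="CPUScalingReferenceSI00=1000"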

The following variables will in the future be distributed in site-info.def, since they affect other yaim modules. At the moment we are in a transition phase, migrating to the new variable names.

| Variable Name | Description | Value type | Version |
| CE_HOST_<host-name>_CE_TYPE | CE type: 'jobmanager' for the lcg CE and 'cream' for the cream CE | string | glite-yaim-lcg-ce 4.0.5-1 |
| CE_HOST_<host-name>_QUEUES | Space separated list of the queue names configured in the CE. This variable has been renamed in the new infosys configuration; the old variable name was QUEUES. | string | glite-yaim-lcg-ce 4.0.5-1 |
| CE_HOST_<host-name>_QUEUE_<queue-name>_CE_AccessControlBaseRule | Space separated list of FQANs and/or VO names which are allowed to access the queues configured in the CE. This variable has been renamed in the new infosys configuration; the old variable name was <queue-name>_GROUP_ENABLE. | string | glite-yaim-lcg-ce 4.0.5-1 |
| CE_HOST_<host-name>_CE_InfoJobManager | The name of the job manager used by the gatekeeper. This variable has been renamed in the new infosys configuration; the old variable name was JOB_MANAGER. Please define one of: lcgpbs, lcglsf, lcgsge or lcgcondor. | string | glite-yaim-lcg-ce 4.0.5-1 |
| JOB_MANAGER | The old variable is still needed since config_jobmanager in yaim core hasn't been modified to use the new variable. To be done. | string | OLD variable |

When using yaim-core >= 4.0.13, the OLD variables JOB_MANAGER, <queue-name>_GROUP_ENABLE and QUEUES will be set (or reset) to the values of the new replacement variables listed above. With prior versions, both the new and the old style variables need to be set consistently.
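
A sketch of the transition in site-info.def, for a hypothetical host vtb-generic-17.cern.ch with a single dteam queue (with yaim-core < 4.0.13 the two styles must be kept consistent by hand):

# old style, still read by yaim core and the torque server module
QUEUES="dteam"
DTEAM_GROUP_ENABLE="dteam"
JOB_MANAGER=lcgpbs

# new style, read by the lcg CE module in cluster mode
CE_HOST_vtb_generic_17_cern_ch_QUEUES="dteam"
CE_HOST_vtb_generic_17_cern_ch_QUEUE_DTEAM_CE_AccessControlBaseRule="dteam"
CE_HOST_vtb_generic_17_cern_ch_CE_InfoJobManager=lcgpbs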

Default variables for the lcg CE: You'll find them under:

  • /opt/glite/yaim/defaults/lcg-ce.pre:

It contains a list of CE_* variables with some default values. These are the Glue schema properties belonging to the ComputingElement and VOView entities. By default, these variables are specified per CE, but they can also be specified per queue or per VOView, depending on whether we want all the VOViews of a queue to share a specific value or a certain VOView to have a specific value. For example, if I define in site-info.def:

# In the CE vtb-generic-17.cern.ch, in the queue dteam, in the VOView dteam,
# I want the default value of StateWaitingJobs to be 666666
CE_HOST_vtb_generic_17_cern_ch_QUEUE_DTEAM_VOVIEW_DTEAM_CE_StateWaitingJobs=666666

Or I can also define:

# In the CE vtb-generic-17.cern.ch, in the queue dteam, in all the supported VOViews,
# I want the default value of StateWaitingJobs to be 666666
CE_HOST_vtb_generic_17_cern_ch_QUEUE_DTEAM_CE_StateWaitingJobs=666666

If none of the above is defined, the default value for the whole CE, defined in /opt/glite/yaim/defaults/lcg-ce.pre, is taken.

The variables that can be redefined per CE-queue are:

CE_VAR="
CAPABILITY
ImplementationName
ImplementationVersion
InfoGatekeeperPort
InfoLRMSType
InfoLRMSVersion
InfoJobManager
InfoApplicationDir
InfoDataDir
InfoDefaultSE
InfoTotalCPUs
StateEstimatedResponseTime
StateRunningJobs
StateStatus
StateTotalJobs
StateWaitingJobs
StateWorstResponseTime
StateFreeJobSlots
StateFreeCPUs
PolicyMaxCPUTime
PolicyMaxObtainableCPUTime
PolicyMaxRunningJobs
PolicyMaxWaitingJobs
PolicyMaxTotalJobs
PolicyMaxWallClockTime
PolicyMaxObtainableWallClockTime
PolicyPriority
PolicyAssignedJobSlots
PolicyMaxSlotsPerJob
PolicyPreemption"

The variables that can additionally be redefined per CE-queue-VOView are:

VOVIEW_VAR="
StateRunningJobs
StateWaitingJobs
StateTotalJobs
StateFreeJobSlots
StateEstimatedResponseTime
StateWorstResponseTime
InfoDefaultSE
InfoApplicationDir
InfoDataDir
"

If the Glue schema supports variables other than the ones defined here, you can add new ones by redefining CE_VAR and/or VOVIEW_VAR in site-info.def. YAIM uses the list of variables contained in CE_VAR and VOVIEW_VAR to create the ldif file.
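
For example, a sketch that appends a hypothetical attribute name (the .pre defaults are sourced before site-info.def, so CE_VAR is already defined at this point):

# StateNewAttribute is a hypothetical Glue attribute used for illustration
CE_VAR="${CE_VAR}
StateNewAttribute"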

  • /opt/glite/yaim/defaults/lcg-ce.post:

It defines some auxiliary variables:

| Variable Name | Description | Value type | Default Value | Version |
| CE_ImplementationVersion | The version of the implementation. This should probably be in .pre instead of .post. | version | 3.1 | glite-yaim-lcg-ce 4.0.5-1 |
| CE_InfoLRMSType | Type of the underlying Resource Management System | string | ${CE_BATCH_SYS} | glite-yaim-lcg-ce 4.0.5-1 |
| STATIC_CREATE | Path to the script that creates the ldif file | path | ${INSTALL_ROOT}/glite/sbin/glite-info-static-create | glite-yaim-lcg-ce 4.0.5-1 |
| TEMPLATE_DIR | Path to the ldif templates directory | path | ${INSTALL_ROOT}/glite/etc | glite-yaim-lcg-ce 4.0.5-1 |
| CONF_DIR | Path to the temporary configuration directory | path | ${INSTALL_ROOT}/glite/var/tmp/gip | glite-yaim-lcg-ce 4.0.5-1 |
| LDIF_DIR | Path to the ldif directory | path | ${INSTALL_ROOT}/glite/etc/gip/ldif | glite-yaim-lcg-ce 4.0.5-1 |
| GlueCE_ldif | Path to the GlueCE ldif file | path | ${LDIF_DIR}/static-file-CE.ldif | glite-yaim-lcg-ce 4.0.5-1 |
| GlueCESEBind_ldif | Path to the GlueCESEBind ldif file | path | ${LDIF_DIR}/static-file-CESEBind.ldif | glite-yaim-lcg-ce 4.0.5-1 |

Cluster variables

Mandatory variables for the cluster: You'll find them under /opt/glite/yaim/examples/services/glite-cluster:

The new variable names follow this syntax:

  • In general, in variables based on hostnames, queues or VOViews, the characters '.' and '-' should be transformed into '_'
  • <host-name>: identifier that corresponds to the CE hostname in lower case. Example: ctb-generic-1.cern.ch -> ctb_generic_1_cern_ch
  • <cluster-identifier>: identifier that corresponds to the cluster identifier in upper case. Example: my_cluster -> MY_CLUSTER
  • <subcluster-identifier>: identifier that corresponds to the subcluster identifier in upper case. Example: my_subcluster -> MY_SUBCLUSTER

| Variable Name | Description | Value type | Version |
| CLUSTER_HOST | Hostname where the cluster is configured | hostname | glite-yaim-cluster 1.0.0-2 |
| CLUSTERS | Space separated list of your cluster identifiers, e.g. "cluster1 [cluster2 [...]]". The identifiers are only used within yaim configuration files. | string list | glite-yaim-cluster 1.0.0-1 |
| CLUSTER_<cluster-identifier>_CLUSTER_UniqueID | Cluster UniqueID. It may contain alphanumeric characters, dot, dash and underscore only. Upper case will be changed to lower case. | string | glite-yaim-cluster 1.0.0-1 |
| CLUSTER_<cluster-identifier>_CLUSTER_Name | Cluster human readable name | string | glite-yaim-cluster 1.0.0-1 |
| CLUSTER_<cluster-identifier>_SITE_UniqueID | Name of the site the cluster belongs to. It should be consistent with your SITE_NAME variable. NOTE: this may be changed to SITE_UniqueID when the GlueSite is configured with the new infosys variables. | string | glite-yaim-cluster 1.0.0-1 |
| CLUSTER_<cluster-identifier>_CE_HOSTS | Space separated list of CE hostnames configured in the cluster | hostname list | glite-yaim-cluster 1.0.0-1 |
| CLUSTER_<cluster-identifier>_SUBCLUSTERS | Space separated list of your subcluster identifiers, e.g. "subcluster1 [subcluster2 [...]]". The identifiers are only used within yaim configuration files. | string list | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_SUBCLUSTER_UniqueID | Subcluster UniqueID. It may contain alphanumeric characters, dot, dash and underscore only. Upper case will be changed to lower case. | string | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ApplicationSoftwareRunTimeEnvironment | List of software tags separated by a vertical bar; old CE_RUNTIMEENV | string list | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ArchitectureSMPSize | old CE_SMPSIZE | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ArchitecturePlatformType | old CE_OS_ARCH | string | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_BenchmarkSF00 | old CE_SF00 | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_BenchmarkSI00 | old CE_SI00 | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_MainMemoryRAMSize | old CE_MINPHYSMEM | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_MainMemoryVirtualSize | old CE_MINVIRTMEM | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_NetworkAdapterInboundIP | old CE_INBOUNDIP | boolean | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_NetworkAdapterOutboundIP | old CE_OUTBOUNDIP | boolean | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_OperatingSystemName | old CE_OS | OS name | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_OperatingSystemRelease | old CE_OS_RELEASE | OS release | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_OperatingSystemVersion | old CE_OS_VERSION | OS version | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ProcessorClockSpeed | old CE_CPU_SPEED | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ProcessorModel | old CE_CPU_MODEL | string | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ProcessorOtherDescription | old CE_OTHERDESCR | string | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_HOST_ProcessorVendor | old CE_CPU_VENDOR | string | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_SUBCLUSTER_Name | Subcluster human readable name | string | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_SUBCLUSTER_PhysicalCPUs | old CE_PHYSCPU | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_SUBCLUSTER_LogicalCPUs | old CE_LOGCPU | number | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_SUBCLUSTER_TmpDir | tmp directory | path | glite-yaim-cluster 1.0.0-1 |
| SUBCLUSTER_<subcluster-identifier>_SUBCLUSTER_WNTmpDir | WN tmp directory | path | glite-yaim-cluster 1.0.0-1 |
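
A minimal site-info.def sketch defining one cluster with one subcluster (all hostnames, identifiers and values are hypothetical):

CLUSTER_HOST=cluster.example.org
CLUSTERS="cluster1"
CLUSTER_CLUSTER1_CLUSTER_UniqueID=cluster1-example
CLUSTER_CLUSTER1_CLUSTER_Name="Example cluster"
CLUSTER_CLUSTER1_SITE_UniqueID=EXAMPLE-SITE
CLUSTER_CLUSTER1_CE_HOSTS="ce.example.org"
CLUSTER_CLUSTER1_SUBCLUSTERS="subcluster1"
SUBCLUSTER_SUBCLUSTER1_SUBCLUSTER_UniqueID=subcluster1-example
SUBCLUSTER_SUBCLUSTER1_SUBCLUSTER_Name="Example subcluster"
SUBCLUSTER_SUBCLUSTER1_SUBCLUSTER_PhysicalCPUs=10
SUBCLUSTER_SUBCLUSTER1_SUBCLUSTER_LogicalCPUs=20
SUBCLUSTER_SUBCLUSTER1_SUBCLUSTER_TmpDir=/tmp
SUBCLUSTER_SUBCLUSTER1_SUBCLUSTER_WNTmpDir=/tmp
SUBCLUSTER_SUBCLUSTER1_HOST_ApplicationSoftwareRunTimeEnvironment="LCG-2 | GLITE-3_1_0"
SUBCLUSTER_SUBCLUSTER1_HOST_ArchitectureSMPSize=2
SUBCLUSTER_SUBCLUSTER1_HOST_OperatingSystemName=ScientificSL
SUBCLUSTER_SUBCLUSTER1_HOST_OperatingSystemRelease=4.7
SUBCLUSTER_SUBCLUSTER1_HOST_OperatingSystemVersion="SL"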

Default variables for the cluster: You'll find them under:

  • /opt/glite/yaim/defaults/glite-cluster.pre:

It contains the list of variables that can be configured per Subcluster. They belong to the Host and Subcluster entities in the Glue schema:

HOST_VAR="
ApplicationSoftwareRunTimeEnvironment
ArchitectureSMPSize
ArchitecturePlatformType
BenchmarkSF00
BenchmarkSI00
MainMemoryRAMSize
MainMemoryVirtualSize
NetworkAdapterInboundIP
NetworkAdapterOutboundIP
OperatingSystemName
OperatingSystemRelease
OperatingSystemVersion
ProcessorClockSpeed
ProcessorModel
ProcessorOtherDescription
ProcessorVendor"

SUBCLUSTER_VAR="
Name 
UniqueID 
PhysicalCPUs 
LogicalCPUs 
TmpDir 
WNTmpDir"

If the Glue schema supports variables other than the ones defined here, you can add new ones by redefining HOST_VAR and/or SUBCLUSTER_VAR in site-info.def. YAIM uses the list of variables contained in HOST_VAR and SUBCLUSTER_VAR to create the ldif file.
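
As with the lcg CE lists, a sketch appending a hypothetical attribute name (the .pre defaults are sourced before site-info.def, so HOST_VAR is already defined):

# ProcessorNewAttribute is a hypothetical Glue attribute used for illustration
HOST_VAR="${HOST_VAR}
ProcessorNewAttribute"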

  • /opt/glite/yaim/defaults/glite-cluster.post:

It defines some auxiliary variables:

| Variable Name | Description | Value type | Default Value | Version |
| STATIC_CREATE | Path to the script that creates the ldif file | path | ${INSTALL_ROOT}/glite/sbin/glite-info-static-create | glite-yaim-cluster 1.0.0-1 |
| TEMPLATE_DIR | Path to the ldif templates directory | path | ${INSTALL_ROOT}/glite/etc | glite-yaim-cluster 1.0.0-1 |
| CONF_DIR | Path to the temporary configuration directory | path | ${INSTALL_ROOT}/glite/var/tmp/gip | glite-yaim-cluster 1.0.0-1 |
| LDIF_DIR | Path to the ldif directory | path | ${INSTALL_ROOT}/glite/etc/gip/ldif | glite-yaim-cluster 1.0.0-1 |
| GlueCluster_OUTFILE | Path to the temporary file that is used to create the ldif file | path | ${CONF_DIR}/glite-info-static-cluster.conf | glite-yaim-cluster 1.0.0-1 |
| GlueCluster_ldif | Path to the GlueCluster ldif file | path | ${LDIF_DIR}/static-file-Cluster.ldif | glite-yaim-cluster 1.0.0-1 |

Torque server variables

Since I haven't modified the code of this yaim module, the following variables are still needed, even though there are new variables replacing them:

  • QUEUES
  • <queue-name>_GROUP_ENABLE
  • CE_SMPSIZE

WN variables

The WN configuration remains the same. However, check the new WN_LIST syntax to be able to define the Subcluster the WN belongs to. Use only the identifiers you have defined in CLUSTER_<cluster-identifier>_SUBCLUSTERS.
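
A hypothetical sketch of such a WN_LIST file, assuming each line carries the WN hostname followed by the subcluster identifier (check the WN_LIST documentation for the exact syntax):

wn01.example.org subcluster1
wn02.example.org subcluster1
wn03.example.org subcluster2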

YAIM command

Once you have defined all the needed variables, configure the different hosts by running:

Host 1
-------
./yaim -c -s site-info.def -n lcg-CE (-n BDII_site) -n TORQUE_server -n TORQUE_utils

Host 2
-------
./yaim -c -s site-info.def -n glite-CLUSTER 

Host 3
--------
./yaim -c -s site-info.def -n WN -n TORQUE_client

What to test

Deployment tests

  • Installation tests: clean installation (only needed for cert-glite-CLUSTER). Done by Tomasz
  • Installation tests: upgrade (only needed for lcg-CE).
  • Configuration tests: check that YAIM configures all the node types without any problems. In the case of the cluster, start by defining only one cluster and one subcluster. Done by Tomasz

Basic tests

  • Basic job submission and lcg-CE testsuite (contact Gianni for this). Done by Tomasz
  • Check that the Cluster and the lcg-CE publish correctly in the information system. Done by Tomasz
  • Define one subcluster and one cluster, publish a tag in the subcluster and define a job to match this tag. Make sure the job is successfully executed. Done by Tomasz

Advanced tests

  • Add two subclusters. Publish tags to each of them with lcg-tags. Submit jobs matching the different tags and make sure they execute on the correct WN.
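
A hedged sketch of one such iteration (the tag name and host are hypothetical, and the exact lcg-tags options may differ on your installation):

# publish a tag for the dteam VO on the cluster node
lcg-tags --ce cluster.example.org --vo dteam --add --tags VO-dteam-mytag

In the JDL, the job can then be matched to the tag with a Requirements expression such as Member("VO-dteam-mytag", other.GlueHostApplicationSoftwareRunTimeEnvironment).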

Feedback

  • Is it easy to use the new variables?
  • Comments on the complexity of the new way to configure the information system.
  • Reports on bugs and other issues.

-- DavidSmith - 13-Jan-2011
