GLUE2 publication monitoring

This twiki tracks the effort of monitoring the quality of the GLUE 2 information published in the WLCG information system. Ongoing efforts are already looking into the information published by the information providers to detect bugs in the middleware that should be fixed by the middleware developers.

This twiki collects the monthly reports that evaluate the results of running glue-validator against WLCG sites.

Reports

Known Issues

Bug deleting GLUE 2 entries

As described in BUG:101237, a bug in the BDII code prevents it from deleting old GLUE 2 entries. This pollutes the information system with obsolete GLUE 2.0 objects. As a workaround, restart all resource BDIIs at the site, then remove the contents of /var/lib/bdii/gip/cache/gip/site-urls.conf-glue2 on the site BDII and restart the site BDII. Because of this bug, glue-validator raises many E002 errors.

Misconfiguration of load balanced services

It has been noted that sites fail to properly declare load-balanced services in the information system. All machines behind an alias must be declared as site resources using the machine hostname. For example, using YAIM for a load-balanced site BDII:

BDII_REGIONS="CE SE TOPBDII SITEBDII_1 SITEBDII_2"

...
SITEBDII_BDII_1_URL="ldap://<bdii1-hostname>:2170/mds-vo-name=resource,o=grid"
SITEBDII_BDII_2_URL="ldap://<bdii2-hostname>:2170/mds-vo-name=resource,o=grid"

SITE_BDII_HOSTNAME=<service-alias>

As a result, GLUE2ServiceID and GLUE2EndpointID should contain the real hostnames of the machines, while GLUE2EndpointURL should contain the DNS alias.
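
The rule above can be checked mechanically on a published entry. The following is a minimal sketch, not part of glue-validator: the function name, the example hostnames and the assumption that the IDs embed the hostname as a plain substring are all hypothetical.

```python
from urllib.parse import urlparse


def check_load_balanced_endpoint(entry, real_hosts, alias):
    """Return a list of problems found in one published endpoint entry.

    entry      -- dict with GLUE2ServiceID, GLUE2EndpointID and GLUE2EndpointURL
    real_hosts -- hostnames of the machines behind the alias
    alias      -- DNS alias of the load-balanced service
    """
    problems = []
    for attr in ("GLUE2ServiceID", "GLUE2EndpointID"):
        value = entry.get(attr, "")
        # Assumption: the ID embeds the hostname as a plain substring.
        if alias in value:
            problems.append(attr + " contains the alias")
        elif not any(host in value for host in real_hosts):
            problems.append(attr + " contains no real hostname")
    # The URL is the only attribute where the alias is expected.
    if urlparse(entry.get("GLUE2EndpointURL", "")).hostname != alias:
        problems.append("GLUE2EndpointURL does not use the alias")
    return problems


# Hypothetical hostnames, for illustration only.
HOSTS = ["bdii1.example.org", "bdii2.example.org"]
ALIAS = "bdii.example.org"
GOOD = {
    "GLUE2ServiceID": "bdii1.example.org_bdii_site_1",
    "GLUE2EndpointID": "bdii1.example.org_bdii_site_1",
    "GLUE2EndpointURL": "ldap://bdii.example.org:2170/mds-vo-name=EXAMPLE,o=grid",
}
BAD = {
    "GLUE2ServiceID": "bdii.example.org_bdii_site_1",
    "GLUE2EndpointID": "bdii.example.org_bdii_site_1",
    "GLUE2EndpointURL": "ldap://bdii1.example.org:2170/mds-vo-name=EXAMPLE,o=grid",
}
print(check_load_balanced_endpoint(GOOD, HOSTS, ALIAS))
print(check_load_balanced_endpoint(BAD, HOSTS, ALIAS))
```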

Failed hostname -f

When the command hostname -f fails on a machine, GLUE2ServiceID and GLUE2EndpointID are generated incorrectly. In most cases the failure is temporary: the command works again later and the IDs are then generated properly. Due to BUG:101237, however, the wrong entries stay in the system and are never deleted. The command below shows this problem for site BDIIs:

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b GLUE2GroupID=grid,o=glue '(&(objectClass=GLUE2Endpoint)(GLUE2EndpointID=_bdii-site_3536672524_bdii_site_3536672524))'  dn | grep GLUE2DomainID | cut -d"=" -f3 | cut -d"," -f1

This issue was reported in BUG:101562 and a fix has been released in EMI 2 and EMI 3.
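
The affected sites can also be extracted from saved ldapsearch output without relying on field positions. This is a rough sketch that assumes the output is LDIF with folded (continuation) lines and that the site name is the value of the GLUE2DomainID component of each DN; the sample DN is made up.

```python
import re


def unfold_ldif(text):
    """Join LDIF continuation lines (lines that start with one space)."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def domain_ids(ldif_text):
    """Return the sorted, de-duplicated GLUE2DomainID values found in dn: lines."""
    found = set()
    for line in unfold_ldif(ldif_text):
        if not line.startswith("dn:"):
            continue
        match = re.search(r"GLUE2DomainID=([^,]+)", line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# Made-up DN, folded over two lines as ldapsearch would print it.
SAMPLE = (
    "dn: GLUE2EndpointID=_bdii-site_3536672524_bdii_site_3536672524,GLUE2ServiceID=x\n"
    " ,GLUE2GroupID=resource,GLUE2DomainID=EXAMPLE-SITE,GLUE2GroupID=grid,o=glue\n"
    "\n"
)
print(domain_ids(SAMPLE))
```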

Huge amount of information published by GLUE2ApplicationEnvironment

MAY-22-2013:

In GLUE 2.0, we are currently publishing ~195.000 DNs:

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=glue dn | grep dn: | wc -l
194834

In GLUE 1.3, we are currently publishing ~68.000 DNs:

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=grid dn | grep dn: | wc -l
67760

When trying to understand this difference, we realized that GLUE 2 is publishing ~135.000 GLUE2ApplicationEnvironment DNs:

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=glue '(objectClass=GLUE2ApplicationEnvironment)' dn | grep dn: | wc -l
134987

These objects add up to ~120MB of information, although the useful information is contained in a single attribute, GLUE2ApplicationEnvironmentAppName. Collecting only this attribute yields 11MB:

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=glue '(objectClass=GLUE2ApplicationEnvironment)' GLUE2ApplicationEnvironmentAppName | grep GLUE2ApplicationEnvironmentAppName: > appname
ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=glue '(objectClass=GLUE2ApplicationEnvironment)' > appenv

-rw-r--r--. 1 root root 120M May 22 09:54 appenv
-rw-r--r--. 1 root root  11M May 22 09:53 appname

It should also be noted that, due to BUG:101237, there are ~18.000 obsolete GLUE2ApplicationEnvironment objects, of which ~1.700 do not publish a creation time at all:

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=glue '(&(objectClass=GLUE2ApplicationEnvironment)(!(GLUE2EntityCreationTime=2013-05-22*)))' dn | grep dn: | wc -l
17778

ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=glue '(&(objectClass=GLUE2ApplicationEnvironment)(!(GLUE2EntityCreationTime=*)))' dn | grep dn: | wc -l
1665
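
To put the counts above in proportion, the short calculation below uses only the numbers reported on this page for 22 May 2013; the variable names are illustrative.

```python
# Counts taken from the ldapsearch results above (22 May 2013).
glue2_dns = 194834          # all DNs under o=glue
glue1_dns = 67760           # all DNs under o=grid
app_env_dns = 134987        # GLUE2ApplicationEnvironment DNs
obsolete_app_env = 17778    # creation time different from 2013-05-22
no_creation_time = 1665     # no GLUE2EntityCreationTime at all

other_glue2_dns = glue2_dns - app_env_dns
app_env_share = app_env_dns / glue2_dns
obsolete_share = obsolete_app_env / app_env_dns

print("GLUE 2 DNs that are not ApplicationEnvironment: %d" % other_glue2_dns)
print("ApplicationEnvironment share of GLUE 2 DNs: %.1f%%" % (100 * app_env_share))
print("Obsolete share of ApplicationEnvironment DNs: %.1f%%" % (100 * obsolete_share))
```

Without the GLUE2ApplicationEnvironment objects, GLUE 2 would publish fewer DNs than GLUE 1.3.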

GluePolicy/GLUE2ComputingShare attributes

The following table explains how the GluePolicy and GLUE2ComputingShare attributes are defined by the information providers. Note that there is a more detailed table for PBS available in the CREAM Sys Admin guide.

| Glue 1.3 attribute | GLUE 2.0 attribute | PBS/Torque queue attribute | LSF queue attribute | SGE queue attribute |
| GluePolicyMaxWallClockTime | GLUE2ComputingShareDefaultWallTime (*) | resources_default.walltime if defined, otherwise resources_max.walltime (seconds, or [[HH:]MM:]SS) | RUNLIMIT (hours:minutes or minutes) | cpu |
| GluePolicyMaxObtainableWallClockTime | GLUE2ComputingShareMaxWallTime (*) | resources_max.walltime (seconds, or [[HH:]MM:]SS) | RUNLIMIT (hours:minutes or minutes) | h_rt |
| GluePolicyMaxCPUTime | GLUE2ComputingShareDefaultCPUTime (*) | min(resources_default.cput, resources_default.pcput) if defined, min(resources_max.cput, resources_max.pcput) otherwise (seconds, or [[HH:]MM:]SS) | CPULIMIT (hours:minutes or minutes) | cpu |
| GluePolicyMaxObtainableCPUTime | GLUE2ComputingShareMaxCPUTime (*) | min(resources_max.cput, resources_max.pcput) (seconds, or [[HH:]MM:]SS) | CPULIMIT (hours:minutes or minutes) | h_rt |
| GluePolicyMaxTotalJobs | GLUE2ComputingShareMaxTotalJobs | max_queuable | Complex computation | - |
| GluePolicyMaxRunningJobs | GLUE2ComputingShareMaxRunningJobs | max_running | Complex computation | - |
| GluePolicyMaxWaitingJobs | GLUE2ComputingShareMaxWaitingJobs | max_queuable - max_running | Complex computation | - |
| GluePolicyMaxSlotsPerJob | GLUE2ComputingShareMaxSlotsPerJob | resources_default.procct if defined, else resources_max.procct | Complex computation | - |
| GluePolicyAssignedJobSlots | GLUE2ComputingShareAssignedJobSlots | np (from pbsnodes -a -s) | Complex computation | - |

  • For PBS/Torque, the queue configuration is retrieved using qstat -Q -f in most cases.
  • For LSF, the queue configuration is retrieved using bqueues -l in most cases, but many values require a complex computation due to the nature of LSF.
  • For SGE, the queue configuration is retrieved using qconf -sq opsgrid | egrep "(h_rt|h_cpu)" and the results are converted to seconds by the info provider.
  • (*) Known issue: time is published in hours in GLUE 1.3 (should be minutes) and in minutes in GLUE 2 (should be seconds). See BUG:101076 and the documented CREAM Known Issue for more details.
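
The time formats listed in the table can be normalised to seconds, the unit GLUE 2 expects according to the note above. This is a minimal sketch, not the code of the information providers: the function names are hypothetical and it assumes plain integer fields (no fractional values or unit suffixes).

```python
def pbs_time_to_seconds(value):
    """Convert a PBS/Torque time given as seconds or [[HH:]MM:]SS to seconds."""
    parts = [int(p) for p in value.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("unexpected PBS time format: %r" % value)
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def lsf_limit_to_seconds(value):
    """Convert an LSF limit given as minutes or hours:minutes to seconds."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        minutes = parts[0]
    elif len(parts) == 2:
        minutes = parts[0] * 60 + parts[1]
    else:
        raise ValueError("unexpected LSF limit format: %r" % value)
    return minutes * 60


def pbs_default_walltime(queue):
    """resources_default.walltime if defined, otherwise resources_max.walltime."""
    raw = queue.get("resources_default.walltime") or queue.get("resources_max.walltime")
    return None if raw is None else pbs_time_to_seconds(raw)


print(pbs_time_to_seconds("36:00:00"))
print(lsf_limit_to_seconds("48:00"))
print(pbs_default_walltime({"resources_max.walltime": "72:00:00"}))
```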

GLUE 1 and GLUE 2 mismatch

The following YAIM variables are used for both GLUE 1 and GLUE 2:

  • CE_LOGCPU: Total number of cores/hyperthreaded CPUs in the SubCluster
  • CE_PHYSCPU: Total number of real CPUs/physical chips in the SubCluster

However the definitions in GLUE 2 are:

  • GLUE2ExecutionEnvironmentLogicalCPUs: The number of logical CPUs in one Execution Environment instance, i.e. typically the number of cores per Worker Node
  • GLUE2ExecutionEnvironmentPhysicalCPUs: The number of physical CPUs in one ExecutionEnvironment instance, i.e. the number of sockets per Worker Node

This means that in GLUE 2 we are currently publishing information that is wrong according to the definition of these attributes. This is tracked by the CREAM developers in their list of known issues.
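
The difference between the two definitions can be illustrated with a small calculation. The sketch below assumes a homogeneous SubCluster and a known number of Worker Nodes; the function name and the node count parameter are hypothetical and are not YAIM variables.

```python
def per_worker_node_cpus(ce_logcpu, ce_physcpu, worker_nodes):
    """Derive the per-node GLUE 2 values from the SubCluster totals.

    ce_logcpu / ce_physcpu are totals for the whole SubCluster, whereas
    GLUE 2 expects the values for one Execution Environment instance.
    """
    if worker_nodes <= 0:
        raise ValueError("worker_nodes must be positive")
    logical, rest_logical = divmod(ce_logcpu, worker_nodes)
    physical, rest_physical = divmod(ce_physcpu, worker_nodes)
    if rest_logical or rest_physical:
        raise ValueError("totals are not a multiple of the node count")
    return {
        "GLUE2ExecutionEnvironmentLogicalCPUs": logical,
        "GLUE2ExecutionEnvironmentPhysicalCPUs": physical,
    }


# Example: 10 identical nodes with 2 sockets and 16 cores each.
print(per_worker_node_cpus(ce_logcpu=160, ce_physcpu=20, worker_nodes=10))
```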

Field Work

The following tables track tickets opened to sites to follow up on incorrect values published in the information system. Some of these issues are related to bugs in the information providers and some of them are due to misconfigurations in the sites. The tables below try to summarise the findings related to wrong storage and computing information to find common patterns in case of misconfigurations and to make sure information providers are fixed when a bug is found.

Apart from the BDII, the following sources of information have been used:

General Storage

| GGUS ticket | Summary | Cause | Affected Service | Affected GLUE Attributes | Affected site |
| GGUS:87570 | Strange space values | Not known yet. No reaction from the site | DPM | Unknown | TR-03-METU |
| GGUS:90219 | Negative used space | Not known yet. Seems to be a bug in the DPM Information Providers | DPM | GlueSAStateUsedSpace: -3987365412 | UKI-LT2-RHUL |
| GGUS:90319 | Strange space values | Investigations are ongoing. Seems to be caused by an upgrade of the information providers (old static LDIF file) | dCache | GlueSAUsedOnlineSize: 0; GlueSAStateUsedSpace: 999999; GlueSAReservedOnlineSize: 0; GlueSATotalOnlineSize: 0 | INFN-ROMA1-CMS |
| GGUS:90321 | Negative free space | Not known yet. No reaction from the site | StoRM | GlueSAFreeOnlineSize | INFN-T1 |
| GGUS:90325 | Strange space values | Not known yet. No reaction from the site | StoRM | GlueSETotalOnlineSize: 51147657; GlueSEUsedOnlineSize: 0; GlueSAStateUsedSpace: 0; GlueSATotalOnlineSize: 1400000 | INFN-BARI |
| GGUS:90328 | Strange space values | Not known yet. Seems to be a bug in StoRM | StoRM | GlueSETotalOnlineSize: 105011; GlueSEUsedOnlineSize: 0 | UKI-SOUTHGRID-BRIS-HEP |

General Computing

| GGUS ticket | Summary | Cause | Affected Service | Affected GLUE Attributes | Affected site |
| GGUS:88754 | Empty Value | Not specified but fixed by the site | CREAM PBS | GlueCEPolicyMaxCPUTime: 0; GlueCEPolicyMaxWallClockTime: 0 | IN2P3-CPPM |
| GGUS:88772 | 999999999 Value | Due to GGUS:82902. For LSF it will be fixed in EMI 2 Update 8 scheduled for end January 2013 | CREAM LSF | GlueCEPolicyMaxCPUTime: 999999999 | UKI-NORTHGRID-LANCS-HEP |
| GGUS:88773 | 999999999 Value | Due to GGUS:82902. For LSF it will be fixed in EMI 2 Update 8 scheduled for end January 2013 | CREAM LSF | GlueCEPolicyMaxCPUTime: 9999999999 | INFN-PISA |
| GGUS:88781 | 99999999 Value | Site manually setting MaxCPUTime=MaxWallClockTime. Fixed after LHCb requested it. | CREAM LSF | GlueCEPolicyMaxCPUTime: 999999999 | UKI-NORTHGRID-SHEF-HEP |
| GGUS:88822 | 999999999 Value | Fixed manually but also suffering from GGUS:82902 | CREAM SGE | GlueCEPolicyMaxCPUTime: 999999999 | UKI-LT2-QMUL |
| GGUS:89847 | Unexpected value | Not known yet. No reaction from the site. | Unknown | Unknown | RO-07-NIPNE |
| GGUS:89857 | 999999999 Value | Fixed manually but also suffering from GGUS:82902 | CREAM SGE | GlueCEPolicyMaxCPUTime: 999999999 | FZK-LCG2 |

Site BDII not published as part of the site

Site GGUS ticket Comments
IN-DAE-VECC-02 GGUS:93809 DONE
INDIACMS-TIFR GGUS:93808 DONE site BDII now published as part of the site. Stop publishing top BDII as part of the site (in fact publishing CERN top BDII)
NO-NORGRID-T2 GGUS:93810 DONE
praguelcg2 GGUS:93197 DONE Publishing CERN top level BDII as the site top level BDII
SE-SNIC-T2 GGUS:93835 DONE
T2_Estonia GGUS:93801 DONE
UKI-LT2-IC-HEP GGUS:94096 DONE
ru-Moscow-SINP-LCG2 GGUS:99498 ALERT!

GLUE2ComputingShareMaxCPUTime

Campaign to fix LHCb sites that are publishing the default 9999999 in the LHCb queues dashboard. Note that this is a recurring problem and this is now dynamically tracked in the Monitoring Dashboard.

Also note that, as requested in GGUS:97721, the CREAM developers will implement a change in the info provider to distinguish unlimited from undefined values.

Site GGUS ticket Comments
BG01-IPP GGUS:94621 DONE OK after fixing configuration error in the batch system configuration
GRISU-UNINA GGUS:94718 DONE OK after fixing configuration error. Missing resources_max.cput and resources_default.cput
INFN-T1 - DONE Published value in BDII is correct: the LHCb dashboard was showing a wrong value and is now up to date
INFN-TRIESTE GGUS:94554 DONE LHCb pointing to wrong queue. The correct queue publishes non default values
RUG-CIT - DONE Published value in BDII is correct: the LHCb dashboard was showing a wrong value and is now up to date
SARA-MATRIX GGUS:94619 DONE No limits. For LHCb limit has been configured as requested: LHCb dashboard was pointing to a queue that was in fact not supported for LHCb
UA-KNU GGUS:94720 DONE OK after fixing configuration error. Missing resources_max.cput and resources_default.cput: LHCb dashboard has removed this queue since in fact it is not supported for LHCb
UKI-LT2-IC-HEP GGUS:95315 DONE LHCb pointing to the wrong queue. The correct queue publishes non default values
UKI-LT2-QMUL GGUS:94510 DONE OK after upgrading to EMI 3
UKI-NORTHGRID-SHEF-HEP GGUS:94618 DONE OK after fixing configuration error. Missing resources_max.cput and resources_default.cput
UNI-DORTMUND GGUS:94717 DONE Testing CE published as 'Production' used by LHCb. The production CE publishes correct Max CPU times
UNINA-EGEE GGUS:94719 DONE OK after fixing configuration error. Missing resources_max.cput and resources_default.cput
RU-SPbSU GGUS:94620 DONE OK after fixing configuration error in /etc/lrms/pbs.conf. Hostname was not defined

Operating System information

Campaign to get LHCb sites to publish a coherent OS name, version and release. The relevant GLUE attributes are:

GLUE 1.3 GLUE 2.0
GlueHostOperatingSystemRelease GLUE2ExecutionEnvironmentOSVersion
GlueHostOperatingSystemName GLUE2ExecutionEnvironmentOSName
GlueHostOperatingSystemVersion NA
NA GLUE2ExecutionEnvironmentOSFamily

For SL, the following mapping between version and release must be respected:

  • SL 4 series - Beryllium
  • SL 5 series - Boron
  • SL 6 series - Carbon

The instructions on how to publish OS information are described by EGI in the HOWTO05 manual.
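
A consistency check for the SL mapping listed above could look like the sketch below. The function and parameter names are hypothetical; "number" stands for the numeric value (for example 6.4) and "codename" for the name (for example Carbon), whichever GLUE attribute each is published in.

```python
# Mapping taken from the list above.
SL_CODENAMES = {4: "Beryllium", 5: "Boron", 6: "Carbon"}


def expected_codename(number):
    """Return the codename for an SL version such as '6.4', or None if unknown."""
    major = int(str(number).split(".")[0])
    return SL_CODENAMES.get(major)


def is_consistent(number, codename):
    """True if the published codename matches the published numeric version."""
    expected = expected_codename(number)
    return expected is not None and expected.lower() == codename.strip().lower()


print(expected_codename("6.4"))
print(is_consistent("5.8", "Boron"))
print(is_consistent("6.3", "Boron"))
```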

Site GGUS ticket comments
BMEGrid GGUS:94840 DONE Fixed wrong OS Name + version
CY-01-KIMON GGUS:94854 DONE Fixed inconsistent OS release + Version
GRISU-UNINA GGUS:94855 DONE Fixed inconsistent OS release + Version
IFJ-PAN-BG GGUS:94856 DONE Fixed inconsistent OS release + Version
INFN-CATANIA GGUS:94857 DONE Fixed inconsistent OS release + Version
INFN-FERRARA GGUS:94841 DONE Fixed wrong OS Name + version
INFN-NAPOLI-ATLAS GGUS:94842 and GGUS:94858 DONE Wrong OS Name + version and Inconsistent OS release + Version
INSU01-PARIS GGUS:94859 ALERT! release is wrong
PSNC GGUS:94860 DONE Fixed inconsistent OS release + Version
RAL-LCG2 GGUS:94861 DONE Fixed inconsistent OS release + Version
RO-07-NIPNE GGUS:94862 DONE Inconsistent OS release + Version
RO-11-NIPNE GGUS:94864 DONE Fixed inconsistent OS release + Version
RO-15-NIPNE GGUS:94865 DONE Fixed inconsistent OS release + Version
RU-SPbSU GGUS:94844 DONE Wrong OS Name + version
Ru-Troitsk-INR-LCG2 GGUS:94866 DONE Fixed inconsistent OS release + Version
TECHNION-HEP GGUS:94867 DONE Fixed inconsistent OS release + Version
UA-KNU GGUS:94845 DONE Fixed wrong OS Name
UKI-LT2-Brunel GGUS:94868 DONE Fixed inconsistent OS release + Version
UKI-LT2-QMUL GGUS:94869 DONE Fixed inconsistent OS release + Version
UKI-NORTHGRID-LANCS-HEP GGUS:94870 DONE Fixed inconsistent OS release + Version
UKI-NORTHGRID-MAN-HEP GGUS:94871 DONE Fixed inconsistent OS release + Version
UKI-SCOTGRID-ECDF GGUS:94873 DONE Inconsistent OS release + Version
UKI-SOUTHGRID-BHAM-HEP GGUS:94879 DONE Fixed wrong OS release
UNINA-EGEE GGUS:94874 DONE Fixed inconsistent OS release + Version

Storage Service and Share Capacity

TotalSize <> ReservedSize + FreeSize + UsedSize

Some storage services publish storage capacity attributes in such a way that TotalSize = ReservedSize + FreeSize + UsedSize can never hold. The glue-validator applies the following workarounds, taking each storage service into account:

  • All: even if some attribute is missing, the calculation is done with the published attributes.
  • DPM: ReservedSize is always equal to TotalSize in space tokens. Therefore, ReservedSize is not used in the calculation.
  • dCache: For online capacity, the numbers always match although ReservedSize is not published. For nearline capacity, FreeSize is not published, so it is not possible to make the numbers match (FreeSize is omitted because it can be derived as Free = Total - Used).
  • StoRM: For online capacity, the numbers always match. There are some issues with nearline capacity. In general, there is no need for any workaround in the StoRM case.
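
The workarounds above can be summarised in a few lines. This is only a sketch of the rules as described on this page, not the actual glue-validator code; the function name and the tolerance parameter are assumptions, and the nearline cases are not treated specially.

```python
def capacity_matches(service, total, used=None, free=None, reserved=None, tolerance=0):
    """Check TotalSize against the sum of the other published size attributes.

    Attributes that are not published are passed as None and left out of
    the sum, as described for the "All" case above.
    """
    if service == "DPM":
        # For DPM space tokens ReservedSize always equals TotalSize,
        # so it is not used in the calculation.
        reserved = None
    published = [value for value in (used, free, reserved) if value is not None]
    return abs(total - sum(published)) <= tolerance


print(capacity_matches("DPM", total=100, used=40, free=60, reserved=100))
print(capacity_matches("dCache", total=100, used=30, free=70))
print(capacity_matches("StoRM", total=100, used=40, free=50))
```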

Site GGUS ticket comments
pic GGUS:95668 DONE dCache. pic confirms numbers > 1 million GB (*) or < 1000 GB are correct.
UKI-SCOTGRID-GLASGOW GGUS:95816 DONE DPM. OK after understanding how DPM is calculating numbers
UKI-SCOTGRID-ECDF GGUS:95817 DONE DPM. OK after understanding how DPM is calculating numbers
praguelcg2 GGUS:96326 ALERT! DPM: wrong numbers for unreserved space, which is a known DPM issue
INFN-T1 GGUS:95665 ALERT! StoRM. This is a known StoRM issue tracked in GGUS:95666
RUG-CIT GGUS:95666 ALERT! StoRM. Values > 1 million GB (*) are correct. Nearline storage seems to be a problem. See GGUS:95666
TR-03-METU GGUS:95667 DONE DPM. Wrong service capacity numbers have been fixed

(*) Note that glue-validator was checking whether the storage capacity was higher than 1 million GB instead of 1 billion GB.

Capacity > 1 billion GB

Site GGUS ticket comments
INFN-PISA GGUS:96486 DONE incorrect values in the YAIM variable
CSCS-LCG2 GGUS:96489 DONE Site thought values should be published in bytes instead of GB
UNI-FREIBURG GGUS:96490 DONE Fixed value in /etc/dcache/info-provider.xml
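
The threshold check behind this campaign is simple to sketch. The code below is illustrative only: the function name is hypothetical, and the hint about bytes assumes decimal units (1 GB = 10^9 bytes), which matches the kind of mistake reported by one of the sites above.

```python
BYTES_PER_GB = 10**9   # assumption: decimal units
LIMIT_GB = 10**9       # 1 billion GB


def capacity_warning(value_gb, limit_gb=LIMIT_GB):
    """Return a warning for implausibly large capacities, or None if plausible."""
    if value_gb <= limit_gb:
        return None
    return (
        "capacity of %d GB exceeds %d GB; if the value was published in bytes "
        "it would be %d GB" % (value_gb, limit_gb, value_gb // BYTES_PER_GB)
    )


print(capacity_warning(500000))
print(capacity_warning(2 * 10**15))
```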

444444 waiting jobs

Check the following EGI manual for more details. The table below contains the list of batch system related GLUE attributes that are dynamically modified by the batch system information providers, and whether they are known to be properly calculated and actually updated by the information providers:

Info Provider GLUE 1 attribute GLUE 2 attribute Status
info dynamic scheduler GlueCEStateTotalJobs GLUE2ComputingShareTotalJobs
GlueCEStateRunningJobs GLUE2ComputingShareRunningJobs
GlueCEStateWaitingJobs GLUE2ComputingShareWaitingJobs
GlueCEStateEstimatedResponseTime GLUE2ComputingShareEstimatedAverageWaitingTime
GlueCEStateWorstResponseTime GLUE2ComputingShareEstimatedWorstWaitingTime
GlueCEStateFreeJobSlots GLUE2ComputingShareFreeSlots
NA GLUE2ComputingShareUsedSlots Not published
info dynamic LSF GlueCEPolicyMaxCPUTime GLUE2ComputingShareDefaultCPUTime
GlueCEPolicyMaxObtainableCPUTime GLUE2ComputingShareMaxCPUTime
GlueCEPolicyMaxWallClockTime GLUE2ComputingShareDefaultWallTime
GlueCEPolicyMaxObtainableWallClockTime GLUE2ComputingShareMaxWallTime
GlueCEPolicyMaxTotalJobs GLUE2ComputingShareMaxTotalJobs
GlueCEPolicyMaxRunningJobs GLUE2ComputingShareMaxRunningJobs
GlueCEPolicyMaxWaitingJobs GLUE2ComputingShareMaxWaitingJobs Calculated in v 2.3.1-3
GlueCEPolicyMaxSlotsPerJob GLUE2ComputingShareMaxSlotsPerJob Calculated in v 2.3.1-3
NA GLUE2ComputingShareMaxMainMemory Calculated in v 2.3.1-3
NA GLUE2ComputingShareMaxVirtualMemory Calculated in v 2.3.1-3
info dynamic PBS GlueCEPolicyMaxCPUTime GLUE2ComputingShareDefaultCPUTime
GlueCEPolicyMaxObtainableCPUTime GLUE2ComputingShareMaxCPUTime
GlueCEPolicyMaxWallClockTime GLUE2ComputingShareDefaultWallTime
GlueCEPolicyMaxObtainableWallClockTime GLUE2ComputingShareMaxWallTime
GlueCEPolicyMaxTotalJobs GLUE2ComputingShareMaxTotalJobs
GlueCEPolicyMaxRunningJobs GLUE2ComputingShareMaxRunningJobs
GlueCEPolicyMaxWaitingJobs GLUE2ComputingShareMaxWaitingJobs
GlueCEPolicyMaxSlotsPerJob GLUE2ComputingShareMaxSlotsPerJob Calculated in v. 2.3.2-2
NA GLUE2ComputingShareMaxMainMemory
NA GLUE2ComputingShareMaxVirtualMemory
info dynamic SGE GlueCEPolicyMaxCPUTime GLUE2ComputingShareDefaultCPUTime
GlueCEPolicyMaxObtainableCPUTime GLUE2ComputingShareMaxCPUTime
GlueCEPolicyMaxWallClockTime GLUE2ComputingShareDefaultWallTime
GlueCEPolicyMaxObtainableWallClockTime GLUE2ComputingShareMaxWallTime
GlueCEPolicyMaxTotalJobs GLUE2ComputingShareMaxTotalJobs
GlueCEPolicyMaxRunningJobs GLUE2ComputingShareMaxRunningJobs
GlueCEPolicyMaxWaitingJobs GLUE2ComputingShareMaxWaitingJobs
GlueCEPolicyMaxSlotsPerJob GLUE2ComputingShareMaxSlotsPerJob  
NA GLUE2ComputingShareMaxMainMemory
NA GLUE2ComputingShareMaxVirtualMemory
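
When a dynamic information provider does not run, the static placeholder values remain published. A simple detector is sketched below; the placeholder table only contains values mentioned on this page (444444 waiting jobs, 999999999 for the maximum CPU time) and is an assumption, not a complete list of defaults.

```python
# Placeholder values mentioned on this page; not a complete list.
PLACEHOLDERS = {
    "GlueCEStateWaitingJobs": 444444,
    "GLUE2ComputingShareWaitingJobs": 444444,
    "GlueCEPolicyMaxCPUTime": 999999999,
}


def placeholder_attributes(entry):
    """Return the attributes of entry that still carry a placeholder value."""
    return sorted(
        name
        for name, value in entry.items()
        if name in PLACEHOLDERS and int(value) == PLACEHOLDERS[name]
    )


share = {
    "GLUE2ComputingShareWaitingJobs": "444444",
    "GLUE2ComputingShareRunningJobs": "12",
}
print(placeholder_attributes(share))
```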

SITE GGUS ticket comments
CERN-PROD GGUS:96529 DONE Some variables were not published by the LSF info provider. This has been fixed in the code for a future release and directly in the production CEs.
UKI-SCOTGRID-DURHAM GGUS:96530 DONE This was a transient error, it's fixed now
UKI-SCOTGRID-GLASGOW GGUS:96528 DONE The site was missing the directory where ERT/WRT calculations are stored. This is normally created by YAIM, but for some reason this wasn't created. After manual creation, it published proper values.
UKI-SOUTHGRID-RALPP GGUS:96531 DONE lcg-info-dynamic-scheduler-pbs was not installed in the CLUSTER node that was running on a different host than the CREAM CE. CREAM developers will make sure this is properly documented in the CLUSTER installation notes.

Publishing Domain called 'resource'

This problem seems to be related to a bug in a quattor template.

SITE GGUS ticket comments
BEgrid-ULB-VUB GGUS:18121 DONE
BEIJING-LCG2 GGUS:98120 DONE
IN2P3-CPPM GGUS:98118 DONE
IN2P3-IPNL GGUS:98116 DONE
IN2P3-IRES GGUS:98114 DONE
IN2P3-LAPP GGUS:99359 DONE
IN2P3-LPSC GGUS:98117 DONE
M3PEC GGUS:98112 DONE
MSFG-OPEN GGUS:98113 DONE
OBSPM GGUS:98111 DONE
RWTH-Aachen GGUS:98119 DONE
UNIV-LILLE GGUS:98115 DONE

Site Monitoring

25.11.2013: Follow-up on the status of sites with errors raised by glue-validator when validating against the GLUE 2 profile.

SITE GGUS ticket comments
Australia-ATLAS GGUS:99115 DONE Upgraded - E002: +4000 obsolete entries (app env objects)
BEIJING-LCG2 GGUS:99116 DONE Upgraded - E002: 4 obsolete entries (WMS)
BEgrid-ULB-VUB GGUS:99117 ALERT! Not upgraded but obsolete entries gone - E002: 6 obsolete entries (CREAM)
Belgrid-UCL GGUS:98995, GGUS:98996 DONE Upgraded - E002: 19 obsolete entries (app env objects); DONE Condor queue. GLUE1 OK but GLUE2 incorrect. Sys admin wrote info provider for GLUE 2 and now correct. E022, E023 and E024: Default values are published. 3 shares affected in the same CREAM CE
CA-ALBERTA-WESTGRID-T2 GGUS:98987, GGUS:98989 DONE Upgraded - E002: 901 obsolete entries (app env objects); E022, E023 and E024: Default values are published. 4 shares affected in the same CREAM CE
CA-MCGILL-CLUMEQ-T2 GGUS:99118 DONE Reconfigured CEs - E022, E023 and E024: Default values are published. 5 shares affected in the same CREAM CE
CA-SCINET-T2 GGUS:99119, GGUS:98997 E002: +1000 obsolete entries (app env objects); E022, E023 and E024: Default values are published. 3 shares affected in the same CREAM CE
CA-VICTORIA-WESTGRID-T2 GGUS:99121 DONE Upgraded - E002: +2000 obsolete entries (app env objects)
CSCS-LCG2 GGUS:99129 ALERT! Due to ARC validity problem - E002: 38 obsolete entries
CYFRONET-LCG2 GGUS:99135 DONE VOBOX no longer runs resource BDII. E002: 4 obsolete entries (several services)
FZK-LCG2 GGUS:99136 DONE Upgraded - E002: 4 obsolete entries
GRIF GGUS:98990 E002: 168 obsolete entries (objects from several CREAM CEs and RTEs)
IEPSAS-Kosice GGUS:99137 DONE Upgraded - E002: +1000 obsolete entries (app env objects)
IFIC-LCG2 GGUS:99138/GGUS:98998, GGUS:99139/GGUS:98999 DONE Upgraded - E002: 33 obsolete entries; DONE /var/tmp/info-dynamic-scheduler-generic deleted - E022, E023 and E024: Default values are published. 32 shares affected
IL-TAU-HEP GGUS:99140 DONE /var/tmp/info-dynamic-scheduler-generic was missing. E022, E023 and E024: Default values are published. 12 shares affected
IN2P3-IRES GGUS:99141 E002: 10 obsolete entries (several services)
IN2P3-LPC GGUS:99142 DONE Upgraded - E002: +1000 obsolete entries (app env objects)
INFN-BARI GGUS:99143 DONE Upgraded and decommissioned old CEs - E002: 11 obsolete entries (app env objects)
INFN-CATANIA GGUS:99144 DONE Upgraded - E002: 6 obsolete entries (several services)
INFN-CNAF-LHCB GGUS:99145, GGUS:99146 DONE Upgraded - E002: +1000 obsolete entries (app env objects); DONE CE upgraded - E022, E023 and E024: Default values are published. 6 shares affected
INFN-FRASCATI GGUS:99147 DONE Upgraded - E002: +2000 obsolete entries (app env objects)
INFN-MILANO-ATLASC GGUS:99149, GGUS:99150 DONE Upgraded - E002: +300 obsolete entries (app env objects); DONE A lot of manual config for Condor. Some files needed by info providers were missing - E023 and E024: Default values are published. 10 shares affected
INFN-NAPOLI-ATLAS GGUS:99151, GGUS:99152 DONE Upgraded - E002: +1000 obsolete entries (app env objects); DONE "ldap" user missing from maui - E022, E023 and E024: Default values are published. 12 shares affected
INFN-PISA GGUS:99153 DONE CE upgraded - E022, E023 and E024: Default values are published. 6 shares affected
KR-KISTI-GCRT-01 GGUS:99154 DONE Upgraded - E002: 3 obsolete entries (site BDII)
KR-KISTI-GSDC-01 GGUS:99155 DONE Upgraded - E002: 3 obsolete entries (site BDII)
NCP-LCG2 GGUS:99156 E002: 6 obsolete entries (several services)
NDGF-T1 GGUS:99157 ALERT! Validity is very short for many objects (60s)
NIHAM GGUS:99158 DONE Missing lrms_backend_cmd in /etc/lrms/scheduler.conf because not all the configuration targets were specified at once in the YAIM command. E023 and E024: Default values are published. 3 shares affected
NIKHEF-ELPROD GGUS:99159 DONE Upgraded - E002: +500 obsolete entries (app env objects)
PSNC GGUS:99160 DONE Upgraded - E002: 36 obsolete entries (several services)
RAL-LCG2 GGUS:99161, GGUS:99162 DONE Upgraded - E002: 15 obsolete entries (several services); DONE Condor, ARC and CREAM, lots of manual configuration and tuning. E022, E023 and E024: Default values are published. 30-70 shares affected
RO-07-NIPNE GGUS:99163 DONE Upgraded - E002: +200 obsolete entries (app env objects)
RRC-KI GGUS:99164 DONE Upgraded - E002: +1000 obsolete entries (app env objects)
SFU-LCG2 GGUS:99165 DONE emi-torque-utils had to be reinstalled. E022, E023 and E024: Default values are published. 4 shares affected
SiGNET GGUS:99166, GGUS:99167 DONE Upgraded - E002: 5 obsolete entries (app env objects); DONE Obsolete CE - E022, E023 and E024: Default values are published. 2 shares affected
TECHNION-HEP GGUS:99168 DONE /var/tmp/info-dynamic-scheduler-generic was missing. E022, E023 and E024: Default values are published. 12 shares affected
TR-03-METU GGUS:99169 DONE Upgraded - E002: 6 obsolete entries (app env objects)
TR-10-ULAKBIM GGUS:99170 DONE Upgraded - E002: 11 obsolete entries (app env objects)
UB-LCG2 GGUS:99171 DONE Upgraded - E002: 2 obsolete entries (app env objects)
UKI-LT2-IC-HEP GGUS:99172 DONE Upgraded - E002: 3 obsolete entries (app env objects)
UKI-LT2-RHUL GGUS:99173 DONE Upgraded - E002: +2000 obsolete entries (app env objects)
UKI-LT2-UCL-HEP GGUS:99174, GGUS:99176 DONE Upgraded - E002: +900 obsolete entries (app env objects); DONE Upgraded to SL6 and fixed. E022, E023 and E024: Default values are published. 3 shares affected
UKI-NORTHGRID-LANCS-HEP GGUS:99177 DONE Upgraded - E002: 11 obsolete entries (app env objects)
UKI-NORTHGRID-LIV-HEP GGUS:99178 DONE Upgraded - E002: +4000 obsolete entries (app env objects)
UKI-NORTHGRID-MAN-HEP GGUS:98994 DONE Upgraded - E002: obsolete entries (app env objects)
UKI-SCOTGRID-ECDF GGUS:99179, GGUS:99180 ALERT! Old service to be decommissioned - E002: +2000 obsolete entries (app env objects); E022, E023 and E024: Default values are published. 13 shares affected
UKI-SOUTHGRID-RALPP GGUS:100480 DONE Upgraded - E002: +1000 obsolete entries (app env objects)
UNIBE-LHEP GGUS:99182 DONE Upgraded - E002: +2000 obsolete entries (app env objects)
ifae GGUS:99183 DONE Fixed site configuration. E022, E023 and E024: Default values are published. 18 shares affected
pic GGUS:99184 DONE Problem when batch system is also used by non grid users that are unknown to the CE and crashes. E022, E023 and E024: Default values are published. 210 shares affected
praguelcg2 GGUS:99185 DONE Upgraded - E002: +1000 obsolete entries (app env objects)
ru-Moscow-FIAN-LCG2 GGUS:99187 DONE Upgraded - E002: 3 obsolete entries (several services)
ru-PNPI GGUS:99188 DONE Upgraded - E002: 6 obsolete entries (several services)

Storage Share IDs

The following table tracks GGUS tickets opened to LHCb Tier 1 sites that publish many Storage Shares. It is sometimes difficult to understand what type of storage has been allocated to a share and why so many shares need to be defined.

Site GGUS tickets comments
FZK-LCG2 GGUS:99750 Storage Share names come from dCache configuration. Explanations from Paul Millar
SARA-MATRIX GGUS:99809 DONE Confirmed share names and updated them in the dashboard script
IN2P3-CC GGUS:99875 DONE Confirmed share names and updated them in the dashboard script
INFN-T1 GGUS:99888 DONE Confirmed share names and updated them in the dashboard script
RAL-LCG2 GGUS:99889 DONE Confirmed share names and updated them in the dashboard script

Missing Mapping Policy objects

The following table tracks GGUS tickets opened to sites that do not publish the mapping policy for computing shares. Without it, it is not possible to query the computing shares allocated to a particular VO, which is needed to monitor the Max CPU time attribute for the LHCb VO.

Site GGUS tickets comments
IN2P3-CPPM GGUS:100223 Problem with the quattor configuration
GRIF GGUS:100222 Same as above
IN2P3-LAPP GGUS:100221 Same as above
IN2P3-CC-T2    

Cleaning SW Tags

The following table tracks GGUS tickets opened to sites whose CEs prevent VO managers from cleaning the SW tags of their VO.

Site GGUS tickets comments
LHCb
pic GGUS:101037 DONE SW tags published in a CE not allocated for LHCb. They are deleted now
UKI-SCOTGRID-ECDF GGUS:101038  
BG03-NGCC GGUS:101039 DONE No LHCb tags defined in the CE according to the sys admin since the LHCb VO is not supported at the site
ATLAS
CA-MCGILL-CLUMEQ-T2 GGUS:101041 DONE SW tags coming from a test CE have been deleted by the site
CYFRONET-LCG2 GGUS:101042 DONE tags seem to be deleted now by the site
RO-07-NIPNE GGUS:101043, GGUS:106312 DONE atlas tags are published and CE is accessible for ATLAS. Alessandro could try again to delete the tags; DONE Not sure where tags are published as they are not published by their CE (unless it's tbit01.nipne.ro, but it's refusing Alessandro's attempts). Sys admin deleted the tags.
UKI-LT2-QMUL GGUS:101044 DONE site reconfigured the common tags area. Alessandro checked that there were no tags
INFN-ROMA2 GGUS:101049 DONE Site was in downtime for a long time. Tags could be deleted now
TUDresden-ZIH GGUS:101150 DONE SW tags could be eventually deleted
BG03-NGCC GGUS:106313 DONE Alessandro has just tried and cannot remove the tags from their CE (ce02.ngcc.acad.bg). The sys admin has deleted the tags
ITEP GGUS:106314 DONE Cannot remove the tags, get "system error in unlink" even if using uberftp to remove the tag file. Sys admin has deleted the tags
UKI-NORTHGRID-LANCS-HEP GGUS:106316 DONE Cannot remove the tags, get "system error in unlink" even if using uberftp to remove the tag file. Sys admin confirmed permissions of the tag area were 'root' instead of 'sgmatlas'. He has now deleted the SW tags
CMS
BelGrid-UCL GGUS:106813 DONE Tags deleted by the site
INDIACMS-TIFR GGUS:106814 DONE Correct role given to Christoph and tags deleted
INFN-PADOVA GGUS:106815 DONE Christoph deleted the tags after the site made sure he could do it with the correct role
Kharkov-KIPT-LCG2 GGUS:106816 DONE Tags deleted by the site
Ru-Troitsk-INR-LCG2 GGUS:106817 DONE Tags probably deleted by the site
TW-NCUHEP GGUS:106818 ALERT!
UKI-LT2-RHUL GGUS:106819 DONE Tags deleted by the site
UKI-NORTHGRID-SHEF-HEP GGUS:106820 DONE Tags deleted by the sys admin

T1 Storage Deployment

This table tracks GGUS tickets opened to T1s so that they publish coherent storage types, versions and supported VOs in the BDII, as tracked in the Dashboard.

Site GGUS tickets comments
CERN None, contact by mail ALERT! Requested to have a more compact EOS versioning syntax and to remove the "unknown" string from the Castor release
RAL GGUS:106480 DONE Requested to publish meaningful version. Fixed the versioning in the dashboard scripts that were not able to parse correctly ":"
BNL-ATLAS GGUS:106483 DONE Requested to publish storage version. BNL answered that the Classic SE should not be taken into account. Removed from table.
USCMS-FNAL-WC1 GGUS:106504 DONE Requested to publish storage version. FNAL confirms Classic SE doesn't need to be taken into account

BDII vs SRM Storage Capacity

ATLAS

This table tracks GGUS tickets opened to ATLAS sites that publish different storage capacity values in the BDII and in SRM. This is monitored in the Dashboard. The SRM values are taken from Bourricot and the BDII values are obtained with the following queries:

Example for GlueSATotalOnlineSize (the same query is also used for GlueSAFreeOnlineSize, GlueSATotalNearlineSize and GlueSAFreeNearlineSize):

ldapsearch -LLL -x -h SITE-BDII:PORT -b mds-vo-name=SITE-NAME,o=grid -o nettimeout=10 \
    '(&(objectClass=GlueSA)(GlueChunkKey=GlueSEUniqueID=SE)(|(GlueSALocalID=SPACETOKEN)(GlueSALocalID=SPACETOKEN:*)(GlueSALocalID=atlas:SPACETOKEN)))' \
    GlueSATotalOnlineSize | grep GlueSATotalOnlineSize:

The SPACETOKEN names are taken from AGIS looking at each DDM Endpoint.
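As an illustration only, the query above could be generated for each of the four attributes with a small helper like the one below. This is a sketch, not the actual Dashboard code: the function names (build_filter, build_command, extract_values) and the example host, site and space token names are hypothetical. Only the filter layout and the ldapsearch options are taken from the query above.

```python
from typing import List

# The four GLUE 1 storage-area size attributes named in the text.
SIZE_ATTRIBUTES = [
    "GlueSATotalOnlineSize",
    "GlueSAFreeOnlineSize",
    "GlueSATotalNearlineSize",
    "GlueSAFreeNearlineSize",
]


def build_filter(se: str, space_token: str) -> str:
    """Return the LDAP filter of the query above for one SE and space token."""
    return (
        "(&(objectClass=GlueSA)"
        f"(GlueChunkKey=GlueSEUniqueID={se})"
        f"(|(GlueSALocalID={space_token})"
        f"(GlueSALocalID={space_token}:*)"
        f"(GlueSALocalID=atlas:{space_token})))"
    )


def build_command(bdii: str, site: str, se: str,
                  space_token: str, attribute: str) -> List[str]:
    """Return the ldapsearch argument list for one attribute.

    Passing a list (e.g. to subprocess.run) avoids shell quoting of the filter.
    """
    return [
        "ldapsearch", "-LLL", "-x",
        "-h", bdii,
        "-b", f"mds-vo-name={site},o=grid",
        "-o", "nettimeout=10",
        build_filter(se, space_token),
        attribute,
    ]


def extract_values(ldif: str, attribute: str) -> List[int]:
    """Pick the integer values of `attribute` out of LDIF output.

    This mirrors the trailing grep in the query above. Base64 encoded
    lines ("attribute:: ...") are skipped.
    """
    prefix = attribute + ":"
    return [
        int(line[len(prefix):].strip())
        for line in ldif.splitlines()
        if line.startswith(prefix) and not line.startswith(prefix + ":")
    ]
```

Looping over SIZE_ATTRIBUTES and calling build_command once per attribute gives the four queries. The resulting list can be passed to subprocess.run and its standard output to extract_values.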

The following known issues have been identified:

  • dCache sites have no TAPE space tokens, so the comparison script should use the BDII online attributes instead of the BDII nearline attributes (see the FZK-LCG2 ticket).
  • SRM XML files older than 2 days are not taken into account. This explains the differences detected at many sites. Many tickets have been closed for this reason, since the comparison does not make sense if the SRM values are obsolete.
  • The comparison script runs every day at 23:30, shortly after the SRM XML files are regenerated, so that the comparison is done as soon as possible.
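The rules listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the real comparison script: the function names are hypothetical, the 2-day limit is applied to the XML generation time, and the dCache rule is read as "always compare against the online attribute".

```python
from datetime import datetime, timedelta
from typing import Optional

# SRM XML files older than this are not taken into account.
MAX_SRM_AGE = timedelta(days=2)


def srm_is_fresh(xml_generated: datetime, now: datetime) -> bool:
    """True when the SRM XML file is recent enough to be compared."""
    return now - xml_generated <= MAX_SRM_AGE


def total_size_attribute(is_tape_token: bool, is_dcache: bool) -> str:
    """Choose the BDII attribute to compare against the SRM total size.

    Assumption: dCache sites have no TAPE space tokens, so the online
    attribute is used for them even when the token is a tape one.
    """
    if is_tape_token and not is_dcache:
        return "GlueSATotalNearlineSize"
    return "GlueSATotalOnlineSize"


def compare(srm_value: int, bdii_value: int,
            xml_generated: datetime, now: datetime) -> Optional[int]:
    """Return bdii_value - srm_value, or None when the SRM data is stale."""
    if not srm_is_fresh(xml_generated, now):
        return None
    return bdii_value - srm_value
```

Returning None for stale SRM data keeps "no meaningful comparison" distinct from "difference of 0", which matches the reason many of the tickets below were closed.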

Site GGUS tickets comments
Australia-ATLAS GGUS:107916 DONE XML older than 2 days and SRM reporting 0. Fixed automatically for SRM reporting 0.
BEIJING-LCG2 GGUS:107917 DONE XML older than 2 days
CA-ALBERTA-WESTGRID-T2 GGUS:107918 DONE Difference of 4 and 2. Site not serving the ATLAS experiment but ACBR for ATLAS still published in the BDII
CA-SCINET-T2 GGUS:107919 DONE XML older than 2 days and difference of 2. Fixed automatically when comparison script in sync with SRM values generation. BDII publishes 0 for HOTDISK which is not used any more. Shouldn't it be removed from the BDII then?
CYFRONET-LCG2 GGUS:107920 DONE XML older than 2 days
DESY-HH GGUS:107921 DONE Fixed the comparison script since it was comparing the wrong space token
DESY-ZN GGUS:107922 DONE Difference of 3. Fixed automatically when comparison script in sync with SRM values generation.
FZK-LCG2 GGUS:107923 DONE Wrong BDII attributes were used
GoeGrid GGUS:107924 DONE XML older than 2 days
GRIF GGUS:107925 DONE XML older than 2 days
IFIC-LCG2 GGUS:107926 DONE Very different values. Fixed automatically in the next check. Probably due to mismatch in the time SRM and BDII values are taken.
IN2P3-CC GGUS:107927 DONE BDII publishes 0. See FZK-LCG2 as it is related to the same issue.
IN2P3-CPPM GGUS:107928 DONE Difference of 3. Fixed automatically when comparison script in sync with SRM values generation.
IN2P3-LPC GGUS:107929 DONE Very different values. Fixed automatically when comparison script in sync with SRM values generation.
IN2P3-LPSC GGUS:107930 DONE XML older than 2 days
INFN-ROMA1 GGUS:107931 DONE XML older than 2 days
INFN-T1 GGUS:107932 ALERT! Very different values (only for tape; it could be due to the StoRM bug)
NCG-INGRID-PT GGUS:107933 DONE XML older than 2 days
NDGF-T1 GGUS:107934 DONE BDII publishes 0. See FZK-LCG2 as it is related to the same issue.
pic - ALERT! AGIS points to srmatlas.pic.es instead of srm.pic.es where correct values seem to be published for ATLAS space tokens. To be checked with ATLAS
RAL-LCG2 GGUS:107935 ALERT! Very different values
RRC-KI GGUS:107936 DONE Difference of 2. Fixed automatically in the next check. Probably due to mismatch in the time SRM and BDII values are taken.
UKI-SCOTGRID-GLASGOW GGUS:107973 DONE Very different values. It has been solved automatically. Probably due to mismatch in the time SRM and BDII values are taken.

LHCb

Site GGUS tickets comments
CBPF GGUS:105572  
CERN - Different numbers in CASTOR because a small part of the capacity has been put in maintenance. This is reflected in the BDII accounting but not in the SRM one.
Topic revision: r115 - 2014-09-04 - MariaALANDESPRADILLO
 