February 2011 Reports

28th February 2011 (Monday)

Experiment activities:

  • MC production running smoothly.
  • Certification for the next Dirac release is ongoing

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
  • T1
    • Pilots aborted at SARA-MATRIX (GGUS:67983).
    • IN2P3: electrical problem on Saturday; the notice was sent only to an internal mailing list. No problem spotted on the LHCb side.
  • T2 site issues:
    • Some T2 sites are running CREAM CE 1.6.4 and LHCb jobs fail because of BUG:78565, e.g. LCG.BHAM-HEP.uk, LCG.ITWM.de, LCG.JINR.ru, LCG.KIAE.ru, LCG.Krakow.pl. We will submit a GGUS ticket for each of them.

25th February 2011 (Friday)

Experiment activities:

  • MC production running smoothly.
  • Certification for the next Dirac release is ongoing
  • The SAM jobs failing yesterday were due to a DIRAC misconfiguration, which has now been fixed.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Snow ticket on slow interactive usage of lxplus nodes opened (INC014928).
  • T1
    • NTR
  • T2 site issues:
    • Some T2 sites are running CREAM CE 1.6.4 and LHCb jobs fail because of (BUG:78565), e.g. LCG.BHAM-HEP.uk, LCG.ITWM.de, LCG.JINR.ru, LCG.KIAE.ru, LCG.Krakow.pl

24th February 2011 (Thursday)

Experiment activities:

  • MC production running smoothly.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CREAM CE bug on truncation of arguments (BUG:78565). The patch is certified and ready for staged rollout.
    • Snow ticket on AFS slowness opened (INC014928).
  • T1
    • NTR
  • T2 site issues:
    • NTR

23rd February 2011 (Wednesday)

Experiment activities:

  • MC production running smoothly.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

Problem of truncated parameters when submitting to the CREAM CE after the upgrade to 1.6.4 (BUG:78565). A patch was rolled out quickly and installed on a couple of problematic CREAM CEs at GridKA and CERN. LHCb confirms the problem is gone.

  • T0
    • Recovered normal operations after the downtime of all DIRAC services yesterday.
    • There is an ever more evident problem with AFS in general. For all users it is almost impossible to work on lxplus, irrespective of their location in the lab. Users report long times to execute basic AFS commands such as ls or any tab completion. A Snow ticket was opened a couple of days ago (INC014928). The issue is related to this one: http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/IncidentArchive/20110214-lxbatch.htm and we confirm that some nodes are now OK while others (e.g. lxplus435) are still in a bad state.
  • T1
    • NTR
  • T2 site issues:
    • NTR

22nd February 2011 (Tuesday)

Experiment activities:

  • Problem of truncated parameters when submitting to CREAM CE after upgrade to 1.6.4 (BUG:78565)

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • The DIRAC system was drained yesterday ahead of today's intervention/shutdown of the VOBOXes hosting the DIRAC services. The downtime is over and the services have been restarted.

  • T1
    • NTR
  • T2 site issues:
    • NTR

21st February 2011 (Monday)

Experiment activities:

  • MC activity: further productions submitted over the weekend.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
  • T1
    • GridKA: SRM endpoint not working (regardless of the space token). GGUS was not working either; a ticket was opened afterwards (GGUS:67687).
  • T2 site issues:
    • NTR

18th February 2011 (Friday)

Experiment activities:

  • Another drop in the number of running jobs last night. A feature introduced yesterday in the DIRAC pilot software was systematically crashing all pilots. It was discovered early this morning and fixed promptly. Jobs are slowly coming back to the sites.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

To be verified, but it looks like some CREAM CE version has introduced a problem in the execution of pilot jobs; cream-3 at GridKA and ce204/ce205/ce206 at CERN are examples of CREAM CEs showing this peculiar problem. It is not present on other CREAM CE endpoints (like the one at PIC). It would be useful to know which version is installed on these particular nodes.

  • T0
    • NTR
  • T1
    • NTR
  • T2 site issues:
    • NTR

17th February 2011 (Thursday)

Experiment activities:

  • Smooth operations.
  • Focus on the certification of Dirac-v6r0.
New GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 1
Issues at the sites and services
  • T0
    • Shared area issue: investigation ongoing in close contact with LHCb people (GGUS:67264).
  • T1
    • Scheduled downtime at CNAF of the Oracle DB behind the LFC-RO.
  • T2 site issues:

16th February 2011 (Wednesday)

Experiment activities:

  • Yesterday's drop of jobs has been explained by the recent upgrade of the CERN CA (to version 1.38-1). The format of the certificates directory has changed: the [hash].0 files, which used to contain the CA certificates, are now links to [CA].pem files (with support for both openssl 0.9.8 and openssl 1.0 hashing, which has changed). The certificates bundled in DIRAC did not reflect this new structure, so pilot job submission was systematically failing. A quick check of the new layout is sketched below.
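
The following is a minimal sketch (not part of the original report) of how one could verify that a bundled certificates directory follows the new layout, i.e. that the [hash].0 links, for both the old (openssl 0.9.8) and new (openssl 1.0) hash styles, resolve to the [CA].pem files. The directory path is a placeholder and the check only looks at the first link per hash; adapt it to the actual DIRAC bundle location.

<verbatim>
#!/usr/bin/env python
"""Sanity-check a CA certificates directory against the new layout:
[hash].0 links pointing to [CA].pem files, for both openssl hash styles."""
import os
import subprocess

CERT_DIR = "/path/to/bundled/certificates"  # placeholder, not the real DIRAC location


def subject_hash(pem_path, old=False):
    """Return the openssl subject hash of a certificate (old = 0.9.8-style hash)."""
    flag = "-subject_hash_old" if old else "-subject_hash"
    out = subprocess.check_output(["openssl", "x509", flag, "-noout", "-in", pem_path])
    return out.strip().decode()


def check_dir(cert_dir):
    for name in sorted(os.listdir(cert_dir)):
        if not name.endswith(".pem"):
            continue
        pem = os.path.join(cert_dir, name)
        for old in (False, True):
            style = "old" if old else "new"
            link = os.path.join(cert_dir, subject_hash(pem, old) + ".0")
            if not os.path.exists(link):
                print("MISSING %s-style hash link for %s" % (style, name))
            elif os.path.islink(link):
                print("OK      %s -> %s (%s hash)" % (os.path.basename(link), name, style))
            else:
                print("NOTE    %s exists but is a plain file (%s hash)" % (os.path.basename(link), style))


if __name__ == "__main__":
    check_dir(CERT_DIR)
</verbatim>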

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Shared area issue: investigation ongoing in close contact with LHCb people (GGUS:67264).
    • Issue with the BDII reporting inconsistent information for the CERN queues: what is the bottom line?
  • T1
    • NTR
  • T2 site issues:
    • NTR

15th February 2011 (Tuesday)

Experiment activities:

  • After a period with MC running steadily at full steam (40-50K jobs/day), last night there was a drop in the number of jobs due to an internal DIRAC issue (an incompatibility between pilot and central service versions after a recent patch release).
  • Defining the road map for the major DIRAC release, which should hopefully happen on Tuesday, when the power cut in the CC will in any case force a draining of the system.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
  • T1
    • GridKA: hundreds of jobs are no longer visible in LHCb but their processes are left running as zombies, absorbing 20-30 GB of virtual memory. Our contact person at GridKA is looking at these processes with strace/gdb and will inform the core software developers.
  • T2 site issues:
    • NTR

14th February 2011 (Monday)

Experiment activities:

  • MC and user activity smooth operations.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • Some AFS slowness observed. Opened (as requested) a GGUS ticket on Friday afternoon with all available information about the problem (GGUS:67264).
    • ce114.cern.ch: many pilots aborting with Globus error 12, symptomatic of a misconfiguration of the gatekeeper (GGUS:67253). No news since Friday.
    • Sysadmins approached me on Friday (after the WLCG ops meeting) reporting a huge amount of pending jobs in the grid queues at CERN. This is similar to the problem experienced at GridKA with the BDII reporting inconsistent information, for which the Rank was 0 and was therefore erroneously attracting jobs. Ulrich put a patch in pre-production on Saturday that seems to improve the situation (but not completely). A BDII query to inspect the published values is sketched after this list.
  • T1
    • CNAF: a bunch of jobs failing at around 4 pm on Sunday while setting up the runtime environment; shared area issue (GGUS:67282).
  • T2 site issues:
    • NTR
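
The BDII inconsistency mentioned above (a Rank of 0 erroneously attracting jobs to an already loaded site) comes down to the Glue values a CE publishes. Below is a minimal sketch, not part of the report, of how one might inspect those values with ldapsearch driven from Python; the BDII endpoint and the CE name pattern are assumptions to be replaced with the real ones.

<verbatim>
#!/usr/bin/env python
"""Dump the Glue job-state values published for a set of CEs in the BDII,
to spot suspicious combinations (e.g. zero estimated response time on a full site)."""
import subprocess

BDII = "ldap://lcg-bdii.cern.ch:2170"   # top-level BDII endpoint (assumed)
FILTER = "(GlueCEUniqueID=*cern.ch*)"   # example pattern: restrict to CERN CEs
ATTRS = ["GlueCEUniqueID",
         "GlueCEStateWaitingJobs",
         "GlueCEStateRunningJobs",
         "GlueCEStateEstimatedResponseTime"]


def query_bdii():
    """Run ldapsearch against the BDII and return the raw LDIF output."""
    cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid", FILTER] + ATTRS
    return subprocess.check_output(cmd).decode()


def report(ldif):
    """Print one line per CE entry with its published job-state values."""
    entry = {}
    for line in ldif.splitlines() + [""]:
        if not line.strip():
            if "GlueCEUniqueID" in entry:
                print("%-60s waiting=%-6s running=%-6s ERT=%s" % (
                    entry.get("GlueCEUniqueID"),
                    entry.get("GlueCEStateWaitingJobs", "?"),
                    entry.get("GlueCEStateRunningJobs", "?"),
                    entry.get("GlueCEStateEstimatedResponseTime", "?")))
            entry = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            entry[key.strip()] = value.strip()


if __name__ == "__main__":
    report(query_bdii())
</verbatim>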

11th February 2011 (Friday)

Experiment activities:

  • MC and user activity smooth operations.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Some AFS slowness observed. This might explain why an increased fraction of MC jobs is failing at CERN (of the order of 20%). We will throttle the number of MC jobs.
    • ce114.cern.ch: many pilots aborting with Globus error 12, symptomatic of a misconfiguration of the gatekeeper (GGUS:67253).
  • T1
    • CNAF: a bunch of jobs failing at around 10 pm yesterday while setting up the runtime environment. The problem, also spotted by SAM, went away around 11 pm. Confirmed to be related to a shared area problem.
    • GridKA: the issue with the information system attracting jobs (GGUS:67106) is gone. It now publishes all Glue values correctly, and the rank computed from them is rightly unattractive, so the site should not be flooded any further.
  • T2 site issues:
    • NTR

10th February 2011 (Thursday)

Experiment activities:

  • MC and user activity smooth operations.
  • Issue in synchronizing VOMS and GGUS: some teamers did not get propagated to GGUS.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Some AFS slowness observed. This might explain why an increased fraction of MC jobs is failing at CERN (of the order of 10%).
  • T1
    • NIKHEF: discovered a problem with the NFS-mounted home directories of the pool accounts on some WNs. They had to restart some of the nodes and kill jobs (GGUS:67160).
    • IN2P3: picked up happily after the downtime.
  • T2 site issues:
    • NTR

9th February 2011 (Wednesday)

Experiment activities:

  • MC and user activity.

New GGUS (or RT) tickets:

  • T0: 2
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • LFC-RO: one user was not able to connect to the service (GGUS:67167). The LFC managers report that the user presented a revoked certificate. This is odd, since the certificate is valid in VOMS. Asked CA support at CERN to look at the revocation status of this certificate (Remedy:745836); a possible cross-check is sketched after this list.

  • T1
    • NIKHEF: one user's jobs are systematically killed by the batch system. Opened a GGUS ticket to ask the sysadmins there for more information (GGUS:67160).
    • NIKHEF: since 4 pm yesterday MC production jobs seem to fail due to a timeout while setting up the runtime environment.
    • GridKA: quick upgrade of dCache this morning to fix the corrupted SRM database problem spotted in the previous days.
    • GridKA: misleading information advertised by the information system was erroneously attracting pilots to the already full site (GGUS:67106). The situation now seems better; has anything changed on the GridKA side?
    • RAL: LFC/FTS/3D Oracle backend intervention. The services relying on it have been set inactive.
    • IN2P3: back from the downtime; the services there have been re-enabled.
  • T2 site issues:
    • NTR
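
For the LFC-RO item above (a certificate reported as revoked by the LFC but valid in VOMS), the following is a minimal sketch, not part of the report, of one way to cross-check the certificate's serial number against the CRL published by the issuing CA using openssl. The file paths are placeholders.

<verbatim>
#!/usr/bin/env python
"""Check whether a user certificate's serial number is listed in a CA's CRL."""
import subprocess

USER_CERT = "/path/to/usercert.pem"  # placeholder: the user certificate in PEM format
CA_CRL = "/path/to/ca-crl.pem"       # placeholder: the CA's CRL in PEM format


def cert_serial(cert):
    """Return the certificate serial number as an upper-case hex string."""
    out = subprocess.check_output(
        ["openssl", "x509", "-in", cert, "-noout", "-serial"]).decode()
    # openssl prints a line of the form "serial=0123ABCD"
    return out.strip().split("=", 1)[1].upper()


def revoked_serials(crl):
    """Return the set of serial numbers listed as revoked in the CRL."""
    out = subprocess.check_output(
        ["openssl", "crl", "-in", crl, "-noout", "-text"]).decode()
    serials = set()
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Serial Number:"):
            serials.add(line.split(":", 1)[1].strip().upper())
    return serials


if __name__ == "__main__":
    serial = cert_serial(USER_CERT)
    if serial in revoked_serials(CA_CRL):
        print("certificate serial %s IS listed in the CRL" % serial)
    else:
        print("certificate serial %s is NOT listed in the CRL" % serial)
</verbatim>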

8th February 2011 (Tuesday)

Experiment activities:

  • Running at a very high sustained rate (25K jobs in parallel) over the last few days. Introduced a load-balancing mechanism for some internal DIRAC servers that would not otherwise sustain this rate (see the sketch below).
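
A minimal illustration, not the actual DIRAC implementation, of the kind of client-side load balancing referred to above: spread calls over redundant service endpoints, picked in random order with failover. The endpoint URLs and the call being balanced are assumptions.

<verbatim>
#!/usr/bin/env python
"""Sketch of client-side load balancing across redundant service endpoints."""
import random

# Hypothetical redundant instances of one internal service.
ENDPOINTS = [
    "https://vobox01.example.org:9135/WorkloadManagement/Service",
    "https://vobox02.example.org:9135/WorkloadManagement/Service",
]


def balanced_call(call, endpoints=ENDPOINTS):
    """Try the endpoints in random order and return the first successful reply.

    'call' is the real client call, e.g. lambda url: SomeRPCClient(url).ping().
    """
    last_error = None
    for endpoint in random.sample(endpoints, len(endpoints)):
        try:
            return call(endpoint)
        except Exception as exc:   # a broad catch is acceptable in a sketch
            last_error = exc       # remember the failure and try the next endpoint
    raise RuntimeError("all endpoints failed, last error: %s" % last_error)
</verbatim>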

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1
    • PIC: observed early this morning a spike of user jobs failing at input data resolution. This is consistent with the SRM service problem also reported by SAM at around 2:00 am.
    • IN2P3: major downtime today. Banned all services from the LHCb production mask.
    • GridKA: problem with the SRM database, which got corrupted (GGUS:67079). This explains the poor performance observed since Saturday.
    • GridKA: misleading information advertised by the information system is erroneously attracting pilots to the already full site (GGUS:67106).
  • T2 site issues:
    • NTR

7th February 2011 (Monday)

Experiment activities:

  • MC production: smooth operations during the w/e

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services
  • T0
    • Observed a non-negligible fraction of user jobs failing at CERN. Investigating.
  • T1
    • RAL: transparent intervention on the remaining gridftp server to use the correct checksum.
    • SARA: the fraction of user jobs failing at NIKHEF dropped after the timeout on the dcap connection was increased (GGUS:66287).
    • GridKA: almost all data management activities there were failing (as also confirmed by SAM). Opened a GGUS ticket on Saturday (GGUS:67079). Again, this is an overload problem.
    • GridKA: misleading information advertised by the information system is erroneously attracting pilots to the already full site (GGUS:67106).
  • T2 site issues:
    • NTR

4th February 2011 (Friday)

Experiment activities:

  • MC productions: due to an internal DIRAC issue, no pilot jobs were submitted to pick up payloads. The system almost emptied.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Sent a request to the relevant CASTOR people to help our Data Manager understand/reconstruct why some files present in the high-level catalogues are not physically available on the storage (not only at CERN).
  • T1
    • RAL: after the data loss, the re-replication is currently being checked by the LHCb data manager (GGUS:66853).
    • SARA: user jobs failing yesterday. Increased the timeout on the dcap connection (GGUS:66287).
    • GridKA: user job failure rate a bit higher than usual. It seems to be a load issue.
  • T2 site issues:
    • NTR

3rd February 2011 (Thursday)

Experiment activities:

  • MC productions running smoothly almost everywhere.
  • tcmalloc library issue: an incompatibility (bug) has been discovered between some tcmalloc methods (getCpu()) and the new kernel of RHEL 5.6 (the basis of SLC 5.6). A patch in Gaudi has been created to use a minimal version of this library instead, which does not use that method and thus bypasses the incompatibility. For information, the bug is tracked here: tcmalloc bug. A small diagnostic sketch follows below.
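
As a small diagnostic, not part of the report, one can check which tcmalloc variant a running job actually has loaded by scanning its memory maps on Linux, e.g. to confirm that a patched application picked up the minimal library rather than the full one. The PID is taken from the command line; library names are matched by substring.

<verbatim>
#!/usr/bin/env python
"""Report which tcmalloc shared libraries (full or minimal) a process has mapped."""
import sys


def tcmalloc_libs(pid):
    """Return the set of mapped shared objects whose path mentions tcmalloc."""
    libs = set()
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            parts = line.split()
            # The pathname, when present, is the sixth field of a maps line.
            if len(parts) >= 6 and "tcmalloc" in parts[5]:
                libs.add(parts[5])
    return libs


if __name__ == "__main__":
    pid = int(sys.argv[1])
    libs = tcmalloc_libs(pid)
    if not libs:
        print("PID %d: no tcmalloc library mapped" % pid)
    for lib in sorted(libs):
        variant = "minimal" if "minimal" in lib else "full"
        print("PID %d: %s (%s tcmalloc)" % (pid, lib, variant))
</verbatim>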

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services
  • T0
    • Sent a request to the relevant CASTOR people to help our Data Manager understand/reconstruct why some files present in the high-level catalogues are not physically available on the storage (not only at CERN).
  • T1
    • RAL: some data has been lost on their tape. The data is available at CERN and will be re-replicated (GGUS:66853).
    • SARA: many user jobs failing. It seems related to a file access problem (GGUS:66287).
    • SARA: unscheduled restart of 4 dCache pool nodes for LHCb.
    • GridKA: also (though to a lesser extent than SARA) a user job failure rate a bit higher than usual. This is due to the high number of concurrent jobs running and accessing the storage, creating a heavy load.
  • T2 site issues:
    • NTR

2nd February 2011 (Wednesday)

Experiment activities:

  • MC productions running smoothly almost everywhere.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services
  • T0
    • Sent a request to the relevant CASTOR people to help our Data Manager understand/reconstruct why some files present in the high-level catalogues are not physically available on the storage (not only at CERN).
  • T1
    • RAL: transparent intervention on LHCb CASTOR to upgrade the gridftp server to use the right checksum.
    • NIKHEF: many user jobs failing. It seems again related to a shared area issue there (GGUS:66287).
    • GridKA: a lot of pilot jobs failing there through the CREAM CE (GGUS:66899).
  • T2 site issues:
    • NTR

1st February 2011 (Tuesday)

Experiment activities:

  • MC productions running smoothly almost everywhere. Re-stripping: a tail of jobs remains at SARA.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • Sent a request to the relevant CASTOR people to help our Data Manager understand/reconstruct why some files present in the high-level catalogues are not physically available on the storage (not only at CERN).
  • T1
    • IN2P3: many job failures on Saturday when opening files have been understood: the files were garbage-collected and no longer available in the disk cache.
    • RAL: data integrity check preventing some stub files from being migrated to tape (of the order of tens of such corrupted files spotted per day). Proposed to Shaun to use the same script currently used by CASTOR at CERN, plus mail notification to the data manager, to cross-check all these files.
    • RAL: all MC jobs failing at RAL, with pilots no longer running (GGUS:66849).
    • SARA: draining the remaining re-stripping jobs accumulated there. The process is very slow due to the limited fair share and the MC jobs competing with them.
    • GridKA: almost the entire farm is available now (after the licence problem on some CEs), and we have indeed observed a net increase in the number of jobs running there.
  • T2 site issues:
    • NTR
 -- RobertoSantinel - 02-Dec-2010
