November 2010 Reports


30th November 2010 (Tuesday)

Experiment activities: Reprocessing going at full steam. Problems with merging during the weekend (LHCb application problem).

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • Requested CASTOR to change the DN of the LHCb data manager to: "/C=BR/O=ICPEDU/O=UFF BrGrid CA/O=CBPF/OU=CAT/CN=Renato Santana"
    • The issue with the CASTOR LHCBDST service class (GGUS:64693) has not been fully understood. In touch with S. Ponce: it could be related to the same cause that brought down disk servers at RAL months ago. Each merging job requires tens of small (~100 MB) files to be downloaded; as a consequence the system was running at its limits and further requests (e.g. from the FAILOVER) were just piling up (snowball effect). After banning the SE for a few hours yesterday afternoon, it recovered and started to work again (we set the slot limit to 80; see the sketch after this list).
    • Requested to change LHCBRDST from T1D1 to T1D0 at CERN only
  • T1 site issues:
    • RAL: Requested CASTOR to change the DN of the LHCb data manager to: "/C=BR/O=ICPEDU/O=UFF BrGrid CA/O=CBPF/OU=CAT/CN=Renato Santana"
    • GRIDKA: observed thousands of timeouts retrieving tURLs from the LHCb_DST space token last night (GGUS:64772)
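
A minimal sketch of the slot limit mentioned above, added for illustration and not part of the original report: a semaphore caps concurrent downloads at 80 so that merging inputs and FAILOVER retries cannot pile up on the service class (the snowball effect). This is not LHCbDirac code; the download function and file names are hypothetical.

    # Hypothetical sketch: cap concurrent downloads from a storage element so
    # that queued requests cannot snowball, as described for GGUS:64693 above.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    MAX_SLOTS = 80                       # slot limit quoted in the report
    slots = threading.Semaphore(MAX_SLOTS)

    def download(lfn):
        """Placeholder for fetching one ~100 MB merging input file."""
        with slots:                      # blocks once all 80 slots are in use
            print("downloading", lfn)
            # ... the actual transfer would happen here ...

    def fetch_all(lfns):
        # Even a worker pool larger than MAX_SLOTS cannot exceed the slot limit.
        with ThreadPoolExecutor(max_workers=200) as pool:
            list(pool.map(download, lfns))

    if __name__ == "__main__":
        fetch_all(["/lhcb/data/file_%d.dst" % i for i in range(10)])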

29th November 2010 (Monday)

Experiment activities: Reprocessing going at full steam. Problems with merging during the weekend (LHCb application problem).

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Change the DN of the LHCb data manager to: /C=BR/O=ICPEDU/O=UFF BrGrid CA/O=CBPF/OU=CAT/CN=Renato Santana
    • The CASTOR LHCBDST service class was unusable for almost the whole weekend (GGUS:64693). The intense merging activity and the very aggressive FAILOVER system in LHCb led to a situation with many queued transfers, and the two disk servers behind the service class could not cope with the load.
  • T1 site issues:
    • SARA: problem with a USER file registered with the wrong role (GGUS:64659)

26th November 2010 (Friday)

Experiment activities: Reprocessing has started and is going well.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • SVN: issue with committing.
  • T1 site issues:
    • SARA: [SE][srmRm][SRM_AUTHORIZATION_FAILURE] Permission denied (GGUS:64659)
    • IN2P3: we had a discussion with IN2P3 this morning and agreed on a plan. Since we see that some fraction of the jobs finish successfully, we asked them to remove from our share the WNs that were not behaving properly, apply the modifications to these "faulty" WNs, and then put them back into production. The timeout has been increased in our LHCbDirac software.

25th November 2010 (Thursday)

Experiment activities: Validation in progress and going well.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0
    • LFC: if the expert is not around, nothing happens. Is this acceptable? (CT727281 ON HOLD [Fwd: Strange SE in LFC])
  • T1 site issues:

24th November 2010 (Wednesday)

Experiment activities: Validation in progress and going well (even IN2P3 is in business) !!!

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:

23rd November 2010 (Tuesday)

Experiment activities: Quiet weekend. New releases of DaVinci and LHCbDirac are available; validation will be performed this afternoon. How should we proceed to test cernvmfs at the T1 sites?

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:

22nd November 2010 (Monday)

Experiment activities: Quiet weekend. No new release of DaVinci was available, so no tests were carried out as planned. SAM tests for the CE and CREAMCE showed on Saturday a problem with the libraries used by the middleware clients to submit jobs. The problem was fixed on Sunday morning. Saturday's tests were not running on the WNs, and sites might have noticed this as missing results.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • CNAF: network problem at the centre since Saturday evening. People worked on it during the whole weekend; the problem was found to be a faulty network card on the core switch, not detected by the firmware.

19th November 2010 (Friday)

Experiment activities: Validation of the reprocessing in progress. Migrated the SAM critical tests to a new OS VOBOX; some temporary errors (not related to the site services) may have been experienced during this intervention this morning. Situation restored.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • CNAF: problem with LCAS authorization on the GridFTP servers. There was an update on Nov 17 and something went wrong (fixed).
    • RAL: observed issues with the SRM at RAL yesterday (read-only file system, cannot write).
    • IN2P3: shared area issues reported by SAM.

18th November 2010 (Thursday)

Experiment activities: Validation of the reprocessing in progress.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • RAL: the checksum is available only for new files, so if we check "old" files the checksum is 0 (see the sketch after this list).
    • CERN: the web server serving the LHCb distribution application was not available.
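
A minimal illustration of the checksum caveat above, added here and not part of the original report. It assumes the storage reports Adler32 checksums and stores 0 for files registered before checksumming was enabled; the file name and catalogue value are hypothetical.

    # Hypothetical sketch: verify a local copy against a catalogue checksum,
    # treating a stored value of 0 as "not recorded" rather than a mismatch.
    import zlib

    def adler32_of(path):
        """Compute the Adler32 checksum of a file as an 8-digit hex string."""
        value = 1
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                value = zlib.adler32(chunk, value)
        return format(value & 0xFFFFFFFF, "08x")

    def verify(path, catalogue_checksum):
        if not catalogue_checksum or int(catalogue_checksum, 16) == 0:
            return "checksum not recorded for this (old) file -- skipping comparison"
        return "OK" if adler32_of(path) == catalogue_checksum else "MISMATCH"

    # Example: an "old" file whose stored checksum is 0 is skipped, not failed.
    print(verify("local_copy.dst", "00000000"))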

17th November 2010 (Wednesday)

Experiment activities: Validation of the reprocessing in progress.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • SARA: data unavailable (GGUS:64348)

16th November 2010 (Tuesday)

Experiment activities: Validation of the reprocessing in progress.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • RAL: file removal failed for USER data (GGUS:64265)
    • SARA: downtime, BUT the streaming for CONDDB was still active.

15th November 2010 (Monday)

Experiment activities: Validation of the reprocessing in progress.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Debugging the issue with accessing data on CASTOR via xrootd. A fix (cleanup of the gridmap-file-like configuration in xrootd) was applied on Friday and seems to improve the situation, so we will confirm over the coming days whether it fixes one part of the problem.
    • Opened another ticket (uncorrelated with the xrootd issue) to track down some timeouts accessing files (GGUS:64258).
  • T1 site issues:

12th November 2010 (Friday)

Experiment activities: Massive clean up campaign launched across all T1 sites for old production and user spaces.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Debugging the issue with accessing data on CASTOR. Decision taken at the LHCb production meeting: if the problem is still there by Sunday evening, LHCb will ask on Monday morning to roll back the intervention held last Monday, even if, in principle, there is no relation with the problem. At least we will then have clear proof of that.
  • T1 site issues:
    • GRIDKA: observed connection timeout errors on ~80% of the transfers to the LHCb_MC-M-DST space token. Under investigation; a GGUS ticket will eventually be submitted by the shifter.

11th November 2010 (Thursday)

Experiment activities: MC campaign still going on, user jobs, reconstruction tail.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • Requested a SIR for the identified problem with the xrootd redirector, which has been affecting all our user jobs for three days, since the intervention on Monday. In particular, the PM must address why this potential problem was overlooked in the risk assessment of an intervention that was supposed to be transparent.
    • Users are still experiencing issues accessing data (GGUS:64166).
  • T1 site issues:
    • RAL: problem accessing files with rootd (GGUS:64163)

10th November 2010 (Wednesday)

Experiment activities: MC campaign still going on, user jobs, reconstruction tail.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Main issue: the data access problem reported in GGUS:64043 is still ongoing (re-opened this morning). The plot below shows the number of user jobs failing on the grid per site in the last 24 hours (plot: CERN_24h.png).
  • T1 site issues:
    • NTR

9th November 2010 (Tuesday)

Experiment activities: MC campaign still going on, user jobs, reconstruction tail.

New GGUS (or RT) tickets:

  • T0: 2
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0
    • GGUS:64043 (opened yesterday): files could not be accessed. CASTOR support is now asking to close it, but we still see some failures. Re-checking on our side.
    • GGUS:63933 (opened 5 days ago). Shared area slowness causing jobs to fail. Seems to be understood.
  • T1 site issues:
    • NTR

8th November 2010 (Monday)

Experiment activities: MC campaign all weekend.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 7

Issues at the sites and services

  • T0
    • Transparent intervention on CASTOR (10-12 UTC). Maybe not really transparent? GGUS:64043
  • T1 site issues:

5th November 2010 (Friday)

Experiment activities: Mostly user jobs, a few MC productions and the reconstruction tail.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:

4th November 2010 (Thursday)

Experiment activities: Mostly user jobs, a few MC productions and the reconstruction tail.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • RAL: disk server out of production for a few hours (GGUS:63902)
    • PIC: as reported by PIC yesterday, we are experiencing shared area slowness.
    • NIKHEF: GGUS:63816 (opened yesterday). Pilots were aborted because jobs exceeded the memory limits. Closed now.

3rd November 2010 (Wednesday)

Experiment activities: Mostly MC and user jobs.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • NIKHEF: pilots aborting at NIKHEF (GGUS:63816)
    • IN2P3: GGUS:63559 (opened Oct 28th). Many pilots aborted at the CREAM CE; it is not clear why. It does not affect the real jobs much.

2nd November 2010 (Tuesday)

Experiment activities: Weekend without Data taking. Reconstruction proceeding. Few MC productions.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • RAL: GGUS:63468 (Opened 26/10) SRM service instability. RAW data had size "zero". Waiting for an update.

1st November 2010 (Monday)

Experiment activities: Weekend without Data taking. Reconstruction proceeding. Few MC productions.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Last Wednesday's ticket GGUS:63514: disk server problem. Files set "invisible" for the user, who can use replicas. Waiting for an update.
  • T1 site issues:
    • IN2P3: "seg-faulting" problem fixed. (GGUS:63573 and GGUS:62732 closed)
    • RAL: GGUS:63468 (Opened 26/10) SRM service instability. RAW data had size "zero".
 

-- RobertoSantinel - 29-Jan-2010
