Difference: ProductionOperationsWLCGSep11Reports (1 vs. 2)

Revision 2, 2011-10-11 - JoelClosier


September 2011 Reports


30 September 2011 (Friday)

Experiment activities:

  • Experiment activities
    • Reconstruction and stripping
    • Starting reprocessing activities

  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751), waiting for user reply
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)

  • T1 sites:
    • IN2P3: Increased number of stalled jobs (GGUS:74733); an upgrade of the CREAM-CE has fixed the problem.
    • PIC: Problems with staging of files: the staging request returns a non-zero exit code but also the string "DONE"; the files are actually staged.
    • Gridka: Problems observed for jobs with many input files when accessing storage via protocol; mostly user jobs are affected.
    • Gridka: Missing results from Nagios tests for the CE; fixed yesterday around 8 PM.
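
A minimal sketch of how the PIC staging symptom above could be worked around on the client side, assuming the exit code and the textual output of the staging call are both available. The function name `staging_succeeded` and the output format are hypothetical and not taken from any LHCb or grid tool; the only behaviour taken from the report is that a non-zero exit code together with the string "DONE" still means the files were staged.

```python
def staging_succeeded(exit_code: int, output: str) -> bool:
    """Decide whether a staging request should be treated as successful.

    Assumption (hypothetical helper): the caller has captured the exit
    code and the text output of the staging command. A zero exit code is
    a success; a non-zero exit code is still accepted when the output
    contains the word DONE, as observed at PIC.
    """
    if exit_code == 0:
        return True
    # Split on whitespace so that only the standalone word DONE matches.
    return "DONE" in output.split()


if __name__ == "__main__":
    # Non-zero exit code, but the output reports DONE: treat as staged.
    print(staging_succeeded(1, "request status: DONE"))
    # Non-zero exit code and no DONE in the output: treat as failed.
    print(staging_succeeded(1, "request status: FAILED"))
```

Such a check only masks the symptom; the underlying exit code problem still needs to be fixed at the site.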

29 September 2011 (Thursday)

Experiment activities:

  • Experiment activities
    • Reconstruction and stripping
    • Finishing on validation of reprocessing applications

  • T0
    • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751)
    • Nagios: No probes being sent to prod-lfc-lhcb.ro for the last 48 hours (GGUS:74775)

  • T1 sites:
    • IN2P3: Increased number of stalled jobs (GGUS:74733). Pilots are submitted correctly but finish after a few minutes; the payload, which continues to run, is subsequently found to have no pilot attached and is set to failed.
    • PIC: Problems with staging of files: the staging request returns a non-zero exit code but also the string "DONE"; the files are actually staged.
    • Gridka: Problems observed for jobs with many input files when accessing storage via protocol; mostly user jobs are affected.

28 September 2011 (Wednesday)

Experiment activities:

  • Experiment activities
    • Reconstruction and stripping
    • Finishing on validation of reprocessing applications

  • T0
    • Increased number of aborted pilots since the weekend (GGUS:74657); issue fixed and ticket verified.

  • T1 sites:
    • IN2P3: Increased number of stalled jobs (GGUS:74733). Pilots are submitted correctly but finish after a few minutes; the payload, which continues to run, is subsequently found to have no pilot attached and is set to failed.
    • PIC: Problems with staging of files: lcg-bringonline returns an error message while staging files, but the files are actually staged.
    • Gridka: Problems observed for jobs with many input files; mostly user jobs are affected.

27 September 2011 (Tuesday)

Experiment activities:

  • Experiment activities
    • Reconstruction and stripping
    • Finishing on validation of reprocessing applications

  • T0
    • Increased number of aborted pilots since the weekend (GGUS:74657)

  • T1 sites:
    • SARA: Problem with pilot submission to CREAM-CEs (GGUS:74639), ticket closed and verified
    • SARA: Problems with file access via protocol (GGUS:74416), ticket closed and verified

26 September 2011 (Monday)

Experiment activities:

  • Experiment activities
    • Reconstruction and stripping
    • Finishing on validation of reprocessing applications

  • T0
    • Increased number of aborted pilots since the weekend (GGUS:74657)

  • T1 sites:
    • SARA: Problem with pilot submission to CREAM-CEs (GGUS:74639)
    • SARA: Problems with file access via protocol (GGUS:74416); nodes were upgraded, and the fix can be validated once the pilot submission problem above is resolved.

23 September 2011 (Friday)

Experiment activities:

  • Experiment activities
    • Reconstruction and stripping
    • Finishing on validation of reprocessing applications

  • T0
    • Very low number of running jobs observed; fixed yesterday afternoon.
    • Migration to castor for one raw file pending (GGUS:74601)

  • T1 sites:
    • SARA: Problems with file access via protocol (GGUS:74416), downtime today for upgrading pool nodes

22 September 2011 (Thursday)

Experiment activities:

  • Reconstruction and stripping
  • Validation of reprocessing applications has started

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1

Issues at the sites and services

  • CERN: Very low number of running jobs observed since this morning; currently under investigation.
  • SARA: Problems with file access via protocol (GGUS:74416), upgrading pool nodes

21 September 2011 (Wednesday)

Experiment activities:

  • Reconstruction and stripping

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1

Issues at the sites and services

  • Gridka: Problems with access to local LFC server, due to broken certificates (GGUS:74476), fixed
  • SARA: Problems with file access via protocol (GGUS:74416), upgrading pool nodes

20th September 2011 (Tuesday)

Experiment activities:

  • Reconstruction and stripping

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1

Issues at the sites and services

  • Gridka: Problems with access to local LFC server, due to broken certificates (GGUS:74476)
  • SARA: Problems with file access via protocol (GGUS:74416)

19th September 2011 (Monday)

Experiment activities:

  • Reconstruction and stripping

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1

Issues at the sites and services

  • SARA: Problems with file access via protocol; several pool nodes were rebooted during the weekend (GGUS:74416). NIKHEF is also affected by the same problem (GGUS:74427). Another ticket was opened because, after the removal and archiving campaign executed over the weekend, the disk pools in front of tape storage became full (GGUS:74441).
  • CNAF: Problems with file access, which turned out to be a problem with the GPFS file system; it was restarted on Sunday (GGUS:74428).
  • dCache sites were asked to re-allocate disk space from "old" LHCb disk space tokens after the removal campaign (GGUS:74365, GGUS:74366, GGUS:74367, GGUS:74368).

16th September 2011 (Friday)

Experiment activities:

  • Reconstruction and stripping, LHCb magnet polarity change today

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1

Issues at the sites and services

  • SARA: Problem with network tonight, storage and switches were not communicating properly.
  • IN2P3: Shared area problem caused by an overloaded volume; the LHCb volume was moved and the problem is fixed (GGUS:74334).

15th September 2011 (Thursday)

Experiment activities:

  • Reconstruction and stripping.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1

Issues at the sites and services

14th September 2011 (Wednesday)

Experiment activities:

  • Reconstruction and stripping. The trigger was changed before the packages were installed on the grid; as a result, all jobs failed at GRIDKA and CNAF. Fixed by installing the required packages.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

13th September 2011 (Tuesday)

Experiment activities:

  • Low level activity; reconstruction and stripping.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • GRIDKA: Pilots aborted (GGUS:74265) - fixed very quickly.
  • CIC: Downtime notification fixed (167 notifications today)

12th September 2011 (Monday)

Experiment activities:

  • We had only a few short runs during the weekend.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 3

Issues at the sites and services

  • CERN:
    • (GGUS:74175) Several files that are supposed to be on Castor are not accessible

  • CIC: We received no downtime notifications last week.

9th September 2011 (Friday)

Experiment activities:

  • Data processing. Due to an error during the software distribution procedure, the latest tag for the Conditions DB was missing in the Oracle DB. As a result, all reconstruction jobs failed yesterday. Fixed within a few hours.

New GGUS (or RT) tickets:

  • T0: 2
  • T1: 0
  • T2: 5

Issues at the sites and services

  • CERN:
    • (GGUS:74169) A few jobs failed with "Payload process could not start after 10 seconds"
    • (GGUS:74175) Several files that are supposed to be on Castor are not accessible

7th September 2011 (Wednesday)

Experiment activities:

  • Waiting for data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • SARA: We have a ticket (GGUS:73244) that was opened a month ago; it seems we need help from experts.

6th September 2011 (Tuesday)

Experiment activities:

  • Waiting for data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • Nothing to report

5th September 2011 (Monday)

Experiment activities:

  • Waiting for data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T1 :
    • RAL : Network connection problem (Solved)

2nd September 2011 (Friday)

Experiment activities:

  • Processing and stripping are finished. Nothing to report.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

1st September 2011 (Thursday)

Experiment activities:

  • Processing and stripping are finished. Nothing to report.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T1 :
    • IN2P3 : No answer to the GGUS ticket from yesterday. Any reason? (GGUS:73959)
 

Revision 1, 2011-07-01 - unknown


September 2011 Reports


-- RobertoSantinel - 01-Jul-2011

 