November 2012 Reports


30th November 2012 (Friday)

  • Reprocessing: running the last jobs in the RAL and Gridka "groups"; restarting with new calibration/files on 10 Dec
  • Prompt reconstruction: CERN + 5 Tier2 sites
  • MC productions at T2s and T1s if resources available

29th November 2012 (Thursday)

  • Reprocessing: running the last jobs in the RAL, Gridka and Cnaf "groups"; restarting with new files on 10 Dec
  • Prompt reconstruction: CERN + 5 Tier2 sites
  • MC productions at T2s and T1s if resources available

  • T0: NTR

  • T1:
    • FTS transfer failures to Gridka disk from different sites (GGUS:88906)

28th November 2012 (Wednesday)

  • T0: NTR

  • T1:
    • FTS transfer failures to Gridka disk from different sites (GGUS:88906). It should be mentioned that we did not receive the CIC notification of the downtime; why not?

27th November 2012 (Tuesday)

  • T0: NTR

  • T1:
    • FTS transfer failures to Gridka disk from different sites (GGUS:88906)

26th November 2012 (Monday)

  • Reprocessing: running last jobs at RAL, Gridka and Cnaf "groups". New conditions expected next week.
  • Prompt reconstruction: 50TB collected in last 72 hours
  • MC productions at T2s and T1s if resources available

  • T0: NTR

  • T1:
    • FTS transfer failures to Gridka disk from different sites

23rd November 2012 (Friday)

  • General: some sites have set a wall clock time limit equal to the CPU time limit. This makes any matching based on the CPU work requirement useless, as the wall-clock limit will always fire first (unless the job has an efficiency > 1). Sites don't seem to understand the issue. Not a major problem in most cases, since our job efficiency is good, but efficiency can be degraded by events outside our control, e.g. a too heavily loaded machine or too high an overcommitment of slots versus cores.
  • Reprocessing: ramping up again after RAL restart and sites fixing their BDII publication of queues.
  • Prompt reconstruction: idle as LHC was not delivering
  • MC productions at T2s and T1s if resources available

  • T0: NTR

  • T1:
    • Jobs running fine at RAL. Some failures at the beginning (CVMFS cache filling most likely)
    • GRIDKA transfers still in the 20% failure range (job timeout after 1 hour)
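The matching problem described in the General item above can be sketched in a few lines of Python (hypothetical function and parameter names, not DIRAC code): a job needing W CPU-seconds at efficiency e < 1 occupies W/e wall-clock seconds, so when a site sets the wall-clock limit equal to the CPU limit, the wall-clock limit always fires first.

```python
def job_fits(cpu_work_needed, cpu_limit, wall_limit, efficiency):
    """Return True if a job needing `cpu_work_needed` CPU-seconds can
    finish within both the CPU and wall-clock limits of a slot.
    `efficiency` = CPU time / wall-clock time (<= 1 for one process)."""
    wall_needed = cpu_work_needed / efficiency  # wall clock the job will occupy
    return cpu_work_needed <= cpu_limit and wall_needed <= wall_limit

# With wall_limit == cpu_limit, any job with efficiency < 1 is rejected
# (or killed) on wall clock before it ever reaches its CPU limit:
print(job_fits(cpu_work_needed=86400, cpu_limit=86400,
               wall_limit=86400, efficiency=0.95))   # False: ~90947 s wall needed
# A sensible wall-clock margin lets the same job through:
print(job_fits(cpu_work_needed=86400, cpu_limit=86400,
               wall_limit=100000, efficiency=0.95))  # True
```

This also shows why the problem is latent: at 100% efficiency the two limits coincide, so sites with well-behaved jobs never notice the misconfiguration.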

22nd November 2012 (Thursday)

  • General: some sites are reporting wrong information about their queues in the BDII (queue length of 999999, or SI00 reported as 0). LHCb will submit GGUS tickets to all of them asking for the information to be fixed. This mostly concerns Tier2s. List of LHCb queue characteristics.
  • Reprocessing: slowing down as converging with prompt processing
  • Prompt reconstruction: NTR
  • MC productions at T2s and T1s if resources available

  • T0: NTR

  • T1:
    • Best wishes to RAL!
    • Still 30-40% errors in transfers from Tier0/1 to GRIDKA (transfer timeout without a single byte transferred). This is not a showstopper, but it is a nuisance, and this background of failures can mask other problems...
    • Peak of FTS transfer failures to/from IN2P3 around 23:00 UTC last night. Recovered rapidly...
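The bogus BDII values in the General item above (queue length 999999, SI00 of 0) can be caught with a simple sanity filter before submitting GGUS tickets. The GLUE 1.3 attribute names below are the standard published ones; the check itself is an illustrative sketch, not LHCb's actual validation code.

```python
# Sanity-check GLUE 1.3 queue attributes published in the BDII.
# A 999999 wall-clock limit or SI00 == 0 makes CPU-work-based job
# matching meaningless, so such queues should be flagged for a ticket.
BOGUS_SENTINEL = 999999

def queue_problems(queue):
    """Return a list of human-readable problems for one published queue."""
    problems = []
    wall = queue.get("GlueCEPolicyMaxWallClockTime")
    si00 = queue.get("GlueHostBenchmarkSI00")
    if wall is None or wall >= BOGUS_SENTINEL:
        problems.append("wall-clock limit missing or set to a sentinel value")
    if not si00:  # missing or zero
        problems.append("SI00 benchmark missing or zero")
    return problems

print(queue_problems({"GlueCEPolicyMaxWallClockTime": 999999,
                      "GlueHostBenchmarkSI00": 0}))     # two problems flagged
print(queue_problems({"GlueCEPolicyMaxWallClockTime": 4320,
                      "GlueHostBenchmarkSI00": 2500}))  # []
```

Either bad value alone is enough to break matching: a sentinel wall-clock limit defeats any length check, and a zero benchmark makes every CPU-work conversion divide into nonsense.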

21st November 2012 (Wednesday)

  • General: pilot filling mode disabled in order to avoid problems with CPU limit (until understood)
  • Reprocessing: NTR
  • Prompt reconstruction: NTR
  • MC productions at T2s and T1s if resources available

  • T0: NTR

  • T1: NTR... Best wishes to RAL!

20th November 2012 (Tuesday)

  • Reprocessing: NTR
  • Prompt reconstruction: NTR
  • MC productions at T2s and T1s if resources available

  • T0: problem of ghost CREAM jobs understood, but no patch is available yet. It does not apply only to CERN; we have many tickets open on this topic. Thanks to Ulrich & Co. for having dug into this with the CREAM developers! Let's hope it is fixed soon. In the meantime, sites are requested to clean up their redundant jobs...

  • T1:
    • Issue with the CPU power estimate at some sites (Tier1s and Tier2s). It makes the remaining CPU work estimated by the pilot unreliable, so we disabled the filling mode. It is still problematic at some sites even for the initial job. An agreement is needed on how to estimate the CPU power of a job slot, in particular when hyperthreading is on. This is a long-standing issue. GDB?
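The unreliable estimate behind the disabled filling mode can be sketched as follows (hypothetical names; the real DIRAC pilot logic differs): the pilot multiplies the remaining wall-clock time by the slot's advertised per-core CPU power, so an inflated power figure, e.g. hyperthreaded logical cores benchmarked as full cores, makes the estimate too optimistic and a second job matched into the slot runs out of time.

```python
def remaining_cpu_work(queue_wall_limit, elapsed_wall, cpu_power_hs06):
    """Estimate the CPU work (HS06-seconds) still available in a slot.
    `cpu_power_hs06` is the per-core power the site advertises; if it is
    overestimated (e.g. hyperthreading counted as full cores), the
    returned figure is too optimistic and a filled-in job will be killed."""
    remaining_wall = max(0.0, queue_wall_limit - elapsed_wall)
    return remaining_wall * cpu_power_hs06

# Site advertises 10 HS06/core, but with hyperthreading on the effective
# power is closer to 6 HS06: the pilot believes 60000 HS06-seconds remain
# when only ~36000 are really available.
advertised = remaining_cpu_work(86400, 80400, 10.0)  # 60000.0
effective  = remaining_cpu_work(86400, 80400, 6.0)   # 36000.0
print(advertised, effective)
```

This is why disabling filling mode is the safe stopgap: the first job per pilot never relies on this estimate, only follow-up jobs do.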

19th November 2012 (Monday)

  • Reprocessing: new set of runs launched during the WE (55000 files)
  • Prompt reconstruction: 7,000 files waiting for reconstruction after this weekend's good LHC performance (CERN + a few Tier2s)
  • MC productions at T2s and T1s if resources available

  • T0: Castor intervention on Wednesday OK

  • T1:
    • GridKa:
      • Still having significant problems with FTS transfers to GridKa. Investigations still ongoing. (GGUS:88425). Error: globus_gass_copy_register_url_to_url transfer timed out. The timeouts are related neither to a source (happens from CERN + 5 Tier1s) nor to a gridftp server (same file succeeds to same server a few minutes later).
    • NL-T1: spikes of data upload failures from NIKHEF to SARA (very short duration), coincides with spikes of uploads (SRM overload?)
    • Recovering FTS transfers "lost" in the STAR-STAR unknown channel (fix on the DIRAC side)

16th November 2012 (Friday)

  • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
  • New processing runs should be starting later today
  • MC productions at T2s and T1s if resources available

  • T0:

  • T1:
    • GridKa:
      • Still having significant problems with FTS transfers to GridKa. Investigations still ongoing. (GGUS:88425)

15th November 2012 (Thursday)

  • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
  • MC productions at T2s and T1s if resources available

  • T0:

  • T1:
    • GridKa:
      • Still having significant problems with FTS transfers to GridKa. Now pointing towards overloaded storage using gridFTP2. (GGUS:88425)

14th November 2012 (Wednesday)

  • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
  • MC productions at T2s and T1s if resources available

  • T0:

  • T1:
    • GridKa:
      • Still having significant problems with FTS transfers to GridKa. Now pointing towards overloaded storage using gridFTP2. (GGUS:88425)
    • SARA:
      • SARA_BUFFER seemed to go offline this morning for ~2 hours. No ticket as it was fixed without intervention from VO side.

13th November 2012 (Tuesday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:

  • T1:
    • GridKa:
      • Still having significant problems with FTS transfers to GridKa (GGUS:88425)
      • Had a large peak of FTS transfers to GridKa-RAW this morning - we wondered if the write access was suffering due to the tuning for staging?

12th November 2012 (Monday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:

  • T1:
    • GridKa:
      • A number of timeouts when transferring. Possibly only restricted to a few pool nodes? (GGUS:88425)

9th November 2012 (Friday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:

  • T1:
    • RAL:
      • Disk server failure in the LHCb_DST space, recovered after the server restart.

8th November 2012 (Thursday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:
    • CERN:
      • LHCb EOS storage assignment was increased from 250 TB to 450 TB, but the LHCb quota was not increased, making the new capacity unavailable. Now solved by increasing the LHCb quota.

  • T1:
    • RAL:
      • General power cut, still banned for usage by LHCb

7th November 2012 (Wednesday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0: NTR

  • T1:
    • RAL:
      • General power cut, banned for usage by LHCb

6th November 2012 (Tuesday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:
    • CERN:
      • Castor intervention 1pm-4pm today, 6th Nov; CERN/SRM storage will be banned for read-write access for all activities.
  • T1:
    • Gridka:
      • 2 CEs moved from WMS access to direct access; the VO-lhcb-pilot tag is set to allow pilots with VOMS Role=prod. This solved yesterday's problem of aborted pilots at Gridka.

5th November 2012 (Monday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:
  • T1:
    • IN2P3:
      • data access failures over the weekend, retries successful
    • Gridka:
      • data access failures, agreed to increase the number of gridftp movers from 5 to 10 in each LHCb pool

2nd November 2012 (Friday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:
  • T1:
    • CNAF: yesterday some staging requests that were stuck have been cleaned (promptly fixed by the site, no need to open a ticket!)

1st November 2012 (Thursday)

  • Reprocessing at T1s and "attached" T2 sites
  • User analysis at T0/T1 sites
  • Prompt reconstruction at CERN + 4 attached T2s
  • MC productions at T2s and T1s if resources available

  • T0:
    • CERN:
      • 4 files with bad checksum on EOS (GGUS:87702): closed, as reported yesterday
      • redundant pilots (GGUS:87448): still verifying whether the problem is solved
      • 252 FTS jobs in the system that some time ago landed in the STAR-STAR channel as ATLAS jobs (GGUS:87686): still to be fixed on the LHCb side.
  • T1:
    • SARA: (GGUS:87975) yesterday FTS submissions to SARA failed, promptly fixed by the site
Topic revision: r2 - 2012-12-05 - StefanoPerazzini