February 2013 Reports


27th February 2013 (Wednesday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0:
  • T1:
    • SARA: downtime extended for CPU.
    • IN2P3: thanks to IN2P3 for their additional 200 TB of tape.

26th February 2013 (Tuesday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0:
  • T1:
    • IN2P3: authentication problem with one certificate used for Production (GGUS:91760). Fixed (tomcat restarted).
    • SARA: downtime.
  • DashBoard: in the "Site Groups" drop-down box, RHUL does not appear if you select "All sites". However, if you pick "Tier 0/1/2", then you do see LCG.UKI-LT2-RHUL.uk.

(http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time[]=last48&granularity[]=default&profile=LHCb_CRITICAL&group=Tier+0/1/2&site[]=LCG.UKI-LT2-RHUL.uk&type=quality)

25th February 2013 (Monday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0:
  • T1:
    • IN2P3: authentication problem with one certificate used for Production (GGUS:91760).
    • SARA: downtime.
  • DashBoard: in the "Site Groups" drop-down box, RHUL does not appear if you select "All sites". However, if you pick "Tier 0/1/2", then you do see LCG.UKI-LT2-RHUL.uk.

(http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time[]=last48&granularity[]=default&profile=LHCb_CRITICAL&group=Tier+0/1/2&site[]=LCG.UKI-LT2-RHUL.uk&type=quality)

22nd February 2013 (Friday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0:
    • ALARM ticket (GGUS:91690) for the AFS-hosted web service that was not responding: understood and fixed.
  • T1: NTR

21st February 2013 (Thursday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0:
    • ALARM ticket (GGUS:91690) for an AFS-hosted web service that is not responding. Grid jobs rely on it for configuration and setup.
    • Many failures in the CASTOR -> EOS migration because the checksums recorded in LFC and CASTOR differ.
  • T1: NTR

20th February 2013 (Wednesday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0: NTR
    • CASTOR -> EOS migration progressing; estimated to take another 6 weeks.

  • T1:
    • IN2P3: NAGIOS problem still ongoing (GGUS:91126). Log files of failed SAM probes seem to indicate that the probes are killed by the batch system (logs uploaded to the GGUS ticket).

19th February 2013 (Tuesday)

  • Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.

  • T0: NTR

18th February 2013 (Monday)

  • Ongoing activity as before: reprocessing, some prompt-processing, MC and user jobs.

  • T0: NTR

15th February 2013 (Friday)

  • Ongoing activity as before: reprocessing, some prompt-processing, MC and user jobs.

  • T0: NTR

  • T1:
    • IN2P3: Problem with SE at IN2P3 (GGUS:91557)
    • RAL: Job timeouts trying to set up environment on the worker node (internal ticket).
    • GridKa: Issue with SRM / SE / network resolved (GGUS:91474) - thanks!

14th February 2013 (Thursday)

  • Ongoing activity as before: reprocessing, some prompt-processing, MC and user jobs.

  • T0: NTR

  • T1:
    • IN2P3: NAGIOS problem still ongoing (GGUS:91126).
    • RAL: Job timeouts trying to set up environment on the worker node (internal ticket). Continuing problems with batch system (GGUS:91251).
    • GridKa: Continuing issue with SRM / SE / network (GGUS:91474). Jobs repeatedly fail to resolve input data at GridKa. Jobs at JINR wait a long time for data from GridKa before being killed by the batch system there. One strange DNS problem was fixed yesterday.

13th February 2013 (Wednesday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T0: NTR

  • T1:
    • IN2P3: NAGIOS problem still ongoing (GGUS:91126).
    • RAL: Problems with batch system came back (GGUS:91251)
    • GridKa: Possible issue with SRM / SE / network (GGUS:91474). Jobs repeatedly fail to resolve input data at GridKa. Jobs at JINR wait a long time for data from GridKa before being killed by the batch system there.

12th February 2013 (Tuesday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T0: NTR

  • T1:
    • IN2P3: NAGIOS problem still ongoing (GGUS:91126). Unclear whom to follow up with.
    • RAL: Problems with batch system seem to be resolved (GGUS:91251)

11th February 2013 (Monday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T0:
    • NTR

  • T1:
    • IN2P3: NAGIOS problem still being investigated (GGUS:91126). Also, a low-level problem with access to data (input data resolution) is under investigation by the IN2P3 contact.
    • RAL: Continuing problems with batch system (GGUS:91251)
    • FZK : Problem with FTS transfers solved over the weekend (GGUS:91315).

8th February 2013 (Friday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T0:
    • NTR

  • T1:
    • IN2P3: NAGIOS problem still being investigated (GGUS:91126)
    • RAL: Problem with scheduler resurfaced overnight (GGUS:91251)
    • PIC: Had problems with SRM timing out. Traced to a single problematic user, who was then banned.

7th February 2013 (Thursday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T0:
    • NAGIOS problem seems to be the result of malformed output from this test. Switched to IN2P3. (GGUS:91126)

  • T1:
    • RAL: Since yesterday, no jobs have been run (GGUS:91251)

6th February 2013 (Wednesday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T1:
    • NTR

5th February 2013 (Tuesday)

  • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.

  • T0:
    • NAGIOS test for swdir at IN2P3 still not running as frequently as others (once in the last 24 hours) (GGUS:91126)

  • T1:
    • RAL: FTS transfers now going through without issue. Not sure what solved it; some tests are still to be run.
    • RAL: Seen an increase in SetupProject errors. Not a major problem, but is there any possible reason for this (e.g. AFS decommissioning)?
    • IN2P3: Yesterday and last night a number of 'Bus errors' were reported across many WNs. The problem has gone away now, but we are wondering whether there was a CVMFS glitch.

4th February 2013 (Monday)

  • Activity as last week: reprocessing, prompt-processing, MC and user jobs.

  • T0:
    • An issue with a NAGIOS test not running in the last few days (GGUS:91126)
  • T1:
    • RAL: Some FTS transfers are failing due to a strange timeout during transfer, affecting only some files. Experts are investigating.

1st February 2013 (Friday)

  • Nothing new to report, just a few tickets for pilots aborting at Tier2s. New LFC is ok.
Topic revision: r1 - 2013-03-04 - JoelClosier
 