Daily WLCG Operations calls :: collection of LHCb reports

Starting from April 2009, this twiki collects all the LHCb reports given to the WLCG operations calls at 3pm Geneva time on Mondays and Thursdays. These calls are attended by the LHCb Grid expert and/or Concezio Bozzi (on behalf of LHCb). Connection details: for remote participation we use the Vidyo system. Instructions can be found here (deprecated: Alcatel URL). These reports must be duly compiled by the GEOC as part of their mandate.

Previous reports are available per year:
2018 2017 2016 2015 2014 2013 2012 2011 2010 2009


26th September 2022

  • No significant issues to report - business as usual.

19th September 2022

  • Issues:
    • SARA:
      • Issues with pilot submission this morning. A restart of services seems to have solved the problem (GGUS:158934)
    • PIC:
      • Pilot submission problems (GGUS:158857).
        • Investigations ongoing. Error found in CE logs but cause is unknown at present
    • RRCKI:
      • Pilots getting killed due to hitting 8GB virtual memory limit
        • Question: Is there a general WLCG policy for virtual memory limits?
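
For context, a minimal shell sketch of how such a limit can be inspected (the 8 GB figure is the one reported above; how RRCKI actually enforces it is not known here):

```shell
# Print the virtual-memory (address space) soft limit visible to this
# shell, in kB ("unlimited" if none set). Batch systems often enforce
# such a limit per job slot, and pilots exceeding it get killed.
ulimit -v
# An 8 GB cap like the one reported above would correspond to:
#   ulimit -v 8388608    # 8 * 1024 * 1024 kB
```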

12th September 2022

  • Issues:
    • IN2P3:
      • Another problem with Pilot Submission, similar to last week apparently (GGUS:158856)
    • PIC:
      • Pilot submission problems (GGUS:158857).
        • Investigations ongoing. Seems to be only a problem on one CE

5th September 2022

  • Issues:
    • IN2P3:
    • NIKHEF:
      • Pilot submission problems (GGUS:158718). Investigations ongoing.

29th August 2022

  • Apologies - Mark S may not be able to make the meeting

  • Issues:
    • RAL: Problems with jobs uploading data (GGUS:158574)

    • SARA: Problems with data access. Still investigating on the LHCb side, but is there any known problem on the SARA side?

22nd August 2022

  • Issues:
    • RAL:
      • New hardware added to help with the slow gateways problem (GGUS:156492)
      • Echo deletion problems (GGUS:155120)
      • Issue of the XRootD proxy serialising transfers identified and being worked on

    • SARA: Timeout issues found to be due to a limit on simultaneous connections (GGUS:153653)
      • Proposed solution: increase the number of connections and use the '-n' flag for hadd
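
As an illustrative sketch only (the file names and the batch size here are assumptions, not values taken from the ticket), the proposed hadd workaround looks like:

```
# Merge ROOT files while capping how many source files (and hence
# simultaneous storage connections) hadd holds open at once: with
# "-n 10" at most 10 inputs are open at a time instead of all of them.
hadd -n 10 merged.root input_*.root
```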

15th August 2022

  • Issues:
    • Any updates on the connection timeout issues at SARA (GGUS:153653) ?
      • Current status is that it is reproducible on both IPv6 and IPv4 on a test VM

8th August 2022

  • Apologies - Mark S is away so can't join the meeting.

  • Issues:
    • Work ongoing on connection timeout issues at SARA (GGUS:153653)
      • Current status is that it is reproducible on both IPv6 and IPv4 on a test VM

1st August 2022

  • Activity:
    • DIRAC in downtime from tonight for ~24 hours for updates to MySQL DBs

  • Issues:
    • Network configuration changed between PIC and CNAF to fix asymmetry (GGUS:157955, GGUS:158004)
      • Looks to have worked but will keep an eye on it
    • Ongoing connection timeout issues at SARA (GGUS:153653)
      • Possibly isolated to IPv6, but tests ongoing
      • Becoming a significant issue, as data needs to be transferred to other sites for processing
    • RAL slow deletion problem update (GGUS:155120)
      • Request to add a timeout to gfal removal requests, but this is not available in the gfal client at present

25 July 2022

  • Issues:
    • Work ongoing on the issues transferring between PIC and CNAF/INFN (GGUS:157955, GGUS:158004)
      • Asymmetric routing between CNAF and PIC identified.
      • WLCG Network Throughput team is following up with GARR, CNAF and PIC
    • Significant transfer problems to RAL on Friday due to CEPH keyring issues (GGUS:158120). Solved in good order.

18 July 2022

  • Issues:
    • Work ongoing on the issues transferring between PIC and CNAF/INFN (GGUS:157955, GGUS:158004)
      • Asymmetric routing between CNAF and PIC identified.

11 July 2022

  • Issues:
    • Issues transferring between PIC and CNAF/INFN.
      • Investigations on the PIC side seem to show routing problems to some IPs (GGUS:157955)
      • Ticket now opened against CNAF/INFN to see if they can verify this. (GGUS:158004)

27 September 2021

  • Activity:
    • MC and WG productions; Restripping ongoing, user jobs
  • Issues:
    • NTR

20 September 2021

  • Activity:
    • MC and WG productions; user jobs
  • Issues:
    • NTR

13 September 2021

  • Activity:
    • MC and WG productions; user jobs
  • Issues:
    • NTR

6 September 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • RAL: Found a workaround for transfer issues at RAL. Problems caused by a script sourced on RAL WNs that sets GFAL (and other) env variables (GGUS:153532)

30 August 2021

  • Activity:
    • MC and WG productions; Staging Ongoing; User jobs.
  • Issues:
    • RAL: Still have transfer issues but progress is being made (GGUS:153532)
    • CNAF: Cannot access some files locally. Investigations ongoing (GGUS:153578)
    • If you Google 'GOCDB' you are directed to the pre-prod site, with no discernible difference except that the DTs aren't registered properly. Could a banner be added, please?

16 August 2021

  • Activity:
    • MC and WG productions; Staging Ongoing; User jobs.
    • Re-Stripping Campaign might start next week
  • Issues:
    • RAL: Had an issue with Downtime over the weekend

2 August 2021

  • Activity:
    • MC and WG productions; Some Staging; User jobs.
  • Issues:
    • NTR

26 July 2021

  • Activity:
    • MC and WG productions; User jobs.
    • A large stripping campaign will start soon; this will involve significant staging from everywhere
  • Issues:
    • CERN: Issues with FTS transfers over the last week. CTA admins investigating. (GGUS:153132)

19 July 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • NIKHEF: Issues with one CE (brug.nikhef.nl) for last week or so. Pilots being aborted and now submission problems (GGUS:152946)

12 July 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • NTR

5 July 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CERN: Requested an increase of 1.5 PB to continue staging (GGUS:151868)
    • IN2P3: Failed transfers due to faulty RAID controller (GGUS:152859)

7 June 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • NTR

29 March 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • T0+1: Main issue is staging CTA -> RAL. Addressed in (GGUS:150898)

22 March 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CERN: Migration to CTA fully over. In production. Some failed FTS transfers to RAL (checksum issue), some TPC issues found and 2 xroot issues open
    • T0+1: moving to using Singularity for payload isolation. A few issues found at RAL last week (Singularity inside Docker); would need to re-check if still there.

15 March 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CERN: Migration to CTA is ongoing. Testing, re-opening the valves this afternoon if everything's OK.

01 March 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CERN: Migration to CTA to start this week, putting CASTOR in read-only mode on Wednesday.

22 February 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CERN: ticket (GGUS:150647) for checksum issues, in progress
    • Migration to CTA might start next week, and we might stop CASTOR this week
    • Submissions to HTCondor-CEs no longer problematic (some tweaking on our side)

15 February 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CERN: ticket (GGUS:150584) for jobs HELD at CERN
      • ticket was closed - we lost pilots there.
    • IN2P3:Job submission issue (GGUS:150406)
      • was closed
    • GRIDKA: Job submission issue (GGUS:150403)
    • From the above tickets, and from others: we have a general issue with how to run on HTCondor-CEs. We don't want to run a local schedd, as this is not something formally required of us. As a temporary measure, we have been adding a line to our submission string so that the job (pilot) output is deleted within 24 hours, but this may be fragile. A proper solution will not come easily (it would require development) and is not on the horizon.
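
As an illustration only (the exact line LHCb adds to its submission string is not given above; the values here are assumptions), the kind of HTCondor expression that keeps a completed pilot's spooled output around for a bounded time looks like:

```
# Hypothetical sketch of a submit-file line: keep a completed job
# (JobStatus == 4) and its spooled output in the queue for up to
# 24 hours (86400 s) after completion, then let HTCondor clean it up.
leave_in_queue = (JobStatus == 4) && ((time() - CompletionDate) < 86400)
```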

01 February 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:

11 January 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • IN2P3: xroot TPC issues. Investigation ongoing
    • NL-T1: preparation for dCache namespace migration ongoing

04 January 2021

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • IN2P3: Number of Running jobs decreased: (GGUS:150078)

14th December 2020

  • NTR
  • Activity:
    • MC and WG productions; User jobs.

30 November 2020

  • PIC dCache namespace migration today without downtime
  • Activity:
    • MC and WG productions; User jobs.

16 November 2020

  • IN2P3 dCache namespace migration to be performed 17/11/20 (tomorrow) without downtime
  • Activity:
    • MC and WG productions; User jobs.

09 November 2020

  • IN2P3 dCache namespace migration to be performed 17/11/20 without downtime
  • Activity:
    • MC and WG productions; User jobs.

02 November 2020

  • Activity:
    • MC and WG productions; User jobs.

26 October 2020

  • Activity:
    • MC and WG productions; User jobs.

19 October 2020

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • RAL: Issue with one user unable to access data due to auth issue (GGUS:148701) Resolved

21 September 2020

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • RAL: Issue with one user unable to access data due to auth issue (GGUS:148701)

7 September 2020

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:
    • CNAF: Issue discovered with Storm release from ~10 days ago that causes puppet to restart service. Puppet temporarily stopped until fix is released.
    • IN2P3: Issue with users unable to download data from storage due to auth problem (GGUS:148550)

31 August 2020

  • Activity:
    • MC and WG productions; User jobs.
  • Issues:

24 August 2020

  • Activity:
    • Usual MC, user and WG production.
  • Issues:
    • SARA/NIKHEF, GGUS:148321: Timeouts during transfers from SARA-BUFFER. Already investigated last week; the problem is now much mitigated but still present. Will update the ticket.

10 August 2020

  • Activity:
    • Usual MC, user and WG production.
  • Issues:
    • IN2P3: decreased number of running jobs: fixed
    • RAL: FTS3 transfer issue (GGUS:148187): fixed

03 August 2020

  • Activity:
    • Usual MC, user and WG production.
  • Issues:

27 July 2020

  • Activity:
    • Usual MC, user and WG production.
    • Namespace migration of dCache completed last week. Many thanks to GridKa and dCache experts for help with this.

20 July 2020

  • Activity:
    • Usual MC, user and WG production.
    • Tomorrow, dCache namespace re-ordering at Gridka
    • Exceptionally agreed to have two T1s (CNAF/GRIDKA) down at the same time

13 July 2020

  • Activity:
    • Usual MC, user and WG production.
    • Preparing Bookkeeping Downtime for Oracle DB upgrade on Wednesday

29 June 2020

  • Activity:
    • Usual MC, user and WG production.
  • Issues:

15 June 2020

  • Activity:
    • Usual MC, user and WG production.
  • Issues:
    • CERN: DATA transfer problem GGUS:147426 (fixed)
    • FZK-LCG2: CE unavailable GGUS:147431 (fixed)
    • RAL: low number of running jobs

08 June 2020

  • Activity:
    • Usual MC, user and WG production. The restripping of PbNe 2018 data started in the middle of last week (should be finished in 1 or 2 days).
  • Issues:
    • PIC: Tape buffer became full due to data recall for restripping campaign. Now buffer size increased (GGUS:147344).

01 June 2020

  • Activity:
    • Usual MC and working group productions
  • Issues: Nothing new to report.

25 May 2020

  • Activity:
    • Usual MC and working group productions
  • Issues: Nothing new to report.

18 May 2020

  • Activity:
    • Ongoing WG MC productions, Heavy Ion 2018 stripping validation
  • Issues: Nothing new to report

11 May 2020

  • Activity:
    • Ongoing WG MC productions
  • Issues:
    • GRIDKA: Some ongoing file access issues. FZK are waiting on vendor. (GGUS:146379)

4 May 2020

  • Activity:
    • Ongoing WG MC productions
    • We have ticketed all Tier 2 sites that are still running SL6 to ask them to upgrade to CC7. Not surprisingly, they have mostly said it'll have to wait until after quarantine!
  • Issues:
    • CERN: Issues with file access on EOS. Solved and ticket to be closed (GGUS:146673)
    • GRIDKA: Some ongoing file access issues. FZK are waiting on vendor. (GGUS:146379)

27 April 2020

  • Activity:
    • Preparing for the next Stripping round
    • Ongoing WG MC productions
  • Issues:

20 April 2020

  • Activity:
    • Preparing for the next Stripping round
    • Ongoing WG MC productions

6 April 2020

  • Activity:
    • Staging for next stripping round.

30 March 2020

  • Activity:
    • Stripping.
    • Staging for next stripping round.
  • Issues:

23 March 2020

  • Activity:
    • Stripping campaign is finished.
    • Staging is finished.
  • Issues:
    • NTR

16 March 2020

  • Activity:
    • Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
    • Staging for 2016 is almost finished. Staging for 2017 is ongoing.
  • Issues:
    • NTR

24 February 2020

  • Activity:
    • Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
    • Staging for 2016 is almost finished. Staging for 2017 is ongoing.
    • Validating "new" role for SAM/ETF tests.
  • Issues:
    • CERN: News about the issues last week? (no DT?).
    • CNAF: low level of running jobs (GGUS:145692). No reply yet.

17 February 2020

  • Activity:
    • Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
    • Staging for 2016 is almost finished. Staging for 2017 is ongoing.
  • Issues:
    • no significant issues

10 February 2020

  • Activity:
    • Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
    • Staging re-started last week, almost finished everywhere
  • Issues:
    • no significant issues

27 January 2020

  • Activity:
    • Stripping campaign ongoing
    • Staging will re-start this week based on the stripping activity
  • Issues:
    • no significant issues

20 January 2020

  • Activity:
    • Heavy staging at sites will start again during the week because of stripping campaign

13 January 2020

  • Activity:
    • Running MC simulation, WG productions, user analysis.
  • Issues:
    • RAL: failing access by protocol to files in ECHO

6 January 2020

  • Activity:
    • Running MC simulation, WG productions, user analysis.
  • Issues:
    • RAL: failing access by protocol to files in ECHO

16 December 2019

  • Activity:
    • Running MC simulation, WG productions, user analysis.
  • Issues:
    • RAL: failing access by protocol to files in ECHO

9 December 2019

  • Activity:
    • Running at ~100K jobs: MC simulation, WG productions, user analysis.

2 December 2019

  • Activity:
    • Smooth running at ~100K jobs: MC simulation, WG productions, user analysis.
    • Had already staged 2015, 2016, 2017 data for the re-stripping campaign; awaiting validation by the physics groups
  • Issues:
    • no significant issues

25 November 2019

  • Activity:
    • Smooth running at ~120K jobs: MC simulation, WG productions, user analysis.
    • Have staged 2015, 2016, 2017 data for the re-stripping campaign; awaiting validation by the physics groups
  • Issues:
    • no significant issues

18 November 2019

  • Activity:
    • Smooth running at ~120K jobs: MC simulation, WG productions, user analysis.
    • Have staged 2015, 2016, 2017 data for the re-stripping campaign; awaiting validation by the physics groups
  • Issues:
    • no significant issues

11 November 2019

  • Activity:
    • Usual MC, user jobs and data restripping.
    • Continuing staging (tape recall) at all T1s
  • Issues:
    • Slow transfers (timeouts) for data upload from WNs to external SEs at RAL and FZK (being looked at)
    • FTS transfer failures from SARA (presumably network problems at some pool nodes, nodes rebooted)
    • RAL: failing direct access to files in ECHO if several applications access the same file simultaneously
    • IN2P3: Singularity is still not available; becoming urgent

04 November 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Continuing staging (tape recall) at all T1s
  • Issues:
    • Nothing new to report

28 October 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Continuing staging (tape recall) at all T1s
  • Issues:
    • CNAF: Files cannot be accessed (GGUS:143816, in progress)
    • GRIDKA: ARC CE unavailable (GGUS:143814, in progress)
    • IN2P3: It's the only T1 that is not ready for Singularity. No representative!

21 October 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Continuing staging (tape recall) at all T1s

  • Issues:
    • GRIDKA: Issue with Data transfers saturday morning, fixed after a few hours. Investigating for lost files.

30 September 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Continuing staging (tape recall) at all T1s

  • Issues:
    • RAL:
      • GGUS:142350; (old ticket) Under investigation : problem with user jobs at RAL.
      • GGUS:143323; Problem deleting files on ECHO at RAL.
      • RAL has been running jobs at a much lower level than previously since last Wednesday

23 September 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Continuing staging (tape recall) at all T1s

  • Issues:
    • CERN:
    • RRCKI:
    • NIKHEF:
    • RAL:
      • GGUS:142350; (old ticket) Under investigation. User jobs increased, no queue. Issue seems to continue.

16 September 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Continuing staging at T1s

  • Issues:
    • RAL:
      • GGUS:142350; (old ticket) Under investigation. User jobs increased, no queue. Issue seems to continue.

9 September 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Massive staging at all T1

  • Issues:
    • RAL:
      • GGUS:142350; Under investigation. User jobs increased, no queue. Issue seems to continue.

2 September 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Massive staging at all T1

  • Issues:
    • RAL:
      • GGUS:142350; Under investigation. User jobs increased, no queue. Issue seems to continue.

26 August 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Massive staging at all T1

  • Issues:
    • RAL:
      • GGUS:142350; still issues accessing files on ECHO. Under investigation.
    • CNAF:

19 August 2019

  • Activity:
    • MC, user jobs and data restripping.
    • Massive staging at all T1

  • Issues:
    • RAL:
      • GGUS:142350; still issues accessing files on ECHO. Under investigation.
    • CNAF:
      • Outage. DT extended until Wed.
    • IN2P3
      • Staging problem

12 August 2019

  • Activity:
    • MC, user jobs and staging.
    • In the coming weeks, we will perform a lot of tape recall.
  • Issues:
    • RAL:
      • GGUS:142350; still issues accessing files on ECHO. Under investigation.
    • CNAF:
      • Outage since the middle of last week. DT extended until approx. Wednesday.
    • PIC:
      • Problem with ONLINE.git access (GGUS:142673). Response: heavy load, "not optimal".
    • IN2P3
      • FTS failures IN2P3-RDST -> BUFFER (GGUS:142670). Thousands of "[SE][StatusOfBringOnlineRequest][SRM_FAILURE]" errors. No response on the ticket.
    • NIKHEF
      • No jobs running (GGUS:142680); coming back now (this morning). Ticket can be closed.

5 August 2019

  • Activity:
    • MC, user jobs and staging.
    • in the coming weeks we will perform a lot of tape recall.
    • remind T2D sites that they should close their solved tickets
  • Issues:
    • CBPF:
    • RAL:
      • GGUS:142350; still issues accessing files on ECHO. Under investigation.

29 July 2019

  • Activity:
    • MC, user jobs and staging.
    • Important to have all sites with split aliases, otherwise we need to maintain a hack in the code. Still missing SARA (the split is difficult from their side).
  • Issues:
    • RAL:
      • GGUS:142350; accessing files problem on ECHO. Under investigation.

22 July 2019

  • Activity:
    • MC, user jobs and staging. Validation for reprocessing of 2011 and 2012 data finished, will start productions soon.
    • Almost all sites agreed to split aliases for the disk and tape endpoints. Still missing SARA (the split is difficult from their side). Important to have all sites with split aliases, otherwise we need to maintain a hack in the code.
  • Issues:
    • RAL:
      • GGUS:142350; problem accessing files on ECHO. Under investigation.
      • GGUS:142337; issues with killed pilots. Under investigation

15 July 2019

  • Activity:
    • MC, user jobs and staging.
  • Issues:
    • CNAF: All data transfers Failed at INFN-T1; ticket GGUS:142239; fixed

26 June 2019

  • Activity:
    • MC, user jobs and re-stripping of 2018 data.
  • Issues:
    • SARA-MATRIX: Files access problem; alarm ticket GGUS:141783; fixed
    • CNAF: Data transfers failure; alarm ticket GGUS:141790; fixed
    • RAL: Problem with CASTOR GGUS:141872

17th June 2019

  • Activity:
    • Usual activity running at 110k jobs: MC, user jobs and re-stripping of 2018 data.
  • Issues:
    • NTR

3rd June 2019

  • Smooth running at ~100K jobs, Usual activity
    • User jobs, MC productions, and WG productions this week
  • Issues
    • RAL:
      • Timeouts when accessing job input data (GGUS:141462)
      • Auth failures for accessing files by user jobs (GGUS:141262)
    • CERN:
      • Poor transfer efficiency from CERN WN to outside storage GGUS:141112

27th May 2019

  • Smooth running at ~100K jobs, Usual activity
    • User jobs, MC productions, and WG productions this week
  • issues which are not significant, but potentially may be of interest to other experiments:
    • in progress: Poor transfer efficiency from CERN WN to outside storage GGUS:141112
    • Users getting "[FATAL] Auth failed" at RAL (GGUS:141262)

20th May 2019

  • Usual activity
    • User jobs, MC productions, and staging this week
  • no significant issues to report

13th May 2019

  • Activity
    • User jobs, MC productions, and staging this week
  • Issues

6th May 2019

  • Activity
    • User jobs, MC productions, and staging this week
  • Issues
    • CERN:
    • RAL:
      • Continuing migration from Castor to ECHO

29th April 2019

  • Activity
    • User jobs, MC productions, and staging this week
  • Issues
    • RAL:
      • Continuing migration from Castor to ECHO
    • IN2P3:
      • Unscheduled warning downtime this morning for Patch for NFS mount problem

15th April 2019

  • Activity
    • User jobs, MC productions, staging and some reprocessing this week.
  • Issues
    • RAL:
      • Continuing migration from Castor to ECHO
      • A disk server (gdss811) is down - causing various hold-ups and slow-downs of the different productions and the migration
    • PIC : Machine ran out of disk space (GGUS:140715) fixed now - thanks!
    • IN2P3 : Batch system issues (GGUS:140652) possibly ongoing

8th April 2019

  • Activity
    • User jobs, MC productions, staging and some reprocessing starting this week.

  • Issues
    • RAL:
      • A restart of docker killed a number of jobs last week. RAL investigating the cause (GGUS:140589)
      • A disk server was in a bad state that caused timeouts on opening some files (GGUS:140599)

1st April 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: several tickets open:
    • PIC: All pilots failed. There was an error in the JobRouting definition in HTCondor-CE - solved ( GGUS:140482)

25th March 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: several tickets open:
    • IN2P3: All data transfers Failed at IN2P3-CC, but problem solved and understood - it was due to the CRL update ( GGUS:140354)

18th March 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: Some tickets for CERN are still open
    • IN2P3: downtime
    • CNAF: FTS3 transfers to QMUL

11th March 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: Some tickets for CERN/EOS are still open, even though the problems are mostly gone. Not clear why.

4th March 2019

  • Activity
    • User jobs and MC production

  • Issues
    • CERN: Some ongoing EOS issues both writing and reading GGUS:139927
    • CERN/VOMS: Proxy renewal for SAM tests has stopped working. VOMS team investigating GGUS:139920 (Update: This looks like it is now fixed!)

25th February 2019

  • Activity
    • User jobs and MC production

  • Issues

18th February 2019

  • Activity
    • User jobs and MC production
    • Stripping s35

11th February 2019

  • Activity
    • User jobs and MC production
    • Stripping s35

  • Sites Issues

4th February 2019

  • Activity
    • User jobs and MC production
    • Stripping s35 and s35r1 for PbPb

28th January 2019

  • Activity
    • Data reconstruction for 2018 data on going
    • User jobs running and MC jobs at "full steam"
  • Sites Issues
    • CERN : NTR
    • Tier-1s : NTR

21st January 2019

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Sites Issues
    • CERN : NTR
    • Tier-1s : NTR

14th January 2019

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN : (GGUS:139077) closed, thanks Jan. Recycle bin "back to a nice safety margin".
    • RAL : Aborted pilots (GGUS:139081)

7 January 2019

  • Activity
    • Data reconstruction for 2018 data
    • User, WG processing and MC jobs
  • Site Issues
    • CERN : Curious to know the status of the CERN cloud T-systems (GGUS:139080), RHEA (GGUS:138848)
    • CERN : Also ran out of space in EOS "recycle-bin" (GGUS:139077) earlier today. Requesting a shorter retention period for now, before we decide on further measures
    • RAL : Aborted pilots (GGUS:139081)
Also a few other issues over the holiday period which were resolved either internally or through GGUS tickets.

17 December

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
    • Staging data for reprocessing in 2019
  • Site Issues
    • SARA: Ticket opened during the weekend concerning tape migration issues, quickly fixed on Saturday night... Thanks a lot!
  • Thanks all Sites for this great year!

10 December

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
    • Staging data for reprocessing in 2019
  • Site Issues

03 December

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • SARA: Ticket open concerning data transfer problems (GGUS:138472); site waiting on CERN input

26 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • SARA: Ticket open concerning data transfer problems (GGUS:138472)

19 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • PIC: Data Access problems during the weekend, solved
    • GRIDKA: Downtime declared for tomorrow
    • SARA: Ticket open concerning data transfer problems (GGUS:138293)

12 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: spike in failed jobs on Sunday, currently investigating, FTS delegation issue (GGUS:138063)
    • SARA,IN2P3 FTS3 data transfer problem SARA <=> IN2P3 (GGUS:137967) (GGUS:137972)
    • RAL: FTS issues (server removed from Configuration) (GGUS:137822)
    • IN2P3: Decreased transfer efficiency (GGUS:137918)

5 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

29 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

22 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • RAL: FTS issues (server removed from Configuration) (GGUS:137822)

15 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • NTR

8 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • NTR

1 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

24 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

17 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

10 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: Pilot submission problem (GGUS:137037); Solved
    • CERN: Problem with accessing files (GGUS:137079)
    • CNAF: Minor problems at worker nodes

3 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • RAL: Failing disk server at RAL resulting in jobs failing to get input data

27 August

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: The reported problem with uploads to EOS via xrootd (GGUS:136720) is likely related to the LHCb bundled grid middleware; the fix is being tested
    • RAL: IPv6 connection problems resulting in failed FTS transfers (GGUS:136863)
    • RAL: Failing disk server at RAL resulting in jobs failing to get input data

13 August

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • NTR

06 August

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

30 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

23 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: File transfer problems. Looks like it is related to a problematic FTS server. Under investigation (GGUS:136275)

16 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • IN2P3: There's a ticket for file transfer errors. "Better now" but needs to be investigated (GGUS:136067)
    • CNAF: Ticket opened (GGUS:136120) for failing pilots; under investigation. Another ticket for file transfer errors, in progress (GGUS:136123).

9 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs

  • Site Issues
    • NTR

2 July

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

25 June

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

18 June

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

11 June

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

4 June

  • Activity
    • Data reconstruction for 2018 data

  • Site Issues
    • NTR

28 May

  • Activity
    • Data reconstruction for 2018 data

  • Site Issues
    • NIKHEF: Pilots Failed (GGUS:135325) during weekend; Fixed.
    • Most pilots at ce515.cern.ch finished "successfully" without matching jobs due to missing CVMFS.

30 April

  • Activity
    • HLT farm off for MC

  • Updates
    • NTR
  • Site Issues
    • CERN: Staging issues. Many "Connection reset by peer" errors on Castor (GGUS:134755), related to FTS proxy renewal. Noticed from the 25th; a bump of failures also this morning.

23 April

  • Activity
    • HLT farm to be used for some more time in parallel with the trigger

  • Updates
    • Deploying LHCbDIRAC with GLUE2 support today

  • Site Issues
    • IN2P3: Some tape files lost (GGUS:134666), recopied from other sites.
    • PIC: Staging problems (GGUS:134667)
    • CNAF: LHCb completed data management actions after the long downtime.

16 April

  • Activity
    • HLT farm to be used for some more time in parallel with the trigger

  • Site Issues
    • SARA: data access problems (GGUS:134545) being worked on
    • CNAF: working with the site to resurrect last 60 files for the re-stripping

9 April

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping almost 100% finished
    • Stripping 29 reprocessing is ongoing

  • Site Issues
    • SARA: Data transfer issues (GGUS:134451). Being tracked down as a CRL issue.
    • CNAF: IPv6 issues on one CE (GGUS:134456) already solved
    • IN2P3: xrootd server may be broken (GGUS:134441)

19 March

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping ongoing
    • Stripping 29 reprocessing is ongoing

  • Site Issues

12 March

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping ongoing
    • Stripping 29 reprocessing is ongoing

  • Site Issues
    • CNAF: coming back to life, but storage not working since Sunday evening

  • Tier2D
    • UK certificate problems for users solved by upgrading the xrootd server

05 March

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping ongoing
    • Stripping 29 reprocessing is ongoing

  • Tier2D
    • Users with UK certificates are having problems accessing data at CBPF, Glasgow, CSCS, NCBJ (3 DPM, 1 dCache): GGUS:133667, GGUS:133617

26 February

  • Activity
    • HLT farm fully running after dip over the weekend
    • MC simulation and user jobs
    • 2017 data restripping ongoing
    • Started stripping 29 reprocessing

  • Site Issues
    • NTR

19 February

  • Activity
    • HLT farm fully running
    • MC simulation and user jobs
    • 2017 data restripping should be started

  • Site Issues
    • NTR

12 February

  • Activity
    • HLT farm fully running
    • MC simulation and user jobs

  • Site Issues
    • CERN/T0: problem with updating DBOD - LHCbDirac was in downtime for almost a week

05 February

  • Activity
    • HLT farm fully running
    • 2016 data restripping, MC simulation and user jobs

29 January

  • Activity
    • HLT farm is partially running
    • 2016 data restripping, MC simulation and user jobs

  • Site Issues
    • CERN/T0
      • NTR
    • T1
      • RAL: Data transfers problem ALARM ticket (GGUS:133082); Solved.
      • IN2P3: Data transfers problem (GGUS:133081); Solved, but there was no reply on the ticket for two days.
      • SARA: No running jobs (GGUS:133089)

22 January

  • Activity
    • HLT farm "returning" from cooling maintenance (no jobs running yet)
    • 2016 data restripping running full steam. Almost all data processed (waiting for CNAF)
    • Monte Carlo productions using remaining resources.

  • Meltdown & Spectre, several voboxes rebooting this week.

  • Site Issues
    • CERN/T0
      • NTR
    • T1
      • GRIDKA: problems with FTS transfers (from and to) and "put and register" (fixed; checked during the meeting)

15 January

  • Activity
    • Running at the maximum possible level of resources. HLT farm stopped yesterday and returns "when cooling is stable again"
    • 2016 data restripping running full steam. Approx 1/2 of data processed (without CNAF) during YETS
    • Monte Carlo productions using remaining resources.

  • Meltdown & Spectre: performance hit after the fix is expected to be less critical for data processing and Monte Carlo jobs (which account for the vast majority of work carried out).
    • voboxes patch: reboot will be tomorrow.

  • Site Issues
    • CERN/T0
      • NTR
    • T1
      • RRCKI problems with FTS transfers currently under investigation.
      • RAL had issues during weekend. "Burst"(jobs) reduced and all looks OK today.

8 January

  • Activity
    • Running at the maximum possible level of resources, including the fully available HLT farm during YETS
    • 2016 data restripping running full steam. Approx 1/2 of data processed (without CNAF) during YETS
    • Monte Carlo productions using remaining resources.

  • Meltdown & Spectre: performance hit after the fix is expected to be less critical for data processing and Monte Carlo jobs (which account for the vast majority of work carried out).
    • Need to patch voboxes, waiting for instructions from CERN

  • Site Issues
    • CERN/T0
      • ALARM ticket (GGUS:132628) for EOS transfer problems fixed internally by LHCb
    • T1
      • RRCKI problems with FTS transfers currently under investigation

18 December

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • T1
      • RAL: problems with file upload (GGUS:132540) - possibly solved. Internal ticket opened about pilots killed at RAL (not by LHCb).
      • SARA: Waiting for end of downtime.
      • Missing files: RAW files found missing at RRCKI (recovered), PIC (recovered) and IN2P3 (under investigation).
    • CERN :
      • Brief downtime of multiple database services yesterday. Also possibly a similar issue last week too.
      • Staging failures (GGUS:132516) - we hope that the 3-day timeout request is not long-term.
      • Missing files on tape (GGUS:132525) - solved?

11 December

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • T1
      • RAL: problems with file download from Castor (GGUS:132356)
      • RRC-KI: Downtime for the tape storage update, should be finished now
      • FZK: foreseen network maintenance on 12 Dec; expect possible temporary connectivity problems. Temporary file unavailability due to disk pool migration (should be mostly transparent for users)

27 November

  • General
    • almost no free disk space left; still waiting for complete deployment of the 2017 disk pledge

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • T1
      • SARA: problems with transfers today (GGUS:132067), no longer observed
      • RRC-KI: problems with file access, reported as fixed
      • FZK: one WN without CVMFS (GGUS:132064), solved almost instantly

20 November

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • NTR

13 November

  • Activity
    • New round of stripping validation before launching the campaign.

  • Site Issues
    • INFN-T1:
      • Several issues because of the site outage, in all areas of experiment distributed computing. Currently working on an analysis of the situation, also in view of upcoming data processing campaigns.

1 November

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • Validation for restripping completed; waiting for a response

  • Site Issues
    • T0:
      • User had incorrect mapping at EOS; fixed

    • T1:
      • NTR

30 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging progressing well

  • Site Issues
    • T0:
      • NTR

23 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging approx 50% complete, progressing well

  • Site Issues
    • T0:
      • NTR

    • T1:
      • NTR

16 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging of 2015 data for reprocessing progressing well, ~ 1/3 of data on disk buffers

  • Site Issues
    • T1:
      • INFN-T1 tape buffer running full, fixed by site admins
      • RAL disk server down with effects on production workflows

  • Aob
    • Request grid wide deployment of latest HepOSlibs meta-rpm, including deployment of git client.

09 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging of 2015 data for reprocessing has started and will continue for several weeks.

  • Site Issues

    • T1:
      • Failures in transfers to and from RRCKI over the weekend, solved now.
      • NL-T1 worker nodes in downtime tomorrow and Wednesday.
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne

02 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging of 2015 data for reprocessing has started and will continue for several weeks.

  • Site Issues

    • T1:
      • Failures in transfers to and from GRIDKA (GGUS:130848); this was due to heavy load on dCache. It is stable now.
      • File upload and download failures at CNAF due to a hardware failure, which has already been fixed.
      • Missing expatbuilder at NIKHEF-ELPROD (GGUS:130832); solved
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne

    • T2:
      • Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715); solved

25 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis (running more than 100K jobs)

  • Site Issues

    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne
      • Problem downloading from SARA (GGUS:130692). Solved promptly by SARA - thanks. Problem with stuck dCache space manager

    • T2:
      • Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715)

18 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis (running more than 100K jobs)

  • Site Issues

    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); "Geant have confirmed that they are unable to ping mouse1.grid.sara.nl from geant-lhcone-gw.mx1.lon.uk.geant.net"

11 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues

    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); "Geant have confirmed that they are unable to ping mouse1.grid.sara.nl from geant-lhcone-gw.mx1.lon.uk.geant.net"
      • Access file problem at GRIDKA (GGUS:130478)

4 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Incomplete python installation at worker nodes (GGUS:130018)
    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); no news
      • Failed transfers from many sites to dCache sites, see GGUS:130190; resolved by using the proper parameter in SRM
      • We see a peak of failed transfers at EOS every day at 5:00 (GGUS:130335)

28 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Incomplete python installation at worker nodes (GGUS:130018)
    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946)
      • Failed transfers from many sites to dCache sites, see (GGUS:130190)

21 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Problem with EOS in the night between Friday and Saturday (GGUS:130137). Escalated to an ALARM on Saturday morning. Fixed now; 3 of 7 grid-ftp doors were misbehaving

    • T1:
      • NTR

14 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Problem with installation of python possibly broken on multiple WNs (GGUS:130018) - ongoing issue

    • T1:
      • Problems uploading to various SEs - For SARA, tracked in GGUS:129946. Now also seen in FZK, IN2P3 and PIC - to be tracked and tickets opened if needed.

7 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Key VO Box (lbvobox103) unavailable (lost?) due to a hypervisor problem (GGUS:129942). No GGUS (or Service Now) updates since yesterday morning. Having to recreate services on other VO Boxes.
    • T1:
      • NTR

31 Jul (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • NTR
    • T1:
      • NTR

17 Jul (Monday)

  • Activity
    • Lots of user analysis (some failing) and Monte Carlo simulation

  • Site Issues
    • T0: Jobs held at HTCondor CEs at CERN-PROD (GGUS:129147)
    • T1:
      • NTR

10 Jul (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues
    • T0: Jobs held at HTCondor CEs at CERN-PROD (GGUS:129147)
    • T1:
      • NTR

03 Jul (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues
    • T0: Jobs held at HTCondor CEs at CERN-PROD (GGUS:129147)
    • T1:
      • IN2P3: Downloads and Uploads issues during weekend, fixed

26 Jun (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues

29 May (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues
    • T0:
      • Network/DNS outage yesterday caused problems for a few hours. All recovered now.

29 May (Monday)

  • Activity
    • User analysis and Monte Carlo simulation
    • Stripping v24 is almost finished.

  • Site Issues

22 May (Monday)

  • Activity
    • User analysis and Monte Carlo simulation
    • New validation of Stripping v24 has been started.

  • Site Issues
    • T1:
      • RAL: disk server failures during the weekend

15 May (Monday)

  • Activity
    • Stripping v24 waiting for developers. Almost 100k jobs running

  • Site Issues
    • T0:
      • EOS downtime this morning.

8 May (Monday)

  • Activity
    • Stripping v28 over, Stripping v24 waiting for developers.

  • Site Issues
    • T1:
      • RAL: Disk server failure last week, back in production today. Some FTS3 timeouts during staging.

24 April (Monday)

  • Activity
    • MC Simulation, Data Stripping and user analysis

  • Site Issues
    • T0: Some ongoing problems with Condor CEs (GGUS:127553)
    • T1:
      • RAL: Still running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617). Hoping these problems will be fixed by the CASTOR upgrade a week on Wednesday

18 April (Tuesday)

  • Activity
    • MC Simulation, Data Stripping and user analysis
    • Staging campaigns are ongoing for Data Stripping.

  • Site Issues
    • T0: SRM problems fixed quickly last week (GGUS:127638)
    • T1:
      • CNAF: Uploading problems over the weekend fixed (GGUS:127728); due to another VO's GPFS usage pattern.
      • RAL: running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617), but better than the situation before the version downgrade.
      • RRCKI: running with a limit on the number of user jobs due to limits on concurrent open files in dCache (no GGUS for this)
    • T2: Seeing SL6.9 openssl problems at several sites. Tickets issued.

10 April (Monday)

  • Activity
    • MC Simulation, Data Stripping and user analysis
    • Staging campaigns are ongoing for Data Stripping.

  • Site Issues
    • T0:
      • Transfer errors from the job: could not open connection to srm-eoslhcb.cern.ch (GGUS:127638)

    • T1:
      • RAL: two ALARM tickets were opened during the weekend.
      • CNAF: failed contact to the SRM: could not open connection to storm-fe-lhcb.cr.cnaf.infn.it:8444 (GGUS:127608)

03 April (Monday)

  • Activity
    • MC Simulation, Stripping
    • Staging campaign for Stripping27, Stripping28 and Stripping24b, as well as 2015 EM, should take 6 to 7 weeks with peaks of staging.

  • Site Issues
    • T0:
      • The 3 gridftp doors were saturated; 2 new ones added.

    • T1:
      • RAL: suffering a major issue with SRM. Under investigation
      • CNAF: Stager was blocked for a while
      • FZK: Seem to have found a somewhat correct balance between timeouts and performance for transfers

27 March (Monday)

  • Activity
    • MC Simulation, Stripping
    • Staging campaign for Stripping27, Stripping28 and Stripping24b should take 6 to 7 weeks with peaks of staging.

  • Site Issues

    • T1:
      • RAL: disk server gdss780 is currently unavailable.
      • CNAF: Added an additional drive for staging.
      • PIC: LTO5 drive is supposed to be replaced today; staging could be slower than usual.
      • FZK: FTS transfers fail (GGUS:127301). Under investigation.

20 March (Monday)

  • Activity
    • MC Simulation, Stripping
    • Database backup locking and long queries from us on Friday caused severe disruption to the LHCb production management system over the weekend and into today, both for data and MC. A lot of manual work has been done to resolve inconsistencies.

  • Site Issues
    • T0:
      • ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights. No update since 9th March.
      • GGUS:127148: jobs are being killed (rather than just limited by cgroups) when using more than 2GB of physical memory under contention. The LHCb VO ID card requests 4GB of virtual memory, and jobs typically use significantly less than 2GB RSS for almost all of their duration.
    • T1:
      • FZK: Some ongoing issues with submission timeouts to the new ARC CEs with arc-2-kit not working at all (GGUS:127075). Also GGUS:127122 with transfer timeouts causing lots of queued transfers in our production system.
      • CNAF: GGUS:127129 reported a number of file transfer failures, but this problem seems to be resolved now. We have also had files that appear to have been transferred successfully but are not actually present; this appears to be a consequence of the database problems we had on Friday rather than due to CNAF.

13 March (Monday)

  • Activity
    • MC Simulation, Stripping campaign now started so tape systems will start to be hit

  • Site Issues
    • T0:
      • ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights. No updates since 2nd March - any more news on fixes?
      • Ready for network outages on Wednesday morning - Thanks for shifting the DT from 22nd to 15th as well!

    • T1:
      • FZK: Some ongoing issues with submission timeouts to the new ARC CEs with arc-2-kit not working at all (GGUS:127075)

6 March (Monday)

  • Activity
    • MC Simulation, Stripping campaign to start this week which will increase load on T1 tape systems

  • Site Issues
    • T0:
      • Wed: ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights.
      • Observed CVMFS failures on batch and cloud machines (GGUS:126876). Failure rate decreased now

    • T1:
      • SARA: SRM problems over the weekend (GGUS:126937). Currently cannot test whether this is fixed because the site is in downtime
      • FZK: Switching to ARC CEs only. Last week a software update for the ARC CEs produced failures (GGUS:126882). CREAM-CE submission already stopped from the LHCb side.
      • PIC: Currently in downtime for a dCache upgrade. Batch closed but CEs open, producing aborted pilots on the LHCb side.

27th February (Monday)

  • Activity
    • MC Simulation, user analysis and data reconstruction jobs

  • Site Issues
    • T0:
      • Some settings were changed at EOS SRM and should have fixed last week's problem
      • Intervention on the LHCb offline production database (LHCBR), moving to new hardware, on Wednesday 01/03/2017 from 10am to 12pm

20th February (Monday)

  • Activity
    • MC Simulation, user analysis and data reconstruction jobs

  • Site Issues
    • NTR

13th February (Monday)

  • Activity
    • MC Simulation, user analysis and data reconstruction jobs

  • Site Issues
    • T0:
      • Second instance of SRM for EOS LHCb is in production. Original EOS SRM reports zero available space from time to time.
    • T1:
      • CNAF: Downtime for 3 days
      • SARA: Downtime tomorrow (1 hour)

6th February (Monday)

  • Activity
    • MC Simulation and user analysis; reco jobs starting again

  • Site Issues
    • T0:
      • Major problems with SRM for LHCb use of EOS, unable to use it for most of the weekend (GGUS:126378). This led to loss of the results from 10,000s of jobs on HLT farm as it only connects to CERN. Appears to be resolved now: initial overloading led to avalanche of failures and retries, all increasing the load. Looking at ways to avoid this with IT and within LHCb.
    • T1:

30th January (Monday)

  • Activity
    • MC Simulation and user analysis

23rd January (Monday)

  • Activity
    • Mainly running simulation on grid only resources, HLT back and running ~10K jobs. ~67K jobs total

  • Site Issues
    • T0:
      • EOSLHCB "very slow" via SRM last week. Back to normal (GGUS:126037)
    • T1:
      • RAL: Storage in Downtime today from 10:30 to 12:30
      • CNAF: announced a Downtime on 13th and 14th February for changing of core switch

16th January (Monday)

  • Activity
    • Mainly running simulation on grid-only resources, HLT off because of maintenance

  • Site Issues
    • T0:
      • The LHCb internal name of the T0 batch resources has been renamed from LCG.CERN.ch to LCG.CERN.cern (to distinguish it from other .ch resources)
    • T1:
      • RAL: file access issue: users cannot open files (GGUS:125856); will be followed up after the Wednesday downtime

9th January (Monday)

  • Activity
    • very high activity during the Christmas break: running more than 100k jobs (a new record for LHCb!)
    • Data reconstruction (proton-ion) almost finished, MC and user analysis.

  • Site Issues
    • T0:
      • NTR
    • T1:
      • Transfer problem from GRIDKA to CBPF (GGUS:125789)
      • RAL: file access issue: users cannot open files (GGUS:125856).

Topic revision: r1986 - 2022-09-26 - MarkWSlater