Daily WLCG Operations calls :: collection of LHCb reports

Since April 2009, this twiki has collected all the LHCb reports given at the daily WLCG calls, held at 3pm Geneva time on Mondays and Thursdays. These calls are attended by the LHCb Grid expert and/or Stefan Roiser (on behalf of LHCb). Connection details: remote participation uses the Vidyo system. Instructions can be found here (deprecated: Alcatel URL). These reports must be compiled by the GEOC as part of the GEOC mandate.

Previous reports are available per year:
2018 2017 2016 2015 2014 2013 2012 2011 2010 2009


3rd June 2019

  • Smooth running at ~100K jobs, Usual activity
    • User jobs, MC productions, and WG productions this week
  • Issues
    • RAL:
      • Timeouts when accessing job input data (GGUS:141462)
      • Auth failures when user jobs access files (GGUS:141262)
    • CERN:
      • Poor transfer efficiency from CERN WN to outside storage GGUS:141112

27th May 2019

  • Smooth running at ~100K jobs, Usual activity
    • User jobs, MC productions, and WG productions this week
  • issues which are not significant, but potentially may be of interest to other experiments:
    • in progress: Poor transfer efficiency from CERN WN to outside storage GGUS:141112
    • Users getting <TNetXNGFile::Open>: [FATAL] Auth failed at RAL GGUS:141262

20th May 2019

  • Usual activity
    • User jobs, MC productions, and staging this week
  • no significant issues to report

13th May 2019

  • Activity
    • User jobs, MC productions, and staging this week
  • Issues

6th May 2019

  • Activity
    • User jobs, MC productions, and staging this week
  • Issues
    • CERN:
    • RAL:
      • Continuing migration from Castor to ECHO

29th April 2019

  • Activity
    • User jobs, MC productions, and staging this week
  • Issues
    • RAL:
      • Continuing migration from Castor to ECHO
    • IN2P3:
      • Unscheduled warning downtime this morning for Patch for NFS mount problem

15th April 2019

  • Activity
    • User jobs, MC productions, staging and some reprocessing this week.
  • Issues
    • RAL:
      • Continuing migration from Castor to ECHO
      • A disk server (gdss811) is down - causing various hold-ups and slow-downs of the different productions and the migration
    • PIC : Machine ran out of disk space (GGUS:140715) fixed now - thanks!
    • IN2P3 : Batch system issues (GGUS:140652) possibly ongoing

8th April 2019

  • Activity
    • User jobs, MC productions, staging and some reprocessing starting this week.

  • Issues
    • RAL:
      • A restart of docker killed a number of jobs last week. RAL is investigating the cause (GGUS:140589)
      • A disk server was in a bad state that caused timeouts on opening some files (GGUS:140599)

1st April 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: several tickets open:
    • PIC: All pilots failed. There was an error in the JobRouting definition in HTCondor-CE; solved (GGUS:140482)

25th March 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: several tickets open:
    • IN2P3: All data transfers failed at IN2P3-CC; the problem is solved and understood. It was due to the CRL update (GGUS:140354)

18th March 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: Some tickets for CERN are still open
    • IN2P3: downtime
    • CNAF: FTS3 transfers to QMUL

11th March 2019

  • Activity
    • User jobs and MC productions

  • Issues
    • CERN: Some tickets for CERN/EOS are still open, even though the problems are mostly gone. Not clear why.

4th March 2019

  • Activity
    • User jobs and MC production

  • Issues
    • CERN: Some ongoing EOS issues both writing and reading GGUS:139927
    • CERN/VOMS: Proxy renewal for SAM tests has stopped working. VOMS team investigating GGUS:139920 (Update: This looks like it is now fixed!)

25th February 2019

  • Activity
    • User jobs and MC production

  • Issues

18th February 2019

  • Activity
    • User jobs and MC production
    • Stripping s35

11th February 2019

  • Activity
    • User jobs and MC production
    • Stripping s35

  • Sites Issues

4th February 2019

  • Activity
    • User jobs and MC production
    • Stripping s35 and s35r1 for PbPb

28th January 2019

  • Activity
    • Data reconstruction for 2018 data on going
    • User jobs running and MC jobs at "full steam"
  • Sites Issues
    • CERN : NTR
    • Tier-1s : NTR

21st January 2019

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Sites Issues
    • CERN : NTR
    • Tier-1s : NTR

14th January 2019

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN : (GGUS:139077) closed, thanks Jan. Recycle bin "back to a nice safety margin".
    • RAL : Aborted pilots (GGUS:139081)

7 January 2019

  • Activity
    • Data reconstruction for 2018 data
    • User, WG processing and MC jobs
  • Site Issues
    • CERN : Curious to know the status of the CERN cloud T-systems (GGUS:139080), RHEA (GGUS:138848)
    • CERN : Also ran out of space in EOS "recycle-bin" (GGUS:139077) earlier today. Requesting a shorter retention period for now, before we decide on further measures
    • RAL : Aborted pilots (GGUS:139081)
Also a few other issues over the holiday period which were resolved either internally or through GGUS tickets.

17 December

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
    • Staging data for reprocessing in 2019
  • Site Issues
    • SARA: Ticket opened during the weekend concerning tape migration issues, quickly fixed on Saturday night. Thanks a lot!
  • Thanks all Sites for this great year!

10 December

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
    • Staging data for reprocessing in 2019
  • Site Issues

03 December

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • SARA: Ticket open concerning data transfers problems (GGUS:138472) site waiting on CERN input

26 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • SARA: Ticket open concerning data transfers problems (GGUS:138472)

19 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • PIC: Data Access problems during the weekend, solved
    • GRIDKA: Downtime declared for tomorrow
    • SARA: Ticket open concerning data transfers problems (GGUS:138293)

12 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: spike in failed jobs on Sunday, currently investigating, FTS delegation issue (GGUS:138063)
    • SARA, IN2P3: FTS3 data transfer problem SARA <=> IN2P3 (GGUS:137967, GGUS:137972)
    • RAL: FTS issues (server removed from Configuration) (GGUS:137822)
    • IN2P3: Decreased transfer efficiency (GGUS:137918)

5 November

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

29 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

22 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • RAL: FTS issues (server removed from Configuration) (GGUS:137822)

15 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • NTR

8 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • NTR

1 October

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

24 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

17 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

10 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: Pilot submission problem (GGUS:137037); Solved
    • CERN: Problem with accessing files (GGUS:137079)
    • CNAF: Minor problems at worker nodes

3 September

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • RAL: Failing disk server at RAL resulting in jobs failing to get input data

27 August

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: The reported problem with uploads to EOS via xrootd (GGUS:136720) is likely related to the LHCb bundled grid middleware; the fix is being tested
    • RAL: IPv6 connection problems resulting in failed FTS transfers (GGUS:136863)
    • RAL: Failing disk server at RAL resulting in jobs failing to get input data

13 August

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • NTR

06 August

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

30 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues

23 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • CERN: File transfer problems. Looks like it is related to a problematic FTS server. Under investigation (GGUS:136275)

16 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs
  • Site Issues
    • IN2P3: There's a ticket for file transfer errors. "Better now" but needs to be investigated (GGUS:136067)
    • CNAF: Ticket opened (GGUS:136120) for failing pilots; under investigation. Another ticket for file transfer errors, in progress (GGUS:136123).

9 July

  • Activity
    • Data reconstruction for 2018 data
    • User and MC jobs

  • Site Issues
    • NTR

2 July

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

25 June

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

18 June

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

11 June

  • Activity
    • Data reconstruction for 2018 data, MC simulation, user jobs

4 June

  • Activity
    • Data reconstruction for 2018 data

  • Site Issues
    • NTR

28 May

  • Activity
    • Data reconstruction for 2018 data

  • Site Issues
    • NIKHEF: Pilots Failed (GGUS:135325) during weekend; Fixed.
    • Most pilots at ce515.cern.ch finished "successfully" without matching jobs due to missing CVMFS.

30 April

  • Activity
    • HLT farm off for MC

  • Updates
    • NTR
  • Site Issues
    • CERN: Staging issues. Many "Connection reset by peer" errors on Castor (GGUS:134755), related to FTS proxy renewal. Noticed from the 25th, with a bump of failures also this morning.

23 April

  • Activity
    • HLT farm to be used for some more time in parallel with the trigger

  • Updates
    • Deploying LHCbDIRAC with GLUE2 support today

  • Site Issues
    • IN2P3: Some tape files lost (GGUS:134666), recopied from other sites.
    • PIC: Staging problems (GGUS:134667)
    • CNAF: LHCb completed data management actions after the long downtime.

16 April

  • Activity
    • HLT farm to be used for some more time in parallel with the trigger

  • Site Issues
    • SARA: data access problems (GGUS:134545) being worked on
    • CNAF: working with the site to resurrect last 60 files for the re-stripping

9 April

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping almost 100% finished
    • Stripping 29 reprocessing is ongoing

  • Site Issues
    • SARA: Data transfer issues (GGUS:134451). Being tracked down as a CRL issue.
    • CNAF: IPv6 issues on one CE (GGUS:134456) already solved
    • IN2P3: xrootd server may be broken (GGUS:134441)

19 March

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping ongoing
    • Stripping 29 reprocessing is ongoing

  • Site Issues

12 March

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping ongoing
    • Stripping 29 reprocessing is ongoing

  • Site Issues
    • CNAF: coming back to life, but storage not working since Sunday evening

  • Tier2D
    • Users with UK certificate problems solved by upgrading xrootd server

05 March

  • Activity
    • HLT farm fully running
    • 2017 data re-stripping ongoing
    • Stripping 29 reprocessing is ongoing

  • Tier2D
    • Users with UK certificates are having problems accessing data at CBPF, Glasgow, CSCS, NCBJ (3 DPM, 1 dCache) GGUS:133667, GGUS:133617

26 February

  • Activity
    • HLT farm fully running after dip over the weekend
    • MC simulation and user jobs
    • 2017 data restripping ongoing
    • Started stripping 29 reprocessing

  • Site Issues
    • NTR

19 February

  • Activity
    • HLT farm fully running
    • MC simulation and user jobs
    • 2017 data restripping should be started

  • Site Issues
    • NTR

12 February

  • Activity
    • HLT farm fully running
    • MC simulation and user jobs

  • Site Issues
    • CERN/T0: problem with updating DBOD; LHCbDIRAC was in downtime for almost a week

05 February

  • Activity
    • HLT farm fully running
    • 2016 data restripping, MC simulation and user jobs

29 January

  • Activity
    • HLT farm is partially running
    • 2016 data restripping, MC simulation and user jobs

  • Site Issues
    • CERN/T0
      • NTR
    • T1
      • RAL: Data transfers problem ALARM ticket (GGUS:133082); Solved.
      • IN2P3: Data transfers problem (GGUS:133081); Solved, but there was no reply on the ticket for two days.
      • SARA: No running jobs (GGUS:133089)

22 January

  • Activity
    • HLT farm "returning" from cooling maintenance (no jobs running yet)
    • 2016 data restripping running full steam. Almost all data processed (waiting for CNAF)
    • Monte Carlo productions using remaining resources.

  • Meltdown & Spectre, several voboxes rebooting this week.

  • Site Issues
    • CERN/T0
      • NTR
    • T1
      • GRIDKA: problems with FTS transfers (from and to) and "put and register" (fixed; checked during the meeting)

15 January

  • Activity
    • Running at maximum possible amount of resources. HLT farm stopped yesterday and returns "when cooling is stable again"
    • 2016 data restripping running full steam. Approx 1/2 of data processed (without CNAF) during YETS
    • Monte Carlo productions using remaining resources.

  • Meltdown & Spectre: the performance hit after the fix is expected to be less critical for data processing and Monte Carlo jobs (which account for the vast majority of work carried out).
    • voboxes patch: reboot will be tomorrow.

  • Site Issues
    • CERN/T0
      • NTR
    • T1
      • RRCKI problems with FTS transfers currently under investigation.
      • RAL had issues during the weekend. "Burst" (jobs) reduced and all looks OK today.

8 January

  • Activity
    • Running at maximum possible amount of resources, including fully available HLT farm during YETS
    • 2016 data restripping running full steam. Approx 1/2 of data processed (without CNAF) during YETS
    • Monte Carlo productions using remaining resources.

  • Meltdown & Spectre: the performance hit after the fix is expected to be less critical for data processing and Monte Carlo jobs (which account for the vast majority of work carried out).
    • Need to patch voboxes, waiting for instructions from CERN

  • Site Issues
    • CERN/T0
      • ALARM ticket (GGUS:132628) for EOS transfer problems fixed internally by LHCb
    • T1
      • RRCKI problems with FTS transfers currently under investigation

18 December

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • T1
      • RAL: problems with file upload (GGUS:132540) - possibly solved. Internal ticket opened about pilots killed at RAL (not by LHCb).
      • SARA : Waiting for end of downtime.
      • Missing files : RAW files found missing at RRCKI (recovered), PIC(recovered) and IN2P3 (under investigation).
    • CERN :
      • Brief downtime of multiple database services yesterday. Also possibly a similar issue last week too.
      • Staging failures (GGUS:132516) - we hope that the 3day timeout request is not for long term.
      • Missing files on tape (GGUS:132525) - solved?

11 December

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • T1
      • RAL: problems with file download from Castor (GGUS:132356)
      • RRC-KI: Downtime for the tape storage update, should be finished now
      • FZK: network maintenance foreseen on 12 Dec, expect possible temporary connectivity problems; temporary file unavailability due to disk pool migration (should be mostly transparent for the users)

27 November

  • General
    • Almost no free disk space left; still waiting for complete deployment of the 2017 disk pledge

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • T1
      • SARA: problems with transfers today (GGUS:132067), no longer observing
      • RRC-KI: problems with file access, reported as fixed
      • FZK: one WN without CVMFS (GGUS:132064), solved close to instantly

20 November

  • Activity
    • Stripping validation, user analysis, MC

  • Site Issues
    • NTR

13 November

  • Activity
    • New round of stripping validation before launching the campaign.

  • Site Issues
    • INFN-T1:
      • Several issues b/c of the site outage in all areas of experiment distributed computing. Currently working on an analysis of the situation also in view of upcoming data processing campaigns.

1 November

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • Validation for restripping completed; waiting for response

  • Site Issues
    • T0:
      • User had incorrect mapping at EOS; fixed

    • T1:
      • NTR

30 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging progressing well

  • Site Issues
    • T0:
      • NTR

23 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging approx 50% complete, progressing well

  • Site Issues
    • T0:
      • NTR

    • T1:
      • NTR

16 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging of 2015 data for reprocessing progressing well, ~ 1/3 of data on disk buffers

  • Site Issues
    • T1:
      • INFN-T1 tape buffer running full, fixed by site admins
      • RAL disk server down with effects on production workflows

  • Aob
    • Request grid wide deployment of latest HepOSlibs meta-rpm, including deployment of git client.

09 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging of 2015 data for reprocessing has started and will continue over the coming weeks.

  • Site Issues

    • T1:
      • Failures in transfers to and from RRCKI over the weekend, solved now.
      • NL-T1 worker nodes in downtime tomorrow and Wednesday.
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne

02 October

  • Activity
    • Monte Carlo simulation, data processing and user analysis
    • pre-staging of 2015 data for reprocessing has started and will continue over the coming weeks.

  • Site Issues

    • T1:
      • Failures in transfers to and from GRIDKA (GGUS:130848); this was due to heavy load on dCache. It is stable now.
      • File upload and download failures at CNAF due to a hardware failure, which is already fixed.
      • Missing expatbuilder at NIKHEF-ELPROD (GGUS:130832); solved
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne

    • T2:
      • Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715); solved

25 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis (running more than 100K jobs)

  • Site Issues

    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne
      • Problem downloading from SARA (GGUS:130692). Solved promptly by SARA - thanks. Problem with stuck dCache space manager

    • T2:
      • Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715)

18 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis (running more than 100K jobs)

  • Site Issues

    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); "Geant have confirmed that they are unable to ping mouse1.grid.sara.nl from geant-lhcone-gw.mx1.lon.uk.geant.net"

11 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues

    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); "Geant have confirmed that they are unable to ping mouse1.grid.sara.nl from geant-lhcone-gw.mx1.lon.uk.geant.net"
      • Access file problem at GRIDKA (GGUS:130478)

4 September

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Incomplete python installation at worker nodes (GGUS:130018)
    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946); no news
      • Failed transfers from many sites to dCache sites, see (GGUS:130190); Resolved by using proper parameter in SRM
      • We see a peak of failed transfers at EOS every day at 5:00 (GGUS:130335)

28 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Incomplete python installation at worker nodes (GGUS:130018)
    • T1:
      • Failed transfers from IC to SARA (IPV6) (GGUS:129946)
      • Failed transfers from many sites to dCache sites, see (GGUS:130190)

21 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Problem with EOS during the night between Friday and Saturday (GGUS:130137). Became an alarm ticket on Saturday morning. Fixed now; 3/7 grid-ftp doors were misbehaving

    • T1:
      • NTR

14 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Python installation possibly broken on multiple WNs (GGUS:130018) - ongoing issue

    • T1:
      • Problems uploading to various SEs - For SARA, tracked in GGUS:129946. Now also seen in FZK, IN2P3 and PIC - to be tracked and tickets opened if needed.

7 Aug (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • Key VO Box (lbvobox103) unavailable (lost?) due to hypervisor problem (GGUS:129942). No GGUS (or Service Now) updates since yesterday morning. Having to recreate services on other VO Boxes.
    • T1:
      • NTR

31 Jul (Monday)

  • Activity
    • Monte Carlo simulation, data processing and user analysis

  • Site Issues
    • T0:
      • NTR
    • T1:
      • NTR

17 Jul (Monday)

  • Activity
    • Lots of user analysis (some failing) and Monte Carlo simulation

  • Site Issues
    • T0: Jobs held at HTCondor CEs CERN-PROD (GGUS:129147)
    • T1:
      • NTR

10 Jul (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues
    • T0: Jobs held at HTCondor CEs CERN-PROD (GGUS:129147)
    • T1:
      • NTR

03 Jul (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues
    • T0: Jobs held at HTCondor CEs CERN-PROD (GGUS:129147)
    • T1:
      • IN2P3: Downloads and Uploads issues during weekend, fixed

26 Jun (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues

29 May (Monday)

  • Activity
    • User analysis and Monte Carlo simulation

  • Site Issues
    • T0:
      • Network/DNS outage yesterday caused problems for a few hours. All recovered now.

29 May (Monday)

  • Activity
    • User analysis and Monte Carlo simulation
    • Stripping v24 is almost finished.

  • Site Issues

22 May (Monday)

  • Activity
    • User analysis and Monte Carlo simulation
    • New validation of Stripping v24 has been started.

  • Site Issues
    • T1:
      • RAL: disk server failures during the weekend

15 May (Monday)

  • Activity
    • Stripping v24 waiting for developers. Almost 100k jobs running

  • Site Issues
    • T0:
      • EOS downtime this morning.

8 May (Monday)

  • Activity
    • Stripping v28 over, Stripping v24 waiting for developers.

  • Site Issues
    • T1:
      • RAL: Disk server failure last week, back in production today. Some FTS3 timeouts during staging.

24 April (Monday)

  • Activity
    • MC Simulation, Data Stripping and user analysis

  • Site Issues
    • T0: Some ongoing problems with Condor CEs (GGUS:127553)
    • T1:
      • RAL: Still running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617). Hoping these problems will be fixed by the CASTOR upgrade a week on Wednesday

18 April (Tuesday)

  • Activity
    • MC Simulation, Data Stripping and user analysis
    • Staging campaigns are ongoing for Data Stripping.

  • Site Issues
    • T0: SRM problems fixed quickly last week (GGUS:127638)
    • T1:
      • CNAF: Uploading problems over the weekend fixed (GGUS:127728). They were due to another VO's GPFS usage pattern.
      • RAL: running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617) but better than the situation before the version downgrade.
      • RRCKI: running with a limit on the number of user jobs due to limits on concurrent open files in dCache (no GGUS for this)
    • T2: Seeing SL6.9 openssl problems at several sites. Tickets issued.

10 April (Monday)

  • Activity
    • MC Simulation, Data Stripping and user analysis
    • Staging campaigns are ongoing for Data Stripping.

  • Site Issues
    • T0:
      • Transfer errors from the job: could not open connection to srm-eoslhcb.cern.ch (GGUS:127638)

    • T1:
      • RAL: two alarm tickets were opened during the weekend:
      • CNAF: failed contact to the SRM: could not open connection to storm-fe-lhcb.cr.cnaf.infn.it:8444 (GGUS:127608)

03 April (Monday)

  • Activity
    • MC Simulation, Stripping
    • Staging campaign for Stripping27, Stripping28 and Stripping24b, as well as 2015 EM should take 6 to 7 weeks with peaks of staging.

  • Site Issues
    • T0:
      • The 3 GridFTP doors were saturated. Added 2 new ones.

    • T1:
      • RAL: suffering a major issue with SRM. Under investigation.
      • CNAF: Stager was blocked for a while
      • FZK: Seem to have found a somewhat correct balance between timeouts and performance for transfers

27 March (Monday)

  • Activity
    • MC Simulation, Stripping
    • Staging campaign for Stripping27, Stripping28 and Stripping24b should take 6 to 7 weeks with peaks of staging.

  • Site Issues

    • T1:
      • RAL: disk server gdss780 is currently unavailable.
      • CNAF: Added an additional drive for staging.
      • PIC: LTO5 drive is supposed to be replaced today. Could be slower than usual.
      • FZK: FTS transfers fail (GGUS:127301). Under investigation.

20 March (Monday)

  • Activity
    • MC Simulation, Stripping
    • Database backup locking and long queries from us on Friday caused severe disruption to the LHCb production management system over the weekend and into today, both for data and MC. A lot of manual work has been done to resolve inconsistencies.

  • Site Issues
    • T0:
      • ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights. No update since 9th March.
      • GGUS:127148: jobs are being killed (rather than just limited by cgroups) when they use more than 2GB of physical memory under contention. The LHCb VO ID card requests 4GB of virtual memory, and jobs typically run with significantly less than 2GB RSS for almost all of their duration.
    • T1:
      • FZK: Some ongoing issues with submission timeouts to the new ARC CEs with arc-2-kit not working at all (GGUS:127075). Also GGUS:127122 with transfer timeouts causing lots of queued transfers in our production system.
      • CNAF: A number of file transfer failures (GGUS:127129), but this problem seems to be resolved now. We have also had files which appear to have been transferred successfully but are not actually there; this appears to be a consequence of the database problems we had on Friday rather than a CNAF issue.

13 March (Monday)

  • Activity
    • MC Simulation, Stripping campaign now started so tape systems will start to be hit

  • Site Issues
    • T0:
      • ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights. No updates since 2nd March - any more news on fixes?
      • Ready for network outages on Wednesday morning - Thanks for shifting the DT from 22nd to 15th as well!

    • T1:
      • FZK: Some ongoing issues with submission timeouts to the new ARC CEs with arc-2-kit not working at all (GGUS:127075)

6 March (Monday)

  • Activity
    • MC Simulation, Stripping campaign to start this week which will increase load on T1 tape systems

  • Site Issues
    • T0:
      • Wed: ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights.
      • Observed CVMFS failures on batch and cloud machines (GGUS:126876). Failure rate decreased now

    • T1:
      • SARA: SRM problems over the week-end (GGUS:126937). Currently cannot test if fixed b/c site is in DT
      • FZK: Switching to ARC-CEs only. Last week sw update for ARC-CEs produced failures (GGUS:126882). CREAM-CE submission already stopped from LHCb side.
      • PIC: Currently in DT for dCache upgrade. Batch closed but CEs open --> produces aborted pilots on LHCb side.

27th February (Monday)

  • Activity
    • MC Simulation, user analysis and data reconstruction jobs

  • Site Issues
    • T0:
      • Some settings were changed at EOS SRM and should have fixed last week's problem
      • Intervention on LHCb offline production database (LHCBR) to new hardware Wednesday 01/03/2017 from 10am to 12pm

20th February (Monday)

  • Activity
    • MC Simulation, user analysis and data reconstruction jobs

  • Site Issues
    • NTR

13th February (Monday)

  • Activity
    • MC Simulation, user analysis and data reconstruction jobs

  • Site Issues
    • T0:
      • Second instance of SRM for EOS LHCb is in production. Original EOS SRM reports zero available space from time to time.
    • T1:
      • CNAF: Downtime for 3 days
      • SARA: Downtime tomorrow (1 hour)

6th February (Monday)

  • Activity
    • MC Simulation and user analysis; reco jobs starting again

  • Site Issues
    • T0:
      • Major problems with SRM for LHCb use of EOS, unable to use it for most of the weekend (GGUS:126378). This led to loss of the results from 10,000s of jobs on HLT farm as it only connects to CERN. Appears to be resolved now: initial overloading led to avalanche of failures and retries, all increasing the load. Looking at ways to avoid this with IT and within LHCb.
    • T1:
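
The retry avalanche described in the T0 item above (failed uploads retried immediately, each retry adding load) is commonly damped with capped exponential backoff plus random jitter. The following is only a generic sketch of that technique, not what LHCb or CERN IT actually deployed; `backoff_delay`, `upload_with_backoff` and the `upload` callable are hypothetical names introduced for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return a sleep time in seconds for the given 0-based retry attempt.

    The ceiling doubles on every attempt up to `cap`; the actual delay is a
    random fraction of that ceiling ("full jitter"), so that many clients
    failing at the same moment do not all retry at the same moment.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling


def upload_with_backoff(upload, max_attempts=5, base=1.0, cap=60.0,
                        sleep=time.sleep, rng=random.random):
    """Call `upload()` until it succeeds or `max_attempts` is reached.

    `upload` is a hypothetical zero-argument callable standing in for a
    storage transfer; it is assumed to raise an exception on failure.
    The last failure is re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return upload()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: give up rather than add more load
            sleep(backoff_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters exist only so the behaviour can be checked without real waiting; the values of `base`, `cap` and `max_attempts` are placeholders, not tuned recommendations.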

30th January (Monday)

  • Activity
    • MC Simulation and user analysis

23rd January (Monday)

  • Activity
    • Mainly running simulation on grid only resources, HLT back and running ~10K jobs. ~67K jobs total

  • Site Issues
    • T0:
      • EOSLHCB "very slow" via SRM last week. Back to normal (GGUS:126037)
    • T1:
      • RAL: Storage in Downtime today from 10:30 to 12:30
      • CNAF: announced a Downtime on 13th and 14th February for changing of core switch

16th January (Monday)

  • Activity
    • Mainly running simulation on grid only resources, HLT off b/c of maintenance

  • Site Issues
    • T0:
      • The LHCb internal name of the T0 batch resources has been renamed from LCG.CERN.ch to LCG.CERN.cern (to distinguish it from other .ch resources)
    • T1:
      • RAL: file access issue: User cannot open files (GGUS:125856), will be followed up after the Wed DT

9th January (Monday)

  • Activity
    • Very high activity during the Christmas break: running more than 100k jobs (new record for LHCb!)
    • Data reconstruction (proton-ion) almost finished, MC and user analysis.

  • Site Issues
    • T0:
      • NTR
    • T1:
      • Transfer problem from GRIDKA to CBPF (GGUS:125789)
      • RAL: file access issue: users cannot open files (GGUS:125856).

Topic revision: r1874 - 2019-06-03 - KonradKlimaszewski
 