<noautolink> ---+ *Daily WLCG Operations calls* :: collection of LHCb reports. Starting from April 2009, this TWiki collects all the LHCb reports given to the *weekly WLCG calls* at 3pm Geneva time on Mondays and Thursdays. These calls are attended by the LHCb Grid expert and/or Concezio Bozzi (on behalf of LHCb). CONNECTION details: For remote participation we use the Vidyo system. Instructions can be found [[https://indico.cern.ch/conferenceDisplay.py?confId=287280][here]] (Deprecated: [[https://audioconf.cern.ch/cgi-bin/conference?access_code=0119168][Alcatel URL]]). These reports have to be duly compiled by the GEOC as part of their mandate. *Previous reports are available per year:* <br /> [[ProductionOperationsWLCG2018Reports][2018]] [[ProductionOperationsWLCG2017Reports][2017]] [[ProductionOperationsWLCG2016Reports][2016]] [[ProductionOperationsWLCG2015Reports][2015]] [[ProductionOperationsWLCG2014Reports][2014]] [[ProductionOperationsWLCG2013Reports][2013]] [[ProductionOperationsWLCG2012Reports][2012]] [[ProductionOperationsWLCG2011Reports][2011]] [[ProductionOperationsWLCG2010Reports][2010]] [[ProductionOperationsWLCG2009Reports][2009]] <br />%TOC{title="Jump to a date:"}% ---++ 26th September 2022 * No significant issues to report - business as usual. ---++ 19th September 2022 * Issues: * SARA: * Issues with pilot submission this morning. Restart of services seems to have solved the problem. (GGUS:158934) * PIC: * Pilot submission problems (GGUS:158857). * Investigations ongoing. Error found in CE logs but the cause is unknown at present. * RRCKI: * Pilots getting killed due to hitting the 8GB virtual memory limit * Question: Is there a general WLCG policy for virtual memory limits? ---++ 12th September 2022 * Issues: * IN2P3: * Another problem with pilot submission, apparently similar to last week's (GGUS:158856) * PIC: * Pilot submission problems (GGUS:158857). * Investigations ongoing. Seems to be a problem on only one CE ---++ 5th September 2022 * Issues: * IN2P3: * Pilot submission problems (GGUS:158719) * NIKHEF: * Pilot submission problems (GGUS:158718). Investigations ongoing. ---++ 29th August 2022 * Apologies - Mark S may not be able to make the meeting * Issues: * RAL: - Problems with jobs uploading data (GGUS:158574) * SARA: - Problems with data access. Still investigating on the LHCb side but is there any known problem on the SARA side? ---++ 22nd August 2022 * Issues: * RAL: - New hardware added to help with the slow gateways problem (GGUS:156492) - Echo deletion problems (GGUS:155120) - Issue of the XRootd proxy serialising transfers identified; being worked on * Timeout issues at SARA found to be due to a limit on simultaneous connections (GGUS:153653) - Solution proposed of increasing the number of connections and using the '-n' flag for hadd ---++ 15th August 2022 * Issues: * Any updates on the connection timeout issues at SARA (GGUS:153653)? * Current status is that it is reproducible on both IPv6 and IPv4 on a test VM ---++ 8th August 2022 * Apologies - Mark S is away so can't join the meeting.
 * Issues: * Work ongoing on connection timeout issues at SARA (GGUS:153653) * Current status is that it is reproducible on both IPv6 and IPv4 on a test VM ---++ 1st August 2022 * Activity: * DIRAC in downtime from tonight for ~24 hours for updates to MySQL DBs * Issues: * Network configuration changed between PIC and CNAF to fix asymmetry (GGUS:157955, GGUS:158004) * Looks to have worked but will keep an eye on it * Ongoing connection timeout issues at SARA (GGUS:153653) * Possibly isolated to IPv6, but tests ongoing * Becoming a significant issue as data needs to be transferred to different sites to be processed * RAL slow deletion problem update (GGUS:155120) * Request to try to add a timeout to gfal removal requests but this is not available in the gfal client at present ---++ 25 July 2022 * Issues: * Work ongoing on the issues transferring between PIC and CNAF/INFN (GGUS:157955, GGUS:158004) * Asymmetry in the route between CNAF and PIC identified. * WLCG Network Throughput team is following up with GARR, CNAF and PIC * Significant transfer problems to RAL on Friday due to CEPH key ring issues (GGUS:158120). Solved in good order. ---++ 18 July 2022 * Issues: * Work ongoing on the issues transferring between PIC and CNAF/INFN (GGUS:157955, GGUS:158004) * Asymmetry in the route between CNAF and PIC identified. ---++ 11 July 2022 * Issues: * Issues transferring between PIC and CNAF/INFN. * Investigations on the PIC side seem to show routing problems to some IPs (GGUS:157955) * Ticket now opened against CNAF/INFN to see if they can verify this. (GGUS:158004) ---++ 27 September 2021 * Activity: * MC and WG productions; Restripping ongoing, user jobs * Issues: * NTR ---++ 20 September 2021 * Activity: * MC and WG productions; user jobs * Issues: * NTR ---++ 13 September 2021 * Activity: * MC and WG productions; user jobs * Issues: * NTR ---++ 6 September 2021 * Activity: * MC and WG productions; User jobs. * Issues: * RAL: Found a workaround for transfer issues at RAL. Problems caused by a script being sourced on RAL WNs that set GFAL (and other) env variables (GGUS:153532) ---++ 30 August 2021 * Activity: * MC and WG productions; Staging Ongoing; User jobs. * Issues: * RAL: Still have transfer issues but progress is being made (GGUS:153532) * CNAF: Cannot access some files locally. Investigations ongoing (GGUS:153578) * If you Google 'GOCDB' you get directed to the pre-prod site and there's no discernible difference except the DTs aren't registered properly. Can a banner be added please? ---++ 16 August 2021 * Activity: * MC and WG productions; Staging Ongoing; User jobs. * Re-Stripping Campaign *might* start next week * Issues: * RAL: Had an issue with Downtime over the weekend ---++ 2 August 2021 * Activity: * MC and WG productions; Some Staging; User jobs. * Issues: * NTR ---++ 26 July 2021 * Activity: * MC and WG productions; User jobs. * Large stripping campaign is being started soon - this will involve significant staging from everywhere * Issues: * CERN: Issues with FTS transfers over the last week. CTA admins investigating. (GGUS:153132) ---++ 19 July 2021 * Activity: * MC and WG productions; User jobs. * Issues: * NIKHEF: Issues with one CE (brug.nikhef.nl) for the last week or so. Pilots being aborted and now submission problems (GGUS:152946) ---++ 12 July 2021 * Activity: * MC and WG productions; User jobs. * Issues: * NTR ---++ 5 July 2021 * Activity: * MC and WG productions; User jobs.
* Issues: * CERN: Requested an increase of 1.5 PB to continue staging (GGUS:151868) * IN2P3: Failed transfers due to faulty RAID controller (GGUS:152859) ---++ 7 June 2021 * Activity: * MC and WG productions; User jobs. * Issues: * NTR ---++ 29 March 2021 * Activity: * MC and WG productions; User jobs. * Issues: * T0+1: Main issue is staging CTA -> RAL. Addressed in (GGUS:150898) ---++ 22 March 2021 * Activity: * MC and WG productions; User jobs. * Issues: * CERN: Migration to CTA fully over. In production. Some failed FTS transfers to RAL (checksum issue), some TPC issues found and 2 xroot issues open * T0+1: moving to using Singularity for payload isolation. Few issues found at RAL last week (Singularity inside Docker), would need to re-check if still there. ---++ 15 March 2021 * Activity: * MC and WG productions; User jobs. * Issues: * CERN: Migration to CTA is ongoing. Testing, re-opening the valves this afternoon if everything's OK. ---++ 01 March 2021 * Activity: * MC and WG productions; User jobs. * Issues: * CERN: Migration to CTA to start this week, putting CASTOR in read-only mode on Wednesday. ---++ 22 February 2021 * Activity: * MC and WG productions; User jobs. * Issues: * CERN: ticket (GGUS:150647) for checksum issues, in progress * Migration to CTA might start next week, and we might stop CASTOR this week * Submissions to HTCondorCEs not anymore problematic (some tweaking on our side) ---++ 15 February 2021 * Activity: * MC and WG productions; User jobs. * Issues: * CERN: ticket (GGUS:150584) for jobs HELD at CERN * ticket was closed - we lost pilots there. * IN2P3:Job submission issue (GGUS:150406) * was closed * GRIDKA: Job submission issue (GGUS:150403) * from the above tickets, and from other ones: we have a general issue on how to run on HtCondorCEs. We don't want to run a local schedd as this is not something formally requested. As a temporary measure, we have been adding a line in our submission string so that the job (pilot) outputs is deleted in the next 24 hours, but this might be not ideal/fragile. A proper solution will not come easily (would require development) and this is not on the horizon. ---++ 01 February 2021 * Activity: * MC and WG productions; User jobs. * Issues: * IN2P3:Job submission issue (GGUS:150406) * GRIDKA: Job submission issue (GGUS:150403) ---++ 11 January 2021 * Activity: * MC and WG productions; User jobs. * Issues: * IN2P3: xroot TPC issues. Investigation ongoing * NL-T1: preparation for dCache namespace migration ongoing ---++ 04 January 2021 * Activity: * MC and WG productions; User jobs. * Issues: * IN2P3: Number of Running jobs decreased: (GGUS:150078) ---++ 14th December 2020 * NTR * Activity: * MC and WG productions; User jobs. ---++ 30 November 2020 * PIC dCache namespace migration today without downtime * Activity: * MC and WG productions; User jobs. ---++ 16 November 2020 * IN2P3 dCache namespace migration to be performed 17/11/20 (tomorrow) without downtime * Activity: * MC and WG productions; User jobs. ---++ 09 November 2020 * IN2P3 dCache namespace migration to be performed 17/11/20 without downtime * Activity: * MC and WG productions; User jobs. ---++ 02 November 2020 * Activity: * MC and WG productions; User jobs. ---++ 26 October 2020 * Activity: * MC and WG productions; User jobs. ---++ 19 October 2020 * Activity: * MC and WG productions; User jobs. * Issues: * RAL: Issue with one user unable to access data due to auth issue (GGUS:148701) Resolved ---++ 21 September 2020 * Activity: * MC and WG productions; User jobs. 
 * Issues: * RAL: Issue with one user unable to access data due to auth issue (GGUS:148701) ---++ 7 September 2020 * Activity: * MC and WG productions; User jobs. * Issues: * CNAF: Issue discovered with Storm release from ~10 days ago that causes puppet to restart the service. Puppet temporarily stopped until the fix is released. * IN2P3: Issue with users unable to download data from storage due to auth problem (GGUS:148550) ---++ 31 August 2020 * Activity: * MC and WG productions; User jobs. * Issues: ---++ 24 August 2020 * Activity: * Usual MC, user and WG production. * Issues: * SARA/NIKHEF, GGUS:148321 : Timeout during transfers from SARA-BUFFER. Already investigated last week and now the problem is much mitigated but still there. Will update the ticket. ---++ 10 August 2020 * Activity: * Usual MC, user and WG production. * Issues: * IN2P3 decreased number of running jobs : Fixed * RAL FTS3 transfers issue : GGUS:148187 : Fixed ---++ 03 August 2020 * Activity: * Usual MC, user and WG production. * Issues: * IN2P3 failing pilots: GGUS:148116 * GridKa IPv6 issues: GGUS:148115 ---++ 27 July 2020 * Activity: * Usual MC, user and WG production. * Namespace migration of dCache completed last week. Many thanks to GridKa and dCache experts for help with this. ---++ 20 July 2020 * Activity: * Usual MC, user and WG production. * Tomorrow, dCache namespace re-ordering at Gridka * Exceptionally agreed to have two T1s (CNAF/GRIDKA) down at the same time ---++ 13 JULY 2020 * Activity: * Usual MC, user and WG production. * Preparing Bookkeeping Downtime for Oracle DB upgrade on Wednesday ---++ 29 JUNE 2020 * Activity: * Usual MC, user and WG production. * Issues: * PIC: DATA transfer problem GGUS:147655 ---++ 15 JUNE 2020 * Activity: * Usual MC, user and WG production. * Issues: * CERN: DATA transfer problem GGUS:147426 (fixed) * FZK-LCG2: CE unavailable GGUS:147431 (fixed) * RAL: low number of running jobs ---++ 08 JUNE 2020 * Activity: * Usual MC, user and WG production. Started the restripping of PbNe 2018 data in the middle of last week (should be finished in 1 or 2 days). * Issues: * PIC: Tape buffer became full due to data recall for the restripping campaign. Now buffer size increased (GGUS:147344). ---++ 01 JUNE 2020 * Activity: * Usual MC and working group productions * Issues: Nothing new to report. ---++ 25 May 2020 * Activity: * Usual MC and working group productions * Issues: Nothing new to report. ---++ 18 May 2020 * Activity: * Ongoing WG MC productions, Heavy Ion 2018 stripping validation * Issues: Nothing new to report ---++ 11 May 2020 * Activity: * Ongoing WG MC productions * Issues: * GRIDKA: Some ongoing file access issues. FZK are waiting on the vendor. (GGUS:146379) ---++ 4 May 2020 * Activity: * Ongoing WG MC productions * We have ticketed all Tier 2 sites that are still running SL6 to ask them to upgrade to CC7. Not surprisingly, they have mostly said it'll have to wait until after quarantine! * Issues: * CERN: Issues with file access on EOS. Solved and ticket to be closed (GGUS:146673) * GRIDKA: Some ongoing file access issues. FZK are waiting on the vendor.
(GGUS:146379) ---++ 27 April 2020 * Activity: * Preparing for the next Stripping round * Ongoing WG MC productions * Issues: * SARA: Data transfer problems (GGUS:146658) ---++ 20 April 2020 * Activity: * Preparing for the next Stripping round * Ongoing WG MC productions ---++ 6 April 2020 * Activity: * Staging for next stripping round. ---++ 30 March 2020 * Activity: * Stripping. * Staging for next stripping round. * Issues: * PIC: Data transfer problem (GGUS:145956) ---++ 23 March 2020 * Activity: * Stripping campaign is finished. * Staging is finished. * Issues: * NTR ---++ 16 March 2020 * Activity: * Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s) * Staging for 2016 is almost finished. Staging for 2017 is ongoing. * Issues: * NTR ---++ 24 February 2020 * Activity: * Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s) * Staging for 2016 is almost finished. Staging for 2017 is ongoing. * Validating "new" role for SAM/ETF tests. * Issues: * CERN: News about the issues last week? (no DT?). * CNAF: low level of running jobs (GGUS: 145692). No reply yet. ---++ 17 February 2020 * Activity: * Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s) * Staging for 2016 is almost finished. Staging for 2017 is ongoing. * Issues: * no significant issues ---++ 10 February 2020 * Activity: * Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s) * Staging re-started last week, almost finished everywhere * Issues: * no significant issues ---++ 27 January 2020 * Activity: * Stripping campaign ongoing * Staging will re-start this week based on the stripping activity * Issues: * no significant issues ---++ 20 January 2020 * Activity: * Heavy staging at sites will start again during the week because of stripping campaign ---++ 13 January 2020 * Activity: * Running MC simulation, WG productions, user analysis. * Issues: * RAL: failing access by protocol to files in ECHO ---++ 6 January 2020 * Activity: * Running MC simulation, WG productions, user analysis. * Issues: * RAL: failing access by protocol to files in ECHO ---++ 16 December 2019 * Activity: * Running MC simulation, WG productions, user analysis. * Issues: * RAL: failing access by protocol to files in ECHO ---++ 9 December 2019 * Activity: * Running at ~100K jobs: MC simulation, WG productions, user analysis. ---++ 2 December 2019 * Activity: * Smooth running at ~100K jobs: MC simulation, WG productions, user analysis. * Had already staged 2015,2016,2017 data for the re-strippign campaign, awaiting for physics groups validation * Issues: * no significant issues ---++ 25 November 2019 * Activity: * Smooth running at ~120K jobs: MC simulation, WG productions, user analysis. * have staged 2015,2016,2017 data for the re-strippign campaign, awaiting for validation by physics groups * Issues: * no significant issues ---++ 18 November 2019 * Activity: * Smooth running at ~120K jobs: MC simulation, WG productions, user analysis. * have staged 2015,2016,2017 data for the re-strippign campaign, awaiting for validation by physics groups * Issues: * no significant issues ---++ 11 November 2019 * Activity: * Usual MC, user jobs and data restripping. 
 * Continuing staging (tape recall) at all T1s * Issues: * Slow transfers (timeouts) for data upload from WNs to external SEs at RAL and FZK (being looked at) * FTS transfer failures from SARA (presumably network problems at some pool nodes, nodes rebooted) * RAL: failing direct access to files in ECHO if several applications access the same file simultaneously * IN2P3 - singularity is still not available, becoming urgent ---++ 04 November 2019 * Activity: * MC, user jobs and data restripping. * Continuing staging (tape recall) at all T1s * Issues: * NTR new ---++ 28 October 2019 * Activity: * MC, user jobs and data restripping. * Continuing staging (tape recall) at all T1s * Issues: * CNAF: Files cannot be accessed (ticket: 143816, in progress) * GRIDKA: ARC CE unavailable (ticket: 143814, in progress) * IN2P3: It's the only T1 that is not ready for singularity. No representative! ---++ 21 October 2019 * Activity: * MC, user jobs and data restripping. * Continuing staging (tape recall) at all T1s * Issues: * GRIDKA: Issue with data transfers on Saturday morning, fixed after a few hours. Investigating for lost files. ---++ 30 September 2019 * Activity: * MC, user jobs and data restripping. * Continuing staging (tape recall) at all T1s * Issues: * RAL: * GGUS:142350; (old ticket) Under investigation: problem with user jobs at RAL. * GGUS:143323; Problem deleting files on ECHO at RAL. * RAL running jobs at a much lower level since last Wednesday than previously ---++ 23 September 2019 * Activity: * MC, user jobs and data restripping. * Continuing staging (tape recall) at all T1s * Issues: * CERN: * GGUS:143301; Unavailable files on EOS * RRCKI: * GGUS:143318; very low staging efficiency * NIKHEF: * GGUS:143318; aborted pilots * RAL: * GGUS:142350; (old ticket) Under investigation. User jobs increased, no queue. Issue seems to continue. ---++ 16 September 2019 * Activity: * MC, user jobs and data restripping. * Continuing staging at T1s * Issues: * RAL: * GGUS:142350; (old ticket) Under investigation. User jobs increased, no queue. Issue seems to continue. ---++ 9 September 2019 * Activity: * MC, user jobs and data restripping. * Massive staging at all T1s * Issues: * RAL: * GGUS:142350; Under investigation. User jobs increased, no queue. Issue seems to continue. ---++ 2 September 2019 * Activity: * MC, user jobs and data restripping. * Massive staging at all T1s * Issues: * RAL: * GGUS:142350; Under investigation. User jobs increased, no queue. Issue seems to continue. ---++ 26 August 2019 * Activity: * MC, user jobs and data restripping. * Massive staging at all T1s * Issues: * RAL: * GGUS:142350; still issues accessing files on ECHO. Under investigation. * CNAF: * Data transfer problem GGUS:142803; Fixed ---++ 19 August 2019 * Activity: * MC, user jobs and data restripping. * Massive staging at all T1s * Issues: * RAL: * GGUS:142350; still issues accessing files on ECHO. Under investigation. * CNAF: * Outage. DT extended until Wed. * IN2P3 * Staging problem ---++ 12 August 2019 * Activity: * MC, user jobs and staging. * In the coming weeks, we will perform a lot of tape recall. * Issues: * RAL: * GGUS:142350; still issues accessing files on ECHO. Under investigation. * CNAF: * Outage since mid last week. DT extended until Wed. approx. * PIC: * Problem with ONLINE.git access (ticket opened: 142673). Response: heavy load, response "not optimal" * IN2P3 * FTS failures IN2P3-RDST -> BUFFER (ticket opened: 142670). Thousands of "[SE][StatusOfBringOnlineRequest][SRM_FAILURE]".
 No response on the ticket. * NIKHEF * No jobs running (142680), coming back now (this morning). Ticket can be closed. ---++ 5 August 2019 * Activity: * MC, user jobs and staging. * In the coming weeks we will perform a lot of tape recall. * Remind T2D sites that they should close their solved tickets * Issues: * CBPF: * GGUS:142580; Network problems. * RAL: * GGUS:142350; still issues accessing files on ECHO. Under investigation. ---++ 29 July 2019 * Activity: * MC, user jobs and staging. * Important to have all sites with split aliases, otherwise we need to maintain a hack in the code. Still missing SARA (split is difficult from their side). * Issues: * RAL: * GGUS:142350; accessing files problem on ECHO. Under investigation. ---++ 22 July 2019 * Activity: * MC, user jobs and staging. Validation for reprocessing of 2011 and 2012 data finished, will start productions soon. * Almost all sites agreed to split aliases for end-points of disk and tape. Still missing SARA (split is difficult from their side). Important to have all sites with split aliases, otherwise we need to maintain a hack in the code. * Issues: * RAL: * GGUS:142350; problem accessing files on ECHO. Under investigation. * GGUS:142337; issues with killed pilots. Under investigation ---++ 15 July 2019 * Activity: * MC, user jobs and staging. * Issues: * CNAF: All data transfers Failed at INFN-T1; ticket GGUS:142239; fixed ---++ 26 June 2019 * Activity: * MC, user jobs and re-stripping of 2018 data. * Issues: * SARA-MATRIX: File access problem; alarm ticket GGUS:141783; fixed * CNAF: Data transfers failure; alarm ticket GGUS:141790; fixed * RAL: Problem with CASTOR GGUS:141872 ---++ 17th June 2019 * Activity: * Usual activity running at 110k jobs: MC, user jobs and re-stripping of 2018 data. * Issues: * NTR ---++ 3rd June 2019 * Smooth running at ~100K jobs, Usual activity * User jobs, MC productions, and WG productions this week * Issues * RAL: * Timeouts when accessing job input data (GGUS:141462) * Auth failures for accessing files by user jobs (GGUS:141262) * CERN: * Poor transfer efficiency from CERN WN to outside storage GGUS:141112 ---++ 27th May 2019 * Smooth running at ~100K jobs, Usual activity * User jobs, MC productions, and WG productions this week * issues which are not significant, but potentially may be of interest to other experiments: * in progress: Poor transfer efficiency from CERN WN to outside storage GGUS:141112 * Users getting: [FATAL] Auth failed at RAL GGUS:141262 ---++ 20th May 2019 * Usual activity * User jobs, MC productions, and staging this week * no significant issues to report ---++ 13th May 2019 * Activity * User jobs, MC productions, and staging this week * Issues * CERN: * since Friday, communication problems between lbvoboxes and the CNAF CE (job submission gets stuck) https://cern.service-now.com/service-portal/view-incident.do?n=INC1983869 * observe low efficiency of transfers from CERN WNs to outside SEs (GGUS:141112) (The problem starts from ~8th Apr, but the amount of such transfers is very low, so they are not visible in the overall plots) ---++ 6th May 2019 * Activity * User jobs, MC productions, and staging this week * Issues * CERN: * still open ticket for Failed jobs GGUS:140149 * RAL: * Continuing migration from Castor to ECHO * IN2P3: * Unscheduled warning downtime this morning for a patch for an NFS mount problem ---++ 29th April 2019 * Activity * User jobs, MC productions, and staging this week * Issues * RAL: * Continuing migration from Castor to ECHO * IN2P3: * Unscheduled warning downtime this morning for a patch for an NFS mount problem ---++ 15th April 2019 * Activity * User
jobs, MC productions, staging and some reprocessing this week. * Issues * RAL: * Continuing migration from Castor to ECHO * A disk server (gdss811) is down - causing various hold-ups and slow-downs of the different productions and the migration * PIC : Machine ran out of disk space (GGUS:140715) fixed now - thanks! * IN2P3 : Batch system issues (GGUS:140652) possibly ongoing ---++ 8th April 2019 * Activity * User jobs, MC productions, staging and some reprocessing starting this week. * Issues * RAL: * A restart of docker killed a number of jobs last week. RAL investigating the cause (GGUS:140589) * A disk server was in a bad state that caused timeouts on opening some files (GGUS:140599) ---++ 1st April 2019 * Activity * User jobs and MC productions * Issues * CERN: several tickets open: * Corrupted file(s) on EOS - GGUS:140476 * Pilots held at ce514.cern.ch - GGUS:140503 * PIC: All pilots failed. There was an error in the JobRouting definition in HTCondor-CE - solved (GGUS:140482) ---++ 25th March 2019 * Activity * User jobs and MC productions * Issues * CERN: several tickets open: * Pilots held at ce514.cern.ch - GGUS:140349 * Data transfer from CLOUD GGUS:140037 * Jobs Failed GGUS:140149 * FTS3: broken answer on monitor calls, GGUS:140024 * IN2P3: All data transfers Failed at IN2P3-CC, but problem solved and understood - it was due to the CRL update (GGUS:140354) ---++ 18th March 2019 * Activity * User jobs and MC productions * Issues * CERN: Some tickets for CERN are still open * EOS GGUS:139927 * Data transfer from CLOUD GGUS:140037 * Jobs Failed GGUS:140149 * IN2P3: downtime * CNAF: FTS3 transfers to QMUL ---++ 11th March 2019 * Activity * User jobs and MC productions * Issues * CERN: Some tickets for CERN/EOS are still open, even though the problems are mostly gone. Not clear why. ---++ 4th March 2019 * Activity * User jobs and MC production * Issues * CERN: Some ongoing EOS issues both writing and reading GGUS:139927 * CERN/VOMS: Proxy renewal for SAM tests has stopped working. VOMS team investigating GGUS:139920 (Update: This looks like it is now fixed!) ---++ 25th February 2019 * Activity * User jobs and MC production * Issues * CERN: LHCb EOS data transfer problem GGUS:139871; GGUS:139875; Solved * CNAF: Network connection problem ---++ 18th February 2019 * Activity * User jobs and MC production * Stripping s35 ---++ 11th February 2019 * Activity * User jobs and MC production * Stripping s35 * Sites Issues * CNAF: Data transfer problem (GGUS:139608) ---++ 4th February 2019 * Activity * User jobs and MC production * Stripping s35 and s35r1 for PbPb * Sites Issues * CERN : EOS issue (GGUS:139463); Solved? * PIC: Data downloading problem (GGUS:139492) * RAL: Jobs Failed with Segmentation fault (GGUS:139414) ---++ 28th January 2019 * Activity * Data reconstruction for 2018 data ongoing * User jobs running and MC jobs at "full steam" * Sites Issues * CERN : NTR * Tier-1s : NTR ---++ 21st January 2019 * Activity * Data reconstruction for 2018 data * User and MC jobs * Sites Issues * CERN : NTR * Tier-1s : NTR ---++ 14th January 2019 * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN : (GGUS:139077) closed, thanks Jan. Recycle bin "back to a nice safety margin".
* RAL : Aborted pilots (GGUS:139081) ---++ 7 January 2019 * Activity * Data reconstruction for 2018 data * User, WG processing and MC jobs * Site Issues * CERN : Curious to know the status of the CERN cloud T-systems (GGUS:139080), RHEA (GGUS:138848) * CERN : Also ran out of space in EOS "recycle-bin" (GGUS:139077) earlier today. Requesting a shorter retention period for now, before we decide on further measures * RAL : Aborted pilots (GGUS:139081) Also a few other issues over the holiday period which were resolved either internally or through GGUS tickets. ---++ 17 December * Activity * Data reconstruction for 2018 data * User and MC jobs * Staging data for reprocessing in 2019 * Site Issues * SARA: Ticket open during the weekend concerning tape migration issues, fastly fixed saturday night... Thanks a lot! * Thanks all Sites for this great year! ---++ 10 December * Activity * Data reconstruction for 2018 data * User and MC jobs * Staging data for reprocessing in 2019 * Site Issues * SARA: Ticket open concerning data transfers problems (GGUS:138472). A couple of others also waiting probably related (GGUS:138362, GGUS:138293) * RRC-KI ticket opened about pilot issues (GGUS:138637). Solved very fast. * IN2P3 : Transfers slow (GGUS:137918). * IN2P3 : Staging issues (GGUS:138719). Fixed fast - thanks! ---++ 03 December * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * SARA: Ticket open concerning data transfers problems (GGUS:138472) site waiting on CERN input ---++ 26 November * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * SARA: Ticket open concerning data transfers problems (GGUS:138472) ---++ 19 November * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * PIC: Data Access problems during the weekend, solved * GRIDKA: Downtime declared for tomorrow * SARA: Ticket open concerning data transfers problems (GGUS:138293) ---++ 12 November * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: spike in failed jobs on Sunday, currently investigating, FTS delegation issue (GGUS:138063) * SARA,IN2P3 FTS3 data transfer problem SARA <=> IN2P3 (GGUS:137967) (GGUS:137972) * RAL: FTS issues (server removed from Configuration) (GGUS:137822) * IN2P3: Decreased transfer efficiency (GGUS:137918) ---++ 5 November * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: FTS delegation issue (GGUS:138063) * SARA,IN2P3 FTS3 data transfer problem SARA <=> IN2P3 (GGUS:137967) (GGUS:137972) * RAL: FTS issues (server removed from Configuration) (GGUS:137822) * IN2P3: Decreasing number of jobs (GGUS:138090) * IN2P3: Decreased transfer efficiency (GGUS:137918) ---++ 29 October * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * SARA,IN2P3 FTS3 data transfer problem SARA <=> IN2P3 (GGUS:137967) (GGUS:137972) * RAL: FTS issues (server removed from Configuration) (GGUS:137822) ---++ 22 October * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * RAL: FTS issues (server removed from Configuration) (GGUS:137822) ---++ 15 October * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * NTR ---++ 8 October * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * NTR ---++ 1 October * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: Problem with accessing files (GGUS:137079) ---++ 24 September * Activity * Data 
reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: Problem with accessing files (GGUS:137079) * GRIDKA: Data transfer problem (GGUS:137329) ---++ 17 September * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: Problem with accessing files (GGUS:137079) ---++ 10 September * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: Pilot submission problem (GGUS:137037); Solved * CERN: Problem with accessing files (GGUS:137079) * CNAF: Minor problems at worker nodes ---++ 3 September * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * RAL: Failing disk server at RAL resulting in jobs failing to get input data ---++ 27 August * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: The reported problem with uploads to EOS via xrootd (GGUS 136720) is likely related to the LHCb bundled grid middleware, the fix is being tested * RAL: ipV6 connection problems resulting in failed FTS transfers (GGUS 136863) * RAL: Failing disk server at RAL resulting in jobs failing to get input data ---++ 13 August * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * NTR ---++ 06 August * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: CVMFS problem at RHEA ( GGUS:136330); solved * RAL: FTS3 issue (GGUS:136199); in progress * CNAF: data transfer problem (GGUS:136123); in progress ---++ 30 July * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: CVMFS problem at RHEA ( GGUS:136330) * CERN: FTS3 issue (GGUS:136275); Fixed * RAL: FTS3 issue (GGUS:136199); Fixed * CNAF: data transfer problem (GGUS:136123); under investigation * CNAF: pilots Failed (GGUS:136120); Resolved ---++ 23 July * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * CERN: File transfers problems. Looks like it is related to a problematic FTS server. Under investigation ( GGUS:136275 ) ---++ 16 July * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * IN2P3: There's a ticket for file transfer errors. "Better now" but need to be investigated ( 136067 ) * CNAF: Ticket opened (136120) for failing pilots; under investigation. Another ticket for file transfer errors, in progress(136123). ---++ 9 July * Activity * Data reconstruction for 2018 data * User and MC jobs * Site Issues * NTR ---++ 2 July * Activity * Data reconstruction for 2018 data, MC simulation, user jobs ---++ 25 June * Activity * Data reconstruction for 2018 data, MC simulation, user jobs ---++ 18 June * Activity * Data reconstruction for 2018 data, MC simulation, user jobs * Site Issues * CNAF: transfer issues (GGUS:135657) and job issues (GGUS:135668), both look ok now ---++ 11 June * Activity * Data reconstruction for 2018 data, MC simulation, user jobs * Site Issues * CERN: CVMFS issue (GGUS:135601) ---++ 4 June * Activity * Data reconstruction for 2018 data * Site Issues * NTR ---++ 28 May * Activity * Data reconstruction for 2018 data * Site Issues * NIKHEF: Pilots Failed (GGUS:135325) during weekend; Fixed. * Most pilots at ce515.cern.ch finished "successfully" without matching jobs due to missing CVMFS. ---++ 30 April * Activity * HLT farm off for MC * Updates * NTR * Site Issues * CERN: Staging issues. Many Conection reset by peer on Castor (GGUS:134755), related to FTS proxy-renewal. Noticed from 25th, bump of failures also this morning. 
---++ 23 April * Activity * HLT farm to be used for some more time in parallel with the trigger * Updates * Deploying LHCbDIRAC with GLUE2 support today * Site Issues * IN2P3: Some tape files lost (GGUS:134666), recopied from other sites. * PIC: Staging problems (GGUS:134667) * CNAF: LHCb completed data management actions after the long downtime. ---++ 16 April * Activity * HLT farm to be used for some more time in parallel with the trigger * Site Issues * SARA: data access problems (GGUS:134545) being worked on * CNAF: working with the site to resurrect last 60 files for the re-stripping ---++ 9 April * Activity * HLT farm fully running * 2017 data re-stripping almost 100% finished * Stripping 29 reprocessing is ongoing * Site Issues * SARA: Data transfer issues (GGUS:134451). Being tracked down as a CRL issue. * CNAF: IPv6 issues on one CE (GGUS:134456) already solved * IN2P3: xroord server maybe broken (GGUS: 134441) ---++ 19 March * Activity * HLT farm fully running * 2017 data re-stripping ongoing * Stripping 29 reprocessing is ongoing * Site Issues * CNAF: IPv6 (GGUS:134088) and STORM issues. * GRIDKA: Pilot can not connect to the CS ( GGUS:133879) ---++ 12 March * Activity * HLT farm fully running * 2017 data re-stripping ongoing * Stripping 29 reprocessing is ongoing * Site Issues * CNAF: coming back to life, but storage not working since Sunday evening * Tier2D * Users with UK certificate problems solved by upgrading xrootd server ---++ 05 March * Activity * HLT farm fully running * 2017 data re-stripping ongoing * Stripping 29 reprocessing is ongoing * Site Issues * IN2P3: Failed pilots (GGUS:133824) * Tier2D * Users with UK certificate are having problem to access data at CBPF, Glasgow, CSCS, NCBJ (3 DPM, 1 dCache) GGUS:133667, GGUS:133617 ---++ 26 February * Activity * HLT farm fully running after dip over the weekend * MC simulation and user jobs * 2017 data restripping ongoing * Started stripping 29 reprocessing * Site Issues * NTR ---++ 19 February * Activity * HLT farm fully running * MC simulation and user jobs * 2017 data restripping should be started * Site Issues * NTR ---++ 12 February * Activity * HLT farm fully running * MC simulation and user jobs * Site Issues * CERN/T0 problem with updating DBOD - LHCbDirac was in downtime almost week ---++ 05 February * Activity * HLT farm fully running * 2016 data restripping, MC simulation and user jobs * Site Issues * CERN/T0 * Staging issues (https://ggus.eu/index.php?mode=ticket_info&ticket_id=133266) ---++ 29 January * Activity * HLT farm is partially running * 2016 data restripping, MC simulation and user jobs * Site Issues * CERN/T0 * NTR * T1 * RAL: Data transfers problem ALARM ticket (GGUS:133082); Solved. * IN2P3: Data transfers problem (GGUS:133081); Solved, but there was no reply on the ticket for two days. * SARA: No running jobs (GGUS:133089) ---++ 22 January * Activity * HLT farm "returning" from cooling maintenance.(no jobs running yet) * 2016 data restripping running full steam. Almost all data processed (waiting for CNAF) * Monte Carlo productions using remaining resources. * Meltdown & Spectre, several voboxes rebooting this week. * Site Issues * CERN/T0 * NTR * T1 * GRIDKA problems with FTS transfers(from and to) and "put and register". (fixed. Checked during meeting) ---++ 15 January * Activity * Running at maximum possible amount of resources. HLT farm stopped yesterday and returns "when cooling is stable again" * 2016 data restripping running full steam. 
Approx 1/2 of data processed (without CNAF) during YETS * Monte Carlo productions using remaining resources. * Meltdown & Spectre: performance hit after the fix expected to be less critical for data processing and Monte Carlo jobs (accounting for the vast majority of work carried out). * voboxes patch: reboot will be tomorrow. * Site Issues * CERN/T0 * NTR * T1 * RRCKI problems with FTS transfers currently under investigation. * RAL had issues during the weekend. "Burst" (jobs) reduced and all looks OK today. ---++ 8 January * Activity * Running at maximum possible amount of resources, including fully available HLT farm during YETS * 2016 data restripping running full steam. Approx 1/2 of data processed (without CNAF) during YETS * Monte Carlo productions using remaining resources. * Meltdown & Spectre: performance hit after the fix expected to be less critical for data processing and Monte Carlo jobs (accounting for the vast majority of work carried out). * Need to patch voboxes, waiting for instructions from CERN * Site Issues * CERN/T0 * ALARM ticket (GGUS:132628) for EOS transfer problems fixed internally by LHCb * T1 * RRCKI problems with FTS transfers currently under investigation ---++ 18 December * Activity * Stripping validation, user analysis, MC * Site Issues * T1 * RAL: problems with file upload (GGUS:132540) - possibly solved. Internal ticket opened about pilots killed at RAL (not by LHCb). * SARA : Waiting for end of downtime. * Missing files : RAW files found missing at RRCKI (recovered), PIC (recovered) and IN2P3 (under investigation). * CERN : * Brief downtime of multiple database services yesterday. Also possibly a similar issue last week too. * Staging failures (GGUS:132516) - we hope that the 3-day timeout request is not for the long term. * Missing files on tape (GGUS:132525) - solved? ---++ 11 December * Activity * Stripping validation, user analysis, MC * Site Issues * T1 * RAL: problems with file download from Castor (GGUS:132356) * RRC-KI: Downtime for the tape storage update, should be finished now * FZK: foreseen network maintenance on 12 Dec, expect possible temporary connectivity problems; temporary file unavailability due to disk pool migration (should be mostly transparent for the users) ---++ 27 November * General * almost no free disk space left, still waiting for complete deployment of the 2017 disk pledge * Activity * Stripping validation, user analysis, MC * Site Issues * T1 * SARA: problems with transfers today (GGUS:132067), no longer observed * RRC-KI: problems with file access, reported as fixed * FZK: one WN without CVMFS (GGUS:132064), solved close to instantly ---++ 20 November * Activity * Stripping validation, user analysis, MC * Site Issues * NTR ---++ 13 November * Activity * New round of stripping validation before launching the campaign. * Site Issues * INFN-T1: * Several issues b/c of the site outage in all areas of experiment distributed computing. Currently working on an analysis of the situation, also in view of upcoming data processing campaigns.
---++ 1 November * Activity * Monte Carlo simulation, data processing and user analysis * Validation for restripping completed; awaiting response * Site Issues * T0: * User had incorrect mapping at EOS; fixed * T1: * NTR ---++ 30 October * Activity * Monte Carlo simulation, data processing and user analysis * pre-staging progressing well * Site Issues * T0: * NTR * T1: * CNAF Alarm ticket (GGUS:131366) solved ---++ 23 October * Activity * Monte Carlo simulation, data processing and user analysis * pre-staging approx 50% complete, progressing well * Site Issues * T0: * NTR * T1: * NTR ---++ 16 October * Activity * Monte Carlo simulation, data processing and user analysis * pre-staging of 2015 data for reprocessing progressing well, ~ 1/3 of data on disk buffers * Site Issues * T1: * INFN-T1 tape buffer running full, fixed by site admins * RAL disk server down with effects on production workflows * AOB * Request grid-wide deployment of the latest HepOSlibs meta-rpm, including deployment of the git client. ---++ 09 October * Activity * Monte Carlo simulation, data processing and user analysis * pre-staging of 2015 data for reprocessing has started and will continue for several weeks. * Site Issues * T1: * Failures in transfers to and from RRCKI over the weekend, solved now. * NL-T1 worker nodes in downtime tomorrow and Wednesday. * Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne ---++ 02 October * Activity * Monte Carlo simulation, data processing and user analysis * pre-staging of 2015 data for reprocessing has started and will continue for several weeks. * Site Issues * T1: * Failures in transfers to and from GRIDKA (GGUS:130848); This was due to heavy load on dCache. It is stable now. * File uploads and downloads failed at CNAF due to a hardware failure, which has already been fixed. * Missing expatbuilder at NIKHEF-ELPROD (GGUS:130832); solved * Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne * T2: * Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715); solved ---++ 25 September * Activity * Monte Carlo simulation, data processing and user analysis (running more than 100K jobs) * Site Issues * T1: * Failed transfers from IC to SARA (IPV6) (GGUS:129946); Problem probably in SARA connection to LHCOne * Problem downloading from SARA (GGUS:130692). Solved promptly by SARA - thanks.
Problem with stuck dCache space manager * * T2: * Problems with pilots failing to contact LHCb services at CERN from WNs at Liverpool (GGUS:130715) ---++ 18 September * Activity * Monte Carlo simulation, data processing and user analysis (running more than 100K jobs) * Site Issues * * T1: * Failed transfers from IC to SARA (IPV6) (GGUS:129946); "Geant have confirmed that they are unable to ping mouse1.grid.sara.nl from geant-lhcone-gw.mx1.lon.uk.geant.net" ---++ 11 September * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * * T1: * Failed transfers from IC to SARA (IPV6) (GGUS:129946); "Geant have confirmed that they are unable to ping mouse1.grid.sara.nl from geant-lhcone-gw.mx1.lon.uk.geant.net" * Access file problem at GRIDKA (GGUS:130478) ---++ 4 September * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * T0: * Incomplete python installation at worker nodes (GGUS:130018) * T1: * Failed transfers from IC to SARA (IPV6) (GGUS:129946); no news * Failed transfers from many sites to dCache sites, see (GGUS:130190); Resolved by using proper parameter in SRM * We have peak of failed transfers at EOS every day at 5:00 (GGUS:130335) ---++ 28 Aug (Monday) * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * T0: * Incomplete python installation at worker nodes (GGUS:130018) * T1: * Failed transfers from IC to SARA (IPV6) (GGUS:129946) * Failed transfers from many sites to dCache sites, see (GGUS:130190) ---++ 21 Aug (Monday) * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * T0: * Problem with EOS in the night between fri and sat (GGUS:130137). Become an alarm in the morning of sat. Fixed now, 3/7 grid-ftp doors were mis-behaving * * T1: * NTR ---++ 14 Aug (Monday) * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * T0: * Problem with installation of python possibly broken on multiple WNs (GGUS:130018) - ongoing issue * * T1: * Problems uploading to various SEs - For SARA, tracked in GGUS:129946. Now also seen in FZK, IN2P3 and PIC - to be tracked and tickets opened if needed. ---++ 7 Aug (Monday) * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * T0: * Key VO Box (lbvobox103) unavailable (lost?) due to hypervisor problem. (GGUS: 129942). No GGUS (or Service Now) updates since yesterday morning. Having to recreate services on other VO Boxes. 
* T1: * NTR ---++ 31 Jul (Monday) * Activity * Monte Carlo simulation, data processing and user analysis * Site Issues * T0: * NTR * T1: * NTR ---++ 17 Jul (Monday) * Activity * Lots of user analysis(some failling) and Monte Carlo simulation * Site Issues * T0: Jobs hold at HTCondor CEs CERN-PROD (GGUS:129147) * T1: * NTR ---++ 10 Jul (Monday) * Activity * User analysis and Monte Carlo simulation * Site Issues * T0: Jobs hold at HTCondor CEs CERN-PROD (GGUS:129147) * T1: * NTR ---++ 03 Jul (Monday) * Activity * User analysis and Monte Carlo simulation * Site Issues * * T0: Jobs hold at HTCondor CEs CERN-PROD (GGUS:129147) * T1: * IN2P3: Downloads and Uploads issues during weekend, fixed ---++ 26 Jun (Monday) * Activity * User analysis and Monte Carlo simulation * Site Issues * * T0: Jobs in hold state at HTCondor CEs (GGUS:129147) * T1: * IN2P3: the running jobs are decreasing (GGUS:129141) * RAL: timeout on RAL storage (GGUS:129059) ---++ 29 May (Monday) * Activity * User analysis and Monte Carlo simulation * Site Issues * * T0: * Network/DNS outage yesterday caused problems for a few hours. All recovered now. ---++ 29 May (Monday) * Activity * User analysis and Monte Carlo simulation * Stripping v24 is almost finished. * Site Issues * * T1: * RAL: data access problem (GGUS:128398) ---++ 22 May (Monday) * Activity * User analysis and Monte Carlo simulation * New validation of Stripping v24 has been started. * Site Issues * * T1: * RAL: disk server failures during the weekend ---++ 15 May (Monday) * Activity * Stripping v24 waiting for developers. Almost 100k jobs running * Site Issues * * T0: * EOS downtime this morning. ---++ 8 May (Monday) * Activity * Stripping v28 over, Stripping v24 waiting for developers. * Site Issues * * T1: * RAL: Disk server failure last week, back in production today. Some FTS3 timeouts during staging. ---++ 24 April (Monday) * Activity * MC Simulation, Data Stripping and user analysis * Site Issues * T0: Some ongoing problems with Condor CEs (GGUS:127553) * * T1: * RAL: Still running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617). Hoping these problems will be fixed by the CASTOR upgrade a week on Wednesday ---++ 18 April (Tuesday) * Activity * MC Simulation, Data Stripping and user analysis * Staging campaigns are ongoing for Data Stripping. * Site Issues * T0: SRM problems fixed quickly last week (GGUS:127638) * * T1: * CNAF: Uploading problems over the weekend fixed (GGUS: 127728) Due to another VO's GPFS usage pattern. * RAL: running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617) but better than the situation before the version downgrade. * RRCKI: running with a limit on the number of user jobs due to limits on concurrent open files in dCache (no GGUS for this) * * T2: Seeing SL6.9 openssl problems at several sites. Tickets issued. ---++ 10 April (Monday) * Activity * MC Simulation, Data Stripping and user analysis * Staging campaigns are ongoing for Data Stripping. 
 * Site Issues * T0: * Transfer errors from the job: could not open connection to srm-eoslhcb.cern.ch (GGUS:127638) * T1: * RAL: two alarm tickets were opened during the weekend: * CEs were not accepting new jobs (GGUS:127612) * SRMs were down (GGUS:127617) * CNAF: failed contact to the SRM: could not open connection to storm-fe-lhcb.cr.cnaf.infn.it:8444 (GGUS:127608) ---++ 03 April (Monday) * Activity * MC Simulation, Stripping * Staging campaign for Stripping27, Stripping28 and Stripping24b, as well as 2015 EM, should take 6 to 7 weeks with peaks of staging. * Site Issues * T0: * The 3 gridftp doors were saturated. Added 2 new ones. * T1: * RAL: suffering a huge issue with SRM. Under investigation * CNAF: Stager was blocked for a while * FZK: Seem to have found a somewhat correct balance between timeouts and performance for transfers ---++ 27 March (Monday) * Activity * MC Simulation, Stripping * Staging campaign for Stripping27, Stripping28 and Stripping24b should take 6 to 7 weeks with peaks of staging. * Site Issues * T0: * CERN-HIST-EOS and FAILOVER communication error (GGUS:127308) and (GGUS:127309). Seems solved. * T1: * RAL: disk server gdss780 is currently unavailable. * CNAF: Added an additional drive for staging. * PIC: LTO5 drive is supposed to be replaced today. Could be slower than usual. * FZK: FTS transfers fail (GGUS:127301). Under investigation. ---++ 20 March (Monday) * Activity * MC Simulation, Stripping * Database backup locking and long queries from us on Friday caused severe disruption to the LHCb production management system over the weekend and into today, both for data and MC. A lot of manual work has been done to resolve inconsistencies. * Site Issues * T0: * ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights. No update since 9th March. * GGUS:127148 has jobs being killed (rather than just limited by cgroups) when using more than 2GB of physical memory when there is contention. The LHCb VO ID card requests 4GB of virtual memory and jobs typically work with significantly less than 2GB RSS for almost all of their duration. * T1: * FZK: Some ongoing issues with submission timeouts to the new ARC CEs, with arc-2-kit not working at all (GGUS:127075). Also GGUS:127122 with transfer timeouts causing lots of queued transfers in our production system. * CNAF: GGUS:127129 had a number of file transfer failures but this problem seems to be OK now. We have also had files which appear to have been transferred successfully but aren't there in reality, but this appears to be a consequence of the database problems we had on Friday rather than due to CNAF. ---++ 13 March (Monday) * Activity * MC Simulation, Stripping campaign now started so tape systems will start to be hit * Site Issues * T0: * ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights. No updates since 2nd March - any more news on fixes? * Ready for network outages on Wednesday morning - Thanks for shifting the DT from 22nd to 15th as well! * T1: * FZK: Some ongoing issues with submission timeouts to the new ARC CEs, with arc-2-kit not working at all (GGUS:127075) ---++ 6 March (Monday) * Activity * MC Simulation, Stripping campaign to start this week which will increase load on T1 tape systems * Site Issues * T0: * Wed: ALARM ticket GGUS:126874 about users running out of AUP signature validity. User AUP validity overwritten with admin rights.
 * Observed CVMFS failures on batch and cloud machines (GGUS:126876). Failure rate decreased now * T1: * SARA: SRM problems over the weekend (GGUS:126937). Currently cannot test if fixed b/c site is in DT * FZK: Switching to ARC-CEs only. Last week's sw update for ARC-CEs produced failures (GGUS:126882). CREAM-CE submission already stopped from the LHCb side. * PIC: Currently in DT for dCache upgrade. Batch closed but CEs open --> produces aborted pilots on the LHCb side. ---++ 27th February (Monday) * Activity * MC Simulation, user analysis and data reconstruction jobs * Site Issues * T0: * Some settings were changed at the EOS SRM and should have fixed last week's problem * Intervention on the LHCb offline production database (LHCBR) to move to new hardware Wednesday 01/03/2017 from 10am to 12pm ---++ 20th February (Monday) * Activity * MC Simulation, user analysis and data reconstruction jobs * Site Issues * NTR ---++ 13th February (Monday) * Activity * MC Simulation, user analysis and data reconstruction jobs * Site Issues * T0: * Second instance of SRM for EOS LHCb is in production. Original EOS SRM reports zero for available space from time to time. * T1: * CNAF: Downtime for 3 days * SARA: Downtime tomorrow (1 hour) ---++ 6th February (Monday) * Activity * MC Simulation and user analysis; reco jobs starting again * Site Issues * T0: * Major problems with SRM for LHCb use of EOS, unable to use it for most of the weekend (GGUS:126378). This led to loss of the results from 10,000s of jobs on the HLT farm as it only connects to CERN. Appears to be resolved now: initial overloading led to an avalanche of failures and retries, all increasing the load. Looking at ways to avoid this with IT and within LHCb. * T1: ---++ 30th January (Monday) * Activity * MC Simulation and user analysis * Site Issues * T0: * HTCondor problem (GGUS:126132) * T1: * RAL: CVMFS problem (squid issue) (GGUS:126241); Fixed * GRIDKA: problems at ARC CEs (GGUS:126135) ---++ 23rd January (Monday) * Activity * Mainly running simulation on grid-only resources, HLT back and running ~10K jobs. ~67K jobs total * Site Issues * T0: * EOSLHCB "very slow" via SRM last week. Back to normal (GGUS:126037) * T1: * RAL: Storage in Downtime today from 10:30 to 12:30 * CNAF: announced a Downtime on 13th and 14th February for changing of the core switch ---++ 16th January (Monday) * Activity * Mainly running simulation on grid-only resources, HLT off b/c of maintenance * Site Issues * T0: * The LHCb internal name of the T0 batch resources has been renamed from LCG.CERN.ch to LCG.CERN.cern (to distinguish it from other .ch resources) * T1: * RAL: file access issue: User cannot open files (GGUS:125856), will be followed up after the Wed DT ---++ 9th January (Monday) * Activity * Very high activity during the Christmas break: running more than 100k jobs (new record for LHCb!) * Data reconstruction (proton-ion) almost finished, MC and user analysis. * Site Issues * T0: * NTR * T1: * Transfer problem from GRIDKA to CBPF (GGUS:125789) * RAL: file access issue: User cannot open files (GGUS:125856). </noautolink>