ADCOperationsWeeklySummaries2016

Instructions

  • Reports should be written on Monday morning by the CRC summarising the previous week's ADC activities and major issues
  • The WLCG SCOD should copy the report to the WLCG Operations Meeting Twiki after 14:30 CET on Monday
  • The issues are reported to the WLCG Operations Meetings at 15:00 CET on Monday by the CRC or by Ivan G and/or Tomas J.
  • ATLAS-internal issues should be documented daily in the ADCOperationsDailyReports pages.

Reports

ADCOperationsWeeklySummaries2017

https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsWeeklySummaries2017

12 December - 19 December

  • Production
    • Stable production levels using 250-300k cores
    • Regular data transfers after last week's problems with proxies in FTS
    • Continuing validation tests in view of big reprocessing
  • Problems
    • Almost all production and analysis jobs at INFN-T1 have been failing since Friday morning because they cannot see their input files - under investigation. It looks like a problem with an update to our pilots last week.
    • Derivation productions using fast merge produce unreadable output, and downstream jobs cannot open the files - being checked.
  • Plan
    • Fill the Grid for the holiday period mostly with simulation and keep an eye or two (best effort) on it.
  • Extra
    • Best wishes to everybody!

5 December - 12 December

  • Production
    • High level of running jobs ~250k (except during Sunday/Monday morning, because of issues with FTS)
    • Updated fairshare to finish derivation tasks (data 2015 already exists; data 2016 and MC 2015 must be reprocessed with the latest fixes)
    • Problematic overlay transformation tasks - job efficiency was low on Wednesday
    • Reprocessing beamspot data (reading data from tape); prepared 600M MC evgen events for the Christmas period to keep the available resources busy
    • PanDA pilot update (problems with wrong eventType in traces fixed; this had an impact on Rucio metadata/behaviour)
  • Services
    • Migration of 120 virtual machines to new OpenStack/hardware (mostly live migration, transparent to users; only two machines needed manual intervention)
    • Frontier services overloaded by overlay transformation tasks; created a special tag and PanDA queue with a limited number of running tasks and redirected more sites to the CERN Frontiers
    • OS upgrade to CC7 for all Rucio services
      • one server was automatically pushed to the load balancer before it was fully configured - immediately fixed
      • a missing certificate proxy on the Rucio servers caused deletion problems (again promptly fixed and did not cause any damage - just less free scratch space)
      • 5 sites with DPM+WebDAV servers configured a long time ago by YAIM had an incompatible list of HTTPS ciphers - fixed without causing trouble
      • expired certificate proxy on FTS (details below)
    • FTS - expired ATLAS proxy
      • our script that delegates a refreshed certificate proxy did not work on CC7, so the existing proxy (valid for 96 hours) expired at 23:00 on Saturday
      • traced to a key stored in the credential cache on the FTS server that was signed with MD5; CC7 no longer supports this signature algorithm (a check of this kind is sketched below)
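
A minimal illustrative sketch (not the actual delegation script) of how one could check which hash algorithm signed each certificate in a delegated proxy file, so that an MD5-signed entry in the credential cache can be spotted before an OS that has dropped MD5 support (such as CC7) rejects it. It assumes the Python "cryptography" package (>= 39.0) and a hypothetical proxy path:

# Sketch only: inspect signature hash algorithms of certificates in a PEM proxy.
from cryptography import x509

PROXY_PATH = "/var/tmp/fts_delegated_proxy.pem"  # hypothetical path, for illustration

def list_signature_algorithms(pem_path):
    """Return (subject, hash algorithm name) for every certificate in the PEM file."""
    with open(pem_path, "rb") as f:
        certs = x509.load_pem_x509_certificates(f.read())
    entries = []
    for cert in certs:
        try:
            name = cert.signature_hash_algorithm.name
        except Exception:  # unsupported or hash-less signature algorithms
            name = "unknown"
        entries.append((cert.subject.rfc4514_string(), name))
    return entries

if __name__ == "__main__":
    for subject, algo in list_signature_algorithms(PROXY_PATH):
        note = "  <-- not accepted on CC7" if algo == "md5" else ""
        print(f"{algo:<8} {subject}{note}")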

28 November - 5 December

  • Production
    • Big production of DAOD derivations on 2015/2016 pp data and MC, plus an urgent upgrade task, ongoing. Some difficulty keeping the requested share since the upgrade task needs high-memory queues and the derivation tasks take time to merge files.
    • Some 2016 data reprocessing tests with release 21 starting
    • High level of running jobs: 200-260k job slots over the week
    • One misconfigured Powheg task that should have used a single core was spawning over all cores: a known and fixed problem that reappeared because the jobs were an extension of an old dataset
  • Services
    • Migration of virtual machines ongoing (week 05 December - 09 December)
    • CVMFS probes (Stratum-1 and COND DB) still going up and down (behind-revision problem)

21 November - 28 November

  • Reprocessing of the physics delayed stream is finishing successfully
  • Big production of DAOD derivations on 2015/2016 pp data started.
    The share for group production was increased again to cope with the large production.
  • CVMFS services were partially unavailable at 2 sites, ASGC and RAL, on the evening of Friday, Nov 25.
    RAL was fixed quickly after starting the cron job for updates.
    ASGC later solved issues with instabilities on its link to CERN; it has been running well since then.

15 November - 21 November

  • Reprocessing of the physics delayed stream was running, taking data directly from tape.
    It was slow at the beginning because PanDA was not using tape resources evenly, putting too much load on the CERN tapes.
    There were additional staging requests from Tier-0 activities (re-calibration running before the large release 21 reprocessing of 2015+2016 data).
    There was also competition from user download requests; limits have been temporarily reduced to 1 TB/day, to be discussed with CREM.
    Measures have been taken to distribute staging from tape more evenly over all Tier-1s; this already helped the reprocessing over the weekend.
  • Slow recovery of disk space reported by some dCache instances (FZK, RRC-KI) after file deletion by ATLAS.
    Not clear whether this is a dCache service issue or a site problem; discussing with dCache support.
  • The CVMFS service at TAIWAN regularly falls behind in updating to the latest version of the contents.

8 November - 14 November

  • review of potential new nuclei sites to be done, want ~10 stable sites
  • lifetime deletion completed, 4PB cleared on nuclei sites
  • CERN tape slowness being investigated, news?
  • Need to reprocess the physics delayed stream this week; at 50k cores this will be about one week of workload. Reading directly from tape to see tape performance; the plan is to leave staging to the workflow and not pre-stage anything
  • atlas-adc-cloud-US didn't reply to GGUS:124909; this was an FTS issue to TW, now recovered, but were communications OK?

1 November - 7 November

  • Production running fine, pre-MC16 samples submitted, preparing for Heavy Ion run.
  • Network intervention last Thursday morning, recovered fine.
  • Data loss at TAIWAN; GGUS:124597.
  • Tier1 disks are getting full, constant auto and manual re-balancing. Dedicated discussion this week for production input replicas.

24 October - 31 October

  • Quiet week. The only disruption was the kernel upgrade with the security patch. Most sites did rolling upgrades (thanks!); only a few went into complete downtimes. All in operation now.

18 October - 23 October

  • Frontier overload pinned down to DCS pixel high-voltage monitoring and calibration data being loaded for each event, which is not needed in production at all.

11 October - 17 October

  • Quiet week - CHEP2016
  • Ongoing MC12 reprocessing (single core)
  • Frontiers loaded / brought down on Friday due to nasty reprocessing tasks - under investigation.

4 October - 10 October

  • EOS to CASTOR transfer rate tested using xrootd, not SRM (no checksum validation): ~5-6 GB/s
  • SARA downtime (until Oct 17) - some running tasks had to be re-assigned. Unique datasets were identified; no effect on production.
  • Running low priority MC12c samples.

27 September - 3 October

  • Activities and global report
    • Grid is running at full capacity; production continues with the new Sherpa MC samples.
    • Productive Software and Computing week last week.
  • Problems:
    • The EOSATLAS service was degraded during the night of September 27 (high latencies between disk servers and the EOSATLAS headnode) and stabilized one day later; shortage of disk space at Meyrin, geotagging disabled (IT ticket OTG0033142).

20 September - 26 September

  • Activities and global report
    • Production going well (above 250k running jobs) mostly event generation and simulation but also analysis and reprocessing tests.
    • Large backlog of user jobs (peaked at 600k jobs) drained by Sunday afternoon.
    • T0 running grid jobs when CPU available
    • Some staging activity in order to prepare next reprocessing campaign
  • Problems:
    • Data
      • Lost file on EOS (Friday, 16 September) due to an Xrootd client issue (dual open on write with a fault on close); EOS issued a patch to correct the issue.
      • Most T2 site problems were related to data (transfer, deletion, access)
    • Central services
      • Frontier server atlast0frontier3-ai.cern.ch:8000 (connect: timeout), part of a 4-node cluster. The VM was restarted; there was no information in the OpenStack logs to explain why the VM stopped.
      • e-group deletion activities inadvertently removed ATLAS users from the zp group. https://cern.service-now.com/service-portal/view-outage.do?n=INC1136368
        • Thank you to all the people involved. Perhaps a "lessons learned" would be useful, e.g. the synchronization between e-groups, AFS and computing accounts seems not to be fully clear to everybody (at least not to us). Also, still waiting for the fix for some users who still do not have the computing group ZP first.

6 September - 12 September

  • Activities and global report
    • Production going well (above 250k running jobs) mostly event generation and simulation but also analysis and reprocessing tests.
    • Quite high pressure from analysis jobs: a few users submitted huge productions, one of them with very high memory consumption (this task was aborted).
    • T0 running grid jobs when CPU available
    • Some staging activity in order to prepare next reprocessing campaign
  • Problems:
    • Jobs
      • T1 site problems: high failure rate at RAL due to one WN, high failure rate at NIKHEF ongoing, one pool down at IN2P3-CC
    • Data
      • T1 storage full with few secondary replicas: data has to be rebalanced regularly between the T1s
      • FZK deletion failures due to a bug in an old dCache version when HTTPS is used (a permission error rather than "file not found" is reported when a file is missing): the site will upgrade to a new dCache version and in the meantime the SRM protocol will be used (see the sketch after this list)
      • Most T2 site problems were related to data (transfer, deletion, access)
    • Central services
      • CVMFS Stratum-1 and CONDB glitches, but the service is working
      • The AMI replica at CERN was down this weekend; the main service was working. Seems to be a synchronization problem, CERN servers OK. Solved today.
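
The FZK deletion issue above comes down to error semantics: a WebDAV deletion client can safely treat "file not found" as already deleted, but a permission error has to be treated as a real failure, so a storage element returning the wrong code blocks the deletion queue. A hypothetical, minimal sketch of that classification (endpoint, paths and certificates are illustrative only; assumes the Python "requests" package):

import requests

def delete_replica(url, cert, key):
    """Issue a WebDAV DELETE and classify the outcome by HTTP status code."""
    resp = requests.delete(url, cert=(cert, key),
                           verify="/etc/grid-security/certificates")
    if resp.status_code in (200, 202, 204):
        return "deleted"
    if resp.status_code == 404:
        return "already absent"      # safe to mark the deletion as done
    if resp.status_code in (401, 403):
        return "permission denied"   # must be retried/investigated, not marked done
    return f"unexpected status {resp.status_code}"

if __name__ == "__main__":
    # Hypothetical endpoint and credentials, for illustration only.
    print(delete_replica(
        "https://webdav.example-site.de:2880/pnfs/atlas/datadisk/some/file",
        "/path/to/hostcert.pem",
        "/path/to/hostkey.pem",
    ))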

29 August - 5 September

  • Activities:
    • Spillover tasks: testing a new feature of Athena (event forking after the first event); some issues with the comparison of DQ histograms
    • Reprocessing of physics_main data is planned; RAW2ALL test task ongoing: http://bigpanda.cern.ch/task/9348904/
    • Protocols:
      • intensive effort to set up gsiftp at sites in the US cloud
      • https/webdav used for deletion at many sites
  • Problems:
    • The Wigner to EOS stream stalled on Sunday for 30 minutes; no related issues
    • Kibana monitoring issues.
    • AMI service appeared red for two days in our monitoring, it is fixed now.
    • FZK-LCG2 space issues: after deletion it took three days before the space decrease was reported

22 August - 29 August

  • Activities:
    • normal running - grid is full
  • Problems:
    • RRC-KI-T1 finished its downtime - some files seemed to be unavailable (GGUS:123587) and were declared lost

15 August - 22 August

9 August - 15 August

  • Activities:
    • Grid almost full except on Sunday, when we ran out of jobs.
    • detection of dark data at RRC-KI-T1_DATADISK (108352 files), deletion in progress
    • improved Tzero usage by grid jobs
  • Problems:
    • again problems with some Frontier and squid servers due to overload by overlay tasks

2 August - 8 August

  • Activities:
    • Grid almost full.
    • A few tasks doing MC Overlay with real data are running on a dedicated site (BNL) to avoid creating trouble for the rest of the Frontier/DB infrastructure.
    • CERN - RAL network link now running with both the primary and secondary 10Gbps.
  • Problems:

26 July - 1 August

  • Activities:
    • Low in MC simulation but 60M events added mid-week. 800M Sherpa request not ready yet.
    • Some grid production at T0, looking to use all 10k slots
  • Problems:
    • Kibana/ES unreliable again for important service monitoring GGUS:123021
    • RRC-KI DATADISK full but SRM query shows 450TB free so investigating inconsistency GGUS:123198
    • Castor garbage collection stuck, causing a pool to get full (ALARM:123148)
    • TRIUMF-CERN link saturated at 10 Gbit/s
    • Overlay tasks stress Frontier when run at many T2 sites; better squid hit rate when restricted to BNL. Further work needed.
    • Heavy-Ion tasks using high memory (>64 GB); raised with the HI group to review the situation as it is not sustainable

19 July - 25 July

  • Activities:
    • Two T0 spillover tasks completed fine, their derivations are done. ICHEP derivations almost done, last run is being processed, to finish today/tomorrow.
    • Low in MC simulation again, 10M events left to be done; asking the MC team to keep at least 100M events in the system.
  • Problems:
    • Export rate from EOS close to 6 GB/s, maximum throughput we can get with the current hardware.
    • Discussions last week with CMS on CERN-RAL network link saturation. Summary and actions put together.
    • TZDISK space getting tight, lifetime reduced to 2 weeks from 2.5 weeks, mostly RAW occupancy.
    • The Data Quality monitoring service was affected by the migration of nodes at CERN last week; the DQ server was replicated to another machine in the meantime. A long-term robust solution has been under discussion with the DQ team for some time.

5 July - 11 July

  • Activities:
    • T0 spillover ran at BNL on the biggest run (38 hours)
  • Problems:
    • Problems affecting EOS:
      • the namespace server crashed, then incorrect space was reported afterwards
      • bad throughput rate from EOS to T1_TAPE for some files due to switches
    • Some T1s are getting full; a deletion campaign for derived data is being done, plus rebalancing of data

28 June - 4 July

  • Activities:
    • LHC and data processing running full steam. Tier-0 saturated.
    • (Re)Testing T0 spillover (i.e. processing of T0 data on the Grid). Still some small differences being followed up.
  • Problems:
    • SARA had network problems just before the weekend; now seems fixed, but no report from the site (yet).

21 June - 27 June

  • Activities:
    • LHC and data processing running full steam. Record luminosity reached!
    • (Re)Testing T0 spillover (i.e. processing of T0 data on the Grid)
  • Problems:
    • Alarm ticket on Sunday due to a 100% failure rate writing to Castor (GGUS:122325). Moreover GGUS:122208 (alarm ticket from last week) is still open: "timeout in an internal communication between the CASTOR gridftp server and the diskmanager daemon"
    • RRC-KI-T1 file loss follow-up: 37k unique files lost out of 131k lost files; they will be regenerated.
    • EOS namespace crashed and was restarted (twice last week)
    • Slow transfers from AGLT2 to CERN: GGUS:122293. Will work on automatic identification of these slow links.

14 June - 20 June

  • Activities:
    • LHC and data processing running full steam
    • Derivation production on MC finished
    • Expecting ramp up in analysis as summer conferences approach
  • Problems:
    • EOS: "an end-of-file was reached globus_xio: An end of file occurred" on Thursday/Friday last week and again after the restart/upgrade last night (GGUS:122208)
      • SRM space reporting has also not been working since the upgrade (GGUS:122233)
    • RRC-KI-T1: lost the storage DB on 8 June; the last backup was from 19 April. Recovery of the DB finished on 17 June, but many files are still missing. ATLAS will run a consistency check to determine what is lost.
      • It would be nice to know the WLCG expectations on DB backups

7 June - 13 June

  • Activities:
    • Derivation production started last week for ICHEP, done on data15/data16. MC derivations are running.
    • Database and Metadata TIM (13.06 - 15.06), ADC TIM (15.06 - 17.06).
  • Problems:
    • NTR.

31 May - 6 June

  • Activities:
    • MC16 simulation is being tested in the production system. A new round of derivation production started today towards ICHEP; network traffic has increased due to this.
    • Review of the derivations volume continues; it is above budget by 50%. The semi-automated way of running on the common MC15c background samples (e.g. different pile-up profiles covered automatically) contributed to the large volume; a deletion request for unused samples has been made to the physics groups.
    • Preparing for the ADC TIM meeting at CERN next week, June 15-17.
  • Problems:
    • Long-standing issue with Condor/CREAM causing CERN resources to be not fully used (https://its.cern.ch/jira/browse/ASPDA-196); a patched binary provided by the Condor team was deployed on the relevant CEs last week.
    • CERN-PROD_SCRATCHDISK was full after one user requested 100 TB. The deletion rate was slow because of the many small files (the queue was sorted by age); the queue was rearranged to sort by file size, which sped up deletion (the reordering is sketched below).
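
A toy sketch (not the production deletion service) of the ordering change described above: with a backlog dominated by small files, processing the deletion queue oldest-first frees space slowly, while processing it largest-first reclaims far more space per cycle. All names and numbers are illustrative:

import random
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    size_bytes: int
    age_days: float

def bytes_freed_by_first_n(queue, n):
    """Bytes reclaimed by processing only the first n entries of the queue."""
    return sum(r.size_bytes for r in queue[:n])

if __name__ == "__main__":
    random.seed(1)
    # Mostly small user files plus a handful of very large ones (illustrative).
    replicas = [Replica(f"small{i}", random.randint(1, 50) * 2**20, random.uniform(0, 60))
                for i in range(10_000)]
    replicas += [Replica(f"big{i}", random.randint(50, 200) * 2**30, random.uniform(0, 60))
                 for i in range(50)]

    by_age = sorted(replicas, key=lambda r: r.age_days, reverse=True)     # oldest first
    by_size = sorted(replicas, key=lambda r: r.size_bytes, reverse=True)  # largest first

    n = 500  # deletions processed in one cycle (illustrative)
    print("freed, oldest-first :", round(bytes_freed_by_first_n(by_age, n) / 2**40, 2), "TiB")
    print("freed, largest-first:", round(bytes_freed_by_first_n(by_size, n) / 2**40, 2), "TiB")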

17 May - 23 May

  • Activities:
    • MC15c, derivation production, and Heavy Ion reprocessing ongoing - using 230k cores stably
    • ADC WM/DM review last week
  • Problems:
    • No particular problem this week apart from usual low-level data corruption and transfer troubles.

10 May - 16 May

  • Activities:
    • MC15c, derivation production, and Heavy Ion reprocessing ongoing
  • Problems:
    • T1 DATADISKs were getting full but held secondary replicas - more space at the end of the week
    • Problem with the RAL TAPE robot from Tuesday up to the weekend (now a bit of backlog being cleared)
    • Some transfers to CERN failing with "an end-of-file was reached globus_xio" (GGUS:121550)

26 April - 2 May

  • Activities:
    • MC15c, derivation production, and Heavy Ion reprocessing ongoing
    • re-validation of T0 spillover mechanism
  • Problems:
    • T1 DATADISKs getting full - BNL, SARA, INFN, FZK - data being rebalanced
    • BNL-SARA transfers are slow (GGUS:120957)
      • both BNL-CERN and BNL-SARA transfer slowness seems to be fixed
      • to be tested by DDM ops
    • transfers from FZK-LCG2_MCTAPE are failing with "Transfer canceled because the gsiftp performance marker timeout" (GGUS:121163)

19 April - 25 April

  • Activities:
    • MC15c, derivation production, and Heavy Ion reprocessing ongoing
    • re-validation of T0 spillover mechanism
  • Problems:
    • The Wigner connection problem caused no major service disruption
    • T1 DATADISKs getting full - BNL, FZK, NIKHEF - data rebalanced
    • FZK-LCG2 tape backend server broken (ELOG:57170,GGUS:121013) - transfers from tape fail with "Too many queued requests"
    • BNL-SARA transfers are slow (GGUS:120957)

12 April - 17 April

  • Activities:
    • MC15c (digi+reco reconstruction) ongoing
    • Heavy Ion reprocessing almost completed. Memory consumption reduced through less monitoring histos and a few other fixes.
    • The Online->EOS->Castor throughput test ran OK.
    • All sites were associated to the WORLD cloud last Friday. Tier-3s and non-MCP Tier-2s are now used again for production (and are full).
    • Ongoing discussion on the queue attributes needed for job matching (especially memory).
  • Problems:
    • Discussion ongoing on working directory definition for ACT+ARC-CE jobs running on TW sites.
    • Changes to FTS job submission last Thursday produced an increase of jobs queued in FTS but not polled for status. To be discussed by the Rucio and FTS people.

05 April - 11 April

  • Activities:
    • MC15c (digi+reco reconstruction) ongoing
    • Heavy Ion reprocessing started last Friday, now using ~20k slots. Monitoring memory consumption.
  • Problems with some central services (pilot factories, rucio-ui), now fixed.
  • In the process of closing the consistency check tickets opened last year to ask the sites to provide storage dumps.
  • Identified a list of 17 corrupted RAW files on Castor (GGUS:119750). The problem is linked to a faulty router corrupting traffic on June 11th last year: https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&n=OTG0022230. An insidious corruption, since the adler32 checksums of the corrupted files are the same as those of the good files, but the md5 checksums do differ (see the sketch below).
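
A minimal sketch of the cross-check implied above: computing both checksums shows why an adler32 match alone cannot prove two replicas are identical, while md5 still tells the corrupted copy apart. File names are hypothetical:

import hashlib
import zlib

def adler32_and_md5(path, chunk_size=1024 * 1024):
    """Compute both checksums in a single streaming pass over the file."""
    adler = 1  # initial adler32 value, as used by zlib
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            adler = zlib.adler32(chunk, adler)
            md5.update(chunk)
    return f"{adler & 0xffffffff:08x}", md5.hexdigest()

if __name__ == "__main__":
    for name in ("good_copy.RAW", "suspect_copy.RAW"):  # hypothetical file names
        a32, m5 = adler32_and_md5(name)
        print(f"{name}: adler32={a32} md5={m5}")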

29 March - 04 April

  • Activities:
    • MC15c (digi+reco reconstruction) running smoothly and taking up 120k slots. Producing approx 100M events per day, which is good.
    • Heavy Ion reprocessing under testing, will start soon (most probably this week). It will require MCORE high-memory slots.
    • Upgrade studies: running SCORE very-high-memory jobs (more than 4 GB needed), now running in parallel at very few sites.
    • Analysis as usual
  • VOMS-related issue fixed on Tuesday.
  • Disk full at T1 DATADISKs: SARA, BNL. Will reshuffle data.
  • Consistency checks: sites reminded to provide regular dumps (at least quarterly, monthly even better)

22-28 March

  • MC15c (digi+reco reconstruction) running smoothly and taking up most of the grid slots
  • INFN-T1 downtime: a 5-day downtime scheduled at 2 days' notice was very disruptive to ATLAS operations. There was no mention at previous WLCG meetings and no personal notification from the site. This is not acceptable from a T1.
  • A VOMS-related issue seems to have been affecting both PanDA and FTS since the evening of the 28th. Services fail to extract FQANs from the proxy.

15-21 March

  • A new campaign of pile-up digi+reco reconstruction is running, about 70 Mevents per day (using 100k cores, 8-core jobs). This campaign will go on for the coming months.
  • Heavy Ion tests: there will likely be a memory issue with this production. Currently trying to understand the requirements; this week we need to check whether jobs will work with 3 GB per core.
  • FTS shares are not really working, with annoying transient failures. There was an email discussion/thread about this at the FTS level and with steering.
  • INFN-T1 transfer failures are high, with really low efficiency during the last 36 hours. Is it understood what happened? Many jobs are there and we would like it to be working during the week.
  • FZK-LCG2 has a high rate of transfer failures.

1-7 March

  • Amazon EC2 scale test postponed from last week to be continued this week
  • The Taiwan site requests that its CVMFS Stratum-1 not be used until they solve the issues with it
  • A shortage of tasks caused some sites to be empty of ATLAS jobs at the end of the week

22-29 February

  • Amazon EC2 scale test to be continued this week
  • Problems with the Taiwan site, probably due to the network. We observe problems with CVMFS there and with data transfers to/from NDGF
  • Reprocessing finished, another will start today (HI data)

15-21 February

  • Reprocessing activity high (up to 80k cores); only tails should be left by Wednesday
  • The Taiwan CVMFS Stratum-1 seems to be fixed (GGUS:119557)
  • The Amazon EC2 scale test with up to 100k cores should ramp up on Tue/Wed
  • Some RAW files lost on TRIUMF tape; recovery from CERN is stuck due to a strange checksum (GGUS:119568)
  • Campaign to delete old tape data will start soon, ATLAS will contact sites.


Major updates:
-- DavidCameron - 2016-02-19

Responsible: DavidCameron
Last reviewed by: Never reviewed
