Impact of COVID-19 on WLCG operations

Impact on Tier0

  • CERN Tier0 has over the last days largely migrated to service operation via teleworking, and we see no major impact on the WLCG services or operations activities. The current plan is to maintain intervention and development activities as far as possible. The IT general infrastructure is designed for handling remote access to its facilities without compromising computer security. In fact, we operate remotely every year during the CERN annual closure. What is different now is that many more users will be working at a distance. To cope with the anticipated load, we are, for example, increasing the number of Windows Terminal servers, and another 30 servers are ready to be used if necessary. We are also increasing the number of licences needed for video conferencing. But the key to IT operations is its staff. We have organized ourselves to operate mostly remotely, and have tested this mode of operation during the last weeks. The Service Desk will remain available to handle questions.
  • CERN, however, has limitations with regard to the use of some software packages (essentially engineering), whose licence conditions do not permit usage outside of the CERN fenced area. We are working with the respective vendors to obtain more flexible conditions.
  • List of T0 critical services hosted by CERN

Impact on Tier1 sites

| Site | Status and plans overview | Remote operations | Best effort | Degraded | Delayed/postponed upgrades, installation of new resources | Major issues |
| BNL-ATLAS | SDCC is in full remote operations mode at the moment. As BNL transitioned to a minimal operating mode ("Min-Safe") at 1 p.m. ET on 03/23/2020, physical access to SDCC will be permitted only for hardware failures or to fix unexpected outages until further notice. Upgrades may be delayed. | Yes | | | | |
| FZK-LCG2 | Most of the GridKa staff are working remotely and we don't expect any issues. For now, we are still allowed to enter the campus and would also be able to fix hardware problems. We do not know, of course, how long we will be allowed to enter the campus. The additional resources for the increased pledges for 2020 are already in place and only the storage part still needs to be configured, which can be done remotely. | Yes | | | | |
| IN2P3-CC | Since 16th of March, CC-IN2P3 has been operated remotely on a best-effort basis. Provided that hardware suppliers are able to deliver spare parts, we should be able to handle light hardware failures, obviously with some delays. | Yes | Yes | | Deployment of the 2020 disk pledge will not be done on time because hardware delivery is uncertain. The 2020 CPU pledge is already online. | |
| INFN-T1 | Italy is currently in quarantine because of COVID-19. As a result, almost all CNAF people work from home. This situation, in principle, does not affect CNAF operations. We are authorized to go to CNAF in case of serious problems. On the other hand, we cannot guarantee that a maintenance intervention by an external company will happen quickly. For the same reason, we also expect significant delays in the delivery and installation of the new CPU and storage resources. | Yes | | | | |
| JINR-T1 | No noticeable impact on Tier1 & Tier2 operation; we will work as usual and maintain our services. | | | | | |
| KR-KISTI | There has been no impact on operations so far. Remote work from home could be an option if things get worse, but this will not affect operations either. | | | | | |
| NDGF-T1 | The NDGF-T1 regulations vary slightly from country to country, but overall there are no plans to stop any service. Personnel are working remotely but will have access to most facilities in case of emergency, although it will take longer than usual due to commute time. The pledged hardware is installed, and though some of it still needs to be validated, this should be possible to do remotely. | Yes | | | | |
| NL-T1 | The NL-T1 is well equipped to continue normal operations in the current environment, with coronavirus spread across Europe and the precautionary measures currently implemented in our country. All the necessary staff are able to connect remotely to all the necessary systems, so normal operation and support can continue during normal working hours. | | | | | |
| RAL-LCG2 | Assuming a skeleton staff is allowed to continue to operate the data centre, no immediate problems are foreseen in operating the current service. All of this year's procurements have been installed in the machine room, so all capacity should be available as planned for 1st April. However, major upgrades are likely to be delayed. | Yes, from 20th of March if necessary | | | Deployment of a new Tape Robot. An extended period of working from home (4+ months) could lead to a shortage of tape capacity. Upgrading Ceph from Luminous to Nautilus. Upgrading the LHCOPN link from 30Gb/s to 100Gb/s and joining the LHCONE. | |
| RRC-KI-T1 | None so far: we work and will continue to. One possible issue is a slightly longer hardware replacement process for non-trivial parts (mostly the tape library and the central internal switching fabric), though I don't expect (too) many of them, if any at all (being slightly optimistic). | | | | | |
| TRIUMF-LCG2 | For the Canadian Tier-1, no impact is expected for the time being. The region is not severely affected at the moment. There are measures in place regarding social distancing and working from home, which in any case would not be expected to have any significant impact on the Tier-1 operations. | | | | | |
| Taiwan-LCG2 | The TW-T1 center is stably operational as usual. The COVID-19 situation in Taiwan is contained for now. If the local situation becomes worse, a remote operation model will be implemented to ensure the continuity of the Tier-1 center. | | | | | |
| USCMS-FNAL | For USCMS sites and the Fermilab facility in general, the expectation is to continue to meet all our WLCG commitments and maintain our services at their usual high level of reliability. | | | | | |
| pic | Additional internal communication channels have been enabled for this period. CMS and ATLAS Tier-1 and ATLAS Tier-2 operations are fully supported. As of 30th March, only essential services can operate in Spain. We can operate PIC, and if a major incident requires it, one can go to the site with permission from the University. This is expected to end by 11th April. The 17th of March downtime (HTCondor and dCache upgrades) was carried out fully remotely. | Yes | | | | |

Impact on Tier2 sites

  • USCMS sites. For USCMS sites the expectation is to continue to meet all our WLCG commitments and maintain our services at their usual high level of reliability.
  • ALICE T2. As of this moment (things may change soon), none of the ALICE T2 sites has been instructed to shut down. They will keep the same operational principle as at CERN: essential interventions will be allowed.
  • UK Tier-2 sites are operating normally, though increasingly moving to "holiday mode" with little on-site presence. This may mean slower response to problems and the delay of upgrades. New procurements could be delayed but we have more than enough resources available to meet current pledges.
  • IFIC-LCG2. Site is operating normally. Staff are required to work from home due to current restrictions. Expect delays if any on-site intervention is required.

  • DE Sites:
    • DESY-HH: reduced staff on site, rest working from home, pledged CPU and storage resources are on site but installation might take a bit longer
    • WUPPERTALPROD: no one on site anymore, all staff in home office. Technical staff who live nearby are on standby. Longer outage expected in case of infrastructure failures (for example cooling).
    • GoeGrid: no one is on site in the physics building; we work from home. The computing facility is still supported via tickets and phone. In case of a large system failure, we may still be able to get into the server room. Recovery will naturally take longer than in usual operation. The rest of the schedule, such as new hardware installation, is unclear.
    • IEPSAS-Kosice and FMPhI-UNIBA: no staff on site, all in home office. Restricted access available in case of hardware failures. Expect delays if any on-site intervention is required.
    • Prague: no staff on site, all in home office. Access in case of HW issues available, but delays might occur. Storage servers at the remote site (hostnames *.ujf.cas.cz) are turned off and will not be available while restrictions are in place.

  • FR sites: from March 16th, people are working from home due to current restrictions. Details for each French site below.
    • Sites operation:
      • GRIF: site will work normally as long as there is no major failure. Access to site is restricted to interventions for technical infrastructures and their security. Hardware replacement will be delayed.
      • IN2P3-CPPM: site will work normally as long as there is no major failure. Access to site is restricted to essential interventions. Delays can be expected.
      • IN2P3-IPNL: site will work normally as long as there is no major failure. Access to site is restricted to essential interventions. Delays can be expected.
      • IPHC/IRES T2: The IN2P3-IRES site will work normally as long as there is no major hardware or cooling failure. We have restricted access to the site and basic maintenance (like hardware replacement) can be performed.
      • IN2P3-LAPP: site will work normally as long as there is no major failure. Access to site is restricted to essential interventions. Delays can be expected.
      • IN2P3-LPC: site will work normally as long as there is no major failure. Access to site is restricted to essential interventions. Delays can be expected.
      • LPSC T2: The IN2P3-LPSC site will work normally if we have no major hardware failure. We have restricted access to the site.
      • IN2P3-SUBATECH: site will work normally as long as there is no major failure. Access to site is restricted to essential interventions with uncertainty about the possibility of a provider's intervention.
    • Impact on FR sites pledges for 2020:
      • GRIF: installation of some storage servers delayed. Some servers may be shut down if necessary to preserve the air conditioning.
      • IN2P3-CPPM: all pledges deployed.
      • IN2P3-IPNL: no pledge to deploy for 2020.
      • IPHC/IRES T2: deployment of the last 80 TB for ALICE delayed.
      • IN2P3-LAPP: all pledges already deployed. Additional capacity available in case of failure.
      • IN2P3-LPC: all pledges already deployed.
      • LPSC T2: end of deployment delayed.
      • IN2P3-SUBATECH: not deployed. Uncertainty about the possibility of installing the new servers.

  • CA Sites: from March 18th, most people are working from home due to current recommendations (no restrictions so far). Canadian sites have moved to a best-effort mode. Details for each site below.
    • SFU-T2: The site should operate normally. There are no restrictions on access to the site; however, it is recommended not to enter the machine room.

  • Hong Kong Site: (March 19th) Most of the staff are working remotely and HK-LCG2 is in stable operation so far. However, physical installation & operation have been suspended, which may affect the provision of additional resources for the pledges for 2020.

  • IT Sites: no persons allowed on site, all staff working from home. On-site interventions are possible in case of recoverable hardware failures, although with inevitable delays. No upgrades or new installations will be possible. A major reconfiguration of the INFN-MILANO storage to solve the recurrent issues with stagein and stageout was planned for these weeks and is clearly not possible now: oscillations in the site performance are expected to continue.

Impact on experiment operations

ALICE

ALICE continues to operate at a normal level, with almost all centres on lockdown. We are conscious of the pledge installation being delayed at many places and are pushing disk space-hungry productions to later in the year to ease the pressure on the sites. The current emphasis is on data analysis, which requires little additional disk space. It is too early to say if this practice will be entirely successful. We appreciate the substantial efforts of the site administrators to keep the computing infrastructure operational in these extraordinary circumstances and wish everyone to be safe!

ATLAS

  • ATLAS distributed computing operations currently continue to work much as before
  • Continuing with nominal mix of production and user analysis as long as all sites are up and running and central infrastructure components like PanDA and Rucio are accessible.
  • Basically everyone is working remotely and/or from home
  • Continuing with brief daily operations morning meetings, now with a remote CRC shifter. Operations shifters are working remotely as before.
  • Will continue with gradual updates/upgrades of central ATLAS grid services like unified PanDA queues and some TPC tests.
  • Will avoid any overly disruptive tests of infrastructure in the coming weeks

CMS

CMS expects to operate the distributed grid infrastructure without significant degradation, at this time. We expect hardware and software upgrades and changes to be delayed and resource levels to remain at the 2019 pledge level for many sites until the Covid-19 situation eases. We anticipate difficulties rotating the operations staff at the usual rate. There are ongoing high-priority production activities but no time-critical campaign(s) scheduled for the next month.

LHCb

LHCb distributed computing has been operating normally up to now. The whole team is of course working from home, with constant dedication. Apart from that, we don't expect to see any visible reduction in the working time. We are aware of the difficulties that some sites might experience.

Input from EGI

  • Regarding the delivery of services to all communities, including WLCG, EGI are currently assessing the situation. In the majority of cases, services are continuing as normal, albeit with staff working remotely from their normal place of work. A press release has been made on the EGI website: https://www.egi.eu/news/egi-and-covid-19.

Security concerns

  • Security teams at EGI and OSG will continue to operate normally in the foreseeable future, but of course more widespread teleworking will increase security risks to sites, because people use their own computers at home, which may be more easily compromised.

Contribution of the WLCG resources for COVID-19 research

Topic revision: r30 - 2020-04-04 - MaartenLitmaath