Proposed LHCOPN operational model

This operational model covers only the LHCOPN; the precise list of its sites and links is given on the page NamingConventionAndLinksIDs. An LHCOPN link is a dedicated link, part of the network specifically put in place to distribute data from the T0 to the T1s.

Foundations

Drawing conventions

DC.png

The main steps of each process are explained below its diagram, in a hierarchical way:
1.1 Process one, step one
1.2 Process one, step two

2.1 Process two, step one
2.2 Process two, step two

Processes should be carried out in order, although the steps within a process can often be reordered without breaking the overall process.

Actors

Actors.png

  • LCG: Large Hadron Collider Computing Grid
  • L2 NOC: Network operating centre for L2 services
  • DANTE: Delivery of Advanced Network Technology to Europe http://www.dante.net/
  • NREN: National Research and Education Network http://en.wikipedia.org/wiki/National_research_and_education_network
  • GÉANT2: The Pan-European network http://www.geant2.net/ , managed by DANTE
  • Sites: LHC T0/T1s
  • Router Operators: People in charge of network devices on sites
  • Grid Data contact: People in charge of the data transfers occurring on the LHCOPN. They are the main users of the LHCOPN.
    • This is a generic role in charge of interactions with the Grid world (impact assessment, broadcasting...). It could be filled by anybody, e.g. Grid people.
  • DANTE Operation:
    • Role: Supervising and coordinating L2 and L3 monitoring deployment
  • ENOC: EGEE Network operating centre
    • Role: Help design processes for the LHCOPN, fit them with Grid operations and drive the design of the LHCOPN TTS
  • LQA: LHCOPN Quality assessment
    • Role: Statistics and assessment of infrastructure and processes

Actors and information repositories management

AIRM.png

The responsibility depicted concerns setting up the information repositories and keeping them working, not their contents.

Information repositories location:

The private area on CERN's twiki is accessible (read and write) to anyone authenticated on the CERN twiki (Nice account, twiki light account...). A registration page is here. The aim is not to prevent people from accessing information, but to avoid potentially sensitive information (changes in IPs, ACLs, security, reports on the LHCOPN...) being fully disclosed and indexed by web crawlers.

Information access

IA.png

Processes

The thresholds are:

  • Any event with an impact on the service must be reported, with at least one ticket per issue (not per event).
  • Events without service impact should be reported in the TTS if they last more than 1 hour or occur more than 5 times an hour.
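
As an illustration, the two thresholds above can be written as a small predicate. This is only a sketch: the `Event` record and its field names are hypothetical and do not correspond to any existing LHCOPN tool.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical description of a network event (names are assumptions)."""
    service_impacting: bool
    duration_minutes: float
    occurrences_last_hour: int


def should_report(event: Event) -> bool:
    """Return True when the event falls under the reporting thresholds."""
    # Any event with an impact on the service is always reported.
    if event.service_impacting:
        return True
    # Events without impact: more than 1 hour, or more than 5 times an hour.
    return event.duration_minutes > 60 or event.occurrences_last_hour > 5
```

For example, a 10-minute flap without service impact seen 6 times in the last hour would be reported, while a single 30-minute one would not.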

Global Problem management processes

PB.png

This process aims to address problems with cause and location still unknown.

  1. It is initiated by a router operator, possibly triggered by Grid data contacts (low throughput experienced, etc.). The Grid Data contact is kept informed by the router operators.
  2. After reviewing the current state of the LHCOPN (monitoring, ticket dashboard...) and making sure the problem is not already ongoing and reported, the incident management process is started.
  3. We follow a top-down approach, starting at L3 (routers, IP, BGP, filtering...) with the L3 incident management process.
  4. If unsuccessful, we move to L2 (dark fibres...) with the L2 incident management process. It is the responsibility of router operators to distinguish between L2 and L3 problems.
  5. If no previous process is able to tackle the issue within a reasonable time, the Escalated incident management process is initiated by the router operator of the site noticing the problem.
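
The top-down order of the steps above can be sketched as follows. The function and parameter names are hypothetical; the two callables stand in for the L3 and L2 incident management processes and return True when the problem was handled at that layer.

```python
from typing import Callable


def manage_problem(
    already_reported: bool,
    resolve_at_l3: Callable[[], bool],
    resolve_at_l2: Callable[[], bool],
) -> str:
    """Return the name of the process that ends up handling the problem."""
    # Step 2: do not open a second process for a problem already reported.
    if already_reported:
        return "existing ticket"
    # Step 3: start at L3 (routers, IP, BGP, filtering...).
    if resolve_at_l3():
        return "L3 incident management"
    # Step 4: if unsuccessful, go down to L2 (dark fibres...).
    if resolve_at_l2():
        return "L2 incident management"
    # Step 5: nothing worked within a reasonable time, so escalate.
    return "escalated incident management"
```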

Incident management process

Even incidents that are already resolved when noticed should be reported (for post-mortem analysis, quality assessment, information, etc.).

L3 incident management process

IL3.png

Scope: Router down, BGP filtering, bad routing...
The source site is the site where the problem lies.

1.1 A ticket is created on the LHCOPN Helpdesk for reporting by the router operator of the source site. It is assigned to the source site itself.
1.2 The router operator contacts its counterpart on the distant site (site-site communication) to find out whether something is wrong (power outage...). If the problem is on the distant site, the distant site starts this process (the ticket is then re-assigned to the distant site).
1.3 If the problem is related to an underlying layer (L2: dark fibre outage...) the router operator starts the L2 incident management process. The router operator is responsible for managing the problem with the L2NOC (opening and following the NOC's ticket...) and remains responsible for the LHCOPN ticket in GGUS.
1.4 Otherwise the router operator owns the problem and contacts its local Grid Data contact to report the impact. The distant router operator is also informed.

2 The LHCOPN TTS notifies all impacted sites about the incident.

L2 incident management process

IL2.png

Scope: Dark fibre outages...

1.1 An L2NOC or a router operator may notice an L2 incident. They interact to confirm it or not. A router operator may also be alerted by the L3 incident management process, through an LHCOPN ticket assigned to its site.
1.2 If confirmed, the router operator of a linked site opens a ticket in the LHCOPN TTS. The router operator is in charge of dealing with the L2 network providers involved and of reflecting the ongoing resolution in the LHCOPN TTS.
1.3 It is the responsibility of linked and affected sites to warn their Grid data contact.

2 All impacted sites will be notified by the TTS.

3 If nothing is found at L2 the Escalated incident management process is started.

Escalated incident management process

If no previous process is able to tackle an issue (strange, erratic performance problem, filtering...), or if the resolution time seems unreasonable, this process is initiated by the router operator of the site noticing the problem. This process should be started after a maximum delay of one week (i.e. at least for any incident lasting more than one week).

The router operator will hold a phone conference with the people from all potentially faulty domains involved, to agree on a work plan to localise and fix the issue. The precise list of people/organisations attending depends on the outage and is chosen by the router operator. The existing ticket in GGUS is updated with the outcomes of the phone conference, and its priority is increased.
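
The one-week limit could be checked with something like the sketch below. The function name and its parameters are hypothetical; only the one-week maximum delay comes from the process description.

```python
from datetime import datetime, timedelta

# Maximum delay before escalation, as stated in the process description.
MAX_DELAY_BEFORE_ESCALATION = timedelta(weeks=1)


def escalation_due(opened_at: datetime, now: datetime, resolved: bool = False) -> bool:
    """Return True when an unresolved incident has lasted more than one week."""
    if resolved:
        return False
    return now - opened_at > MAX_DELAY_BEFORE_ESCALATION
```

Escalation may of course be started earlier whenever the resolution time seems unreasonable; the check above only covers the upper bound.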

Change management process

The change management process tracks and documents major changes occurring on the LHCOPN (infrastructure, routing, filtering,...).

  • A change without impact can be made at any time.
  • A change with impact MUST be implemented through a maintenance.
This process is distinct from maintenance, because a maintenance can occur without any change (e.g. a power cut scheduled by the power supplier, a fibre needing to be cleaned...).

Major changes include at least: change in routing, change in filtering, new IP prefix, fibre change, change of IOS version.

There is no negotiation for changes; if necessary, negotiation takes place in the maintenance implementing the change.

To roll back a change there are two possibilities:

  • The rollback is done within the maintenance window: the maintenance is considered not done
  • The rollback has to be done after the end of the maintenance window: another change process should be started to do the rollback
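
The two rollback cases can be sketched as a simple decision on timestamps. The function name and the returned strings are hypothetical and only serve to illustrate the rule.

```python
from datetime import datetime


def rollback_outcome(window_end: datetime, rollback_time: datetime) -> str:
    """Say how a rollback is handled, given the end of the maintenance window."""
    if rollback_time <= window_end:
        # Rolled back within the window: as if the maintenance never happened.
        return "maintenance considered not done"
    # Rolled back after the window: the rollback is itself a new change.
    return "start a new change process"
```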

L3 change management process

CL3.png

Scope: IP addresses change, new prefix propagated, new filtering

The source actors for these changes are router operators.

1.1 The router operator presents the change to its Grid data contact (change in performance, new resiliency possibility...)
1.2 The router operator presents the change to affected sites (e.g. linked sites)

2.1 The change will be fully documented in the change management database, and the technical information will also be updated.
2.2 DANTE Operation may be warned if the change has an impact on the monitoring (new IP to be watched, etc.). The site is responsible for ensuring and following up the update of the monitoring system.
2.3 ENOC may be warned to update the L3 BGP monitoring and/or to trigger an update of the trouble ticket system. The site is responsible for ensuring and following up on that.

3 If the change has an impact, an L3 maintenance management process will be started to commit and broadcast the change. A link to the full documentation of the change is to be provided (e.g. a URL to the global web repository).

If an L3 change impacts L2 (an L3 VPN, for instance), the L3 change management process is started, as it is the major event. If the change has no impact it can be made silently, but it must be accurately documented.

L2 change management process

CL2.png

This is a complex process because the lower the layer, the greater the impact. An L2 change can have an impact at L3 (new IP addresses for a new link...), but everything is handled within the L2 change management process, as it is the root event.

Scope: New LHCOPN L2 link, L2 link with new physical path, change of L2 network provider for a segment...

The sources for L2 changes are L2 network providers.

1.1 The L2NOC sends its change to the router operators of affected sites
1.2 The router operator presents changes and impacts to its Grid data contact
1.3 The router operator presents changes and impacts to the router operators of impacted sites

2.1 The change will be documented by the router operator on the global web repository, and some technical information should also be updated
2.2 DANTE Operation may be warned if the change has an impact on the monitoring (new IP to be watched, etc.). The site is responsible for ensuring and following up the update of the monitoring system.
2.3 ENOC may be warned to update the L3 BGP monitoring and/or to trigger an update of the trouble ticket system. The site is responsible for ensuring and following up on that.

3 If the change has an impact, an L2 maintenance management process will be started to commit the changes. Otherwise the change can be made silently, but must always be accurately documented.

The Backup test process should be run whenever a new resiliency possibility becomes available, to validate it and to ensure nothing else is affected.

Maintenance management process

L3 maintenance management process

ML3.png

Scope: scheduled power outage on site, router IOS upgrade, ...

1.1 The router operator on the source site tries to find a suitable date with its local Grid Data contact
1.2 The date can also be negotiated - off the record - with all sites that could be affected by the maintenance (e.g. linked sites)

2 A ticket is created in the LHCOPN TTS by the router operator of the source site
3 All affected sites are notified by the LHCOPN TTS
4 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. It ends when the LHCOPN TT is closed.

There is no public negotiation phase.

The notice window for announcing a maintenance depends on its impact:

Impact duration     Notice window
More than 1 hour    1 week
Less than 1 hour    2 days
No impact           1 day

This is compliant with the WLCG rules for scheduled downtimes.
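
The table above can be read as a lookup from impact duration to minimum notice. The sketch below is illustrative only: the function name is hypothetical, and since the table does not say what applies to an impact of exactly one hour, the code assumes the longer (1 week) window in that case.

```python
from datetime import timedelta


def notice_window(impact_duration: timedelta) -> timedelta:
    """Return the minimum notice for a maintenance with the given impact duration."""
    if impact_duration <= timedelta(0):
        # No impact: 1 day of notice.
        return timedelta(days=1)
    if impact_duration >= timedelta(hours=1):
        # More than 1 hour: 1 week. Exactly 1 hour is not covered by the
        # table; treating it as the long window is an assumption.
        return timedelta(weeks=1)
    # Less than 1 hour: 2 days.
    return timedelta(days=2)
```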

A maintenance put into the TTS and broadcast is silently accepted after one day (2 hours if it has no impact). All delays are expressed in working hours, 09:00 to 16:00 UTC. All days are considered working days. Emergency maintenances are allowed but should not be common.
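
To illustrate how a delay counted in working hours (09:00 to 16:00 UTC, every day being a working day) translates into a wall-clock time, here is a minimal sketch. The function name is hypothetical and it expects a timezone-aware `datetime`. How "one day" maps onto working hours is not specified here, so the caller passes the delay explicitly; the example uses the 2-hour case.

```python
from datetime import datetime, time, timedelta, timezone

WORK_START = time(9, 0)   # 09:00 UTC
WORK_END = time(16, 0)    # 16:00 UTC


def add_working_time(start: datetime, delay: timedelta) -> datetime:
    """Return the instant at which `delay` of working time has elapsed after `start`."""
    current = start.astimezone(timezone.utc)
    remaining = delay
    while True:
        day_start = datetime.combine(current.date(), WORK_START, tzinfo=timezone.utc)
        day_end = datetime.combine(current.date(), WORK_END, tzinfo=timezone.utc)
        # Before opening time: the clock only starts running at 09:00.
        if current < day_start:
            current = day_start
        if current < day_end:
            available = day_end - current
            if remaining <= available:
                return current + remaining
            remaining -= available
        # Out of working time for today: continue at the next day's opening.
        current = day_start + timedelta(days=1)


# A no-impact maintenance broadcast at 15:00 UTC: 1 working hour remains that
# day, so the 2-hour delay ends at 10:00 UTC on the following day.
broadcast = datetime(2009, 4, 7, 15, 0, tzinfo=timezone.utc)
accepted_at = add_working_time(broadcast, timedelta(hours=2))
```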

Even maintenances without impact should be put on the TTS (maintenance at risk for instance).

L2 maintenance management process

ML2.png

The sources for L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...).

Often there will be no negotiation phase for L2 maintenance with L2 network providers. But if an event is really disruptive, negotiation should be attempted.

1.1 The L2NOC sends its maintenance notice to the connected or affected router operators. The first router operator notified starts this process.
1.2 The router operator warns its Grid data contact (and may check with them that the date is suitable)
1.3 The router operator may check with distant affected sites - off the record - that the date is suitable
1.4 If a disruptive overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.

2 All impacted sites are notified.

3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. It ends when the LHCOPN TT is closed.

Handling Multi Hop troubles

MHT.png

Problem example:

  • Site 1 is unable to reach site 3 but is able to reach site 2
  • Site 2 is able to reach site 3

Proposed handling:

  • L3 problem assigned by site 1 to site 3
  • If no resolution, site 1 reassigns it to site 2

Benefits:

  • Keep only one ticket per trouble enabling serialisation of trouble resolution
  • Responsibility for the problem is transferred with the ticket's re-assignment
  • Initiator follows trouble
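
The proposed handling can be pictured with a single ticket whose assignee changes over time. This is only a sketch: the `Ticket` class, its fields and the `handle_multi_hop` function are hypothetical and do not describe the actual GGUS data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ticket:
    """Hypothetical single ticket following one multi-hop problem."""
    initiator: str
    assigned_to: str
    previous_assignees: List[str] = field(default_factory=list)

    def reassign(self, site: str) -> None:
        # Responsibility moves together with the assignment.
        self.previous_assignees.append(self.assigned_to)
        self.assigned_to = site


def handle_multi_hop(initiator: str, far_site: str, middle_site: str,
                     resolved_by_far_site: bool) -> Ticket:
    """Site 1 assigns the ticket to site 3, then to site 2 if unresolved."""
    ticket = Ticket(initiator=initiator, assigned_to=far_site)
    if not resolved_by_far_site:
        ticket.reassign(middle_site)
    return ticket
```

Keeping a single object for the whole problem mirrors the stated benefit of one ticket per trouble, with the initiator unchanged throughout.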

Responsibilities

  • Outages on links between the T0 and a T1 are the responsibility of the T1 (which ordered the link)
  • Responsibility for outages on T1-T1 links is being studied (it should be mapped from existing contracts by studying the cost model: who pays what, where).
  • Responsibility for a GGUS ticket lies with the site to which the ticket is assigned.

LHCOPN Operational Working Group

Contact

This page is maintained by the LHCOPN Operational Working Group. It can be reached at project-lhcopn-opswg@cernNOSPAMPLEASE.ch.
Topic revision: r23 - 2010-11-24 - GuillaumeCessieux
 