Last Page Update
GavinMcCance
2008-09-15

This page is obsolete, please go to FtsInterventions.

Completed FTS interventions at CERN.

For a list of upcoming or ongoing interventions: FtsTier0ServerInterventions.

| Date | Actioned By | FTS Instance | Details |
| 2008-01-28 | GavinMcCance | prod, tiertwo, pilot | Updated to patch 1589. |
| 2007-10-31 | SteveTraylen | prod, tiertwo, pilot | 9 host certificates updated. |
| 2007-10-30 | SteveTraylen | tiertwo | CERN-KI and KI-CERN GUC_TIMEOUT increased to 5200 seconds. |
| 2007-10-30 | SteveTraylen | prod, tiertwo | Update of services.xml file. 439-454 srms |
| 2007-10-30 | SteveTraylen | tiertwo | Addition of CERN-KI and CERN-SINP and their inverses. |
| 2007-10-11 | SteveTraylen | prod | Removed httpg://cmssrm.fnal.gov:8443/srm/managerv2 from services.xml. Because of a bug, this entry must be maintained by hand until the bug is fixed. |
| 2007-10-10 | SteveTraylen | prod, tiertwo, pilot | (FTA_TYPEDEFAULT_SRMCOPY_GUC_MAXTRANSFERS, FTA_TYPEDEFAULT_URLCOPY_GUC_MAXTRANSFERS) increased from (40,100) to (300,300). Should help CERN-FNAL. |
| 2007-10-08 | SteveTraylen | tiertwo | Addition of CERN-JINR, CERN-PROTOVINO and their opposites to tiertwo. |
| 2007-09-17 | SteveTraylen | tiertwo and prod | FtsTier0ServerInterventionPlanPatch1232 completed. |

Redistribution of Prod FTS Agents

Objective:
  • prod-fts-ws.cern.ch transfer agents are overloaded.
  • Will migrate CERN-BNL, CERN-IN2P3, BNL-CERN, SARA-CERN, CERN-RAL, PIC-CERN, CERN-TRIUMF, CERN-ASCC from fts110 and fts111 to fts112

Service: CERN Production FTS export service - prod-fts-ws.cern.ch
StartDate: 08:00 UTC (10:00 CEST), July 20th 2007
Duration: one hour
Impact: Service Degradation

From 10:00 CEST on Friday July the 20th the following FTS channels will
be paused while they are transferred to new hardware. During this
time the FTS will continue to accept new jobs and will queue them
for execution after the migration.

CERN-BNL, CERN-IN2P3, BNL-CERN, SARA-CERN
CERN-RAL, PIC-CERN, CERN-TRIUMF and CERN-ASCC 

The migration is expected to last one hour. No further broadcast will be sent
upon successful completion of the migration within the hour.

  • Set nodes to SMS maintenance status.
  • Mark relevant channels as inactive.
  • Wait for existing transfers to complete.
  • Stop relevant channel agents.
  • Reconfigure with CDB.
  • Reconfigure channel agents.
  • Start migrated channel agents.
  • Set nodes to SMS production status.
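
The "wait for existing transfers to complete" step in the list above amounts to a polling loop with a timeout. The following is a minimal sketch only: `count_active` is a hypothetical callable (the page does not say how active transfers were counted), and the 30-second poll interval and one-hour timeout are illustrative defaults, the latter chosen to match the announced duration.

```python
import time
from typing import Callable


def wait_for_drain(count_active: Callable[[], int],
                   poll_seconds: float = 30.0,
                   timeout_seconds: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no transfers are active.

    count_active is a hypothetical hook returning the number of transfers
    still running on the paused channels. Returns True once it reports zero,
    or False if timeout_seconds elapses first.
    """
    waited = 0.0
    while True:
        if count_active() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


if __name__ == "__main__":
    # Simulated drain: three transfers, then one, then none.
    remaining = iter([3, 1, 0])
    drained = wait_for_drain(lambda: next(remaining), poll_seconds=0.0)
    print("drained:", drained)
```

Only when such a check reports that the channels are drained would the channel agents be stopped; on timeout the operator decides whether to keep waiting.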

Upgrade of production tier-0 export to FTS 2.0

Scope:

  • The production T1 export service and production T2<->T1 service
  • The tier-2 production service will not be upgraded at this point.
  • The pilot service is already running FTS 2.0.

This has been sent as a broadcast to the CERN MOD for the CERN IT board and is also entered in the GOCDB.

Services:  prod-fts-ws.cern.ch - CERN T0 export FTS service
                tiertwo-fts-ws.cern.ch - CERN T0<->T2 FTS service
Duration: Monday June 18th 09:00 CEST (07:00 UTC) -> 12:30 CEST (10:30 UTC)
Impact:    The services will be unavailable for VOs ALICE, ATLAS, CMS, LHCB, DTEAM and OPS.


The CERN T0 export FTS, prod-fts-ws.cern.ch, and the CERN T0<->T2 service, tiertwo-fts-ws.cern.ch,
will be upgraded to FTS v2.0 on Monday June 18th.

It is anticipated that service should be restored by 12:30 CEST. Any delay in 
this will result in another announcement.

During this time both services will be completely unavailable.

The pilot service pilot-fts-ws.cern.ch may also be unavailable at this time during the upgrade.

For questions: fts-support@cern.ch
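
The announcement quotes the window in both CEST and UTC, and the two are easy to get out of step. As a quick cross-check, the sketch below converts the CEST times to UTC and computes the window length. It assumes a fixed UTC+2 offset for CEST and assumes the year 2007, which the announcement does not state; 2007 is inferred from the neighbouring entries and from June 18th falling on a Monday.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Central European Summer Time (assumption: UTC+2 applies).
CEST = timezone(timedelta(hours=2), "CEST")


def to_utc(year: int, month: int, day: int, hour: int, minute: int) -> datetime:
    """Interpret the given wall-clock time as CEST and convert it to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=CEST)
    return local.astimezone(timezone.utc)


# Year 2007 is an assumption; the announcement only says "Monday June 18th".
start = to_utc(2007, 6, 18, 9, 0)
end = to_utc(2007, 6, 18, 12, 30)

print(start.strftime("%H:%M UTC"), "->", end.strftime("%H:%M UTC"))  # 07:00 UTC -> 10:30 UTC
print("window:", end - start)  # window: 3:30:00
```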

Scope:

  • Production tier-0 export service
  • Production tier-2 service

Preparation steps:

  • Verify that the FTA agent actuator is disabled when the nodes are in maintenance. VERIFIED
  • Only two CDB templates need updating pro_system_gridfts and pro_type_gridfts_slc3. These are now in ~straylen/fts-upgrade and have been validated at CDB level.
  • The primary schema upgrade script is in the transfer-fts FTS 2.0 RPM: /opt/glite/etc/glite-data-transfer-fts/schema/oracle/oracle-upgrade_2.2.1-3.0.0.sql
  • The history schema upgrade script is in /afs/cern.ch/user/m/mccance/public/fts20-upgrade-intervention/fts_history_tables-upgrade_2.2.1-3.0.0.sql

Migration steps:

  • Switch all channels to Inactive. DONE
  • Go to coffee while they drain currently running transfers. DONE
  • Put all production nodes in maintenance. DONE
  • There are three DBMS user jobs running: stop them (SQL*Plus on lcg_fts_prod):
    • exec fts_stats.stop_hourly_job; DONE
    • exec fts_history.stop_job; DONE
    • exec fts_statecount.stop_job. DONE
    • Verify that select * from user_jobs; returns no rows. DONE
  • Stop the web-services (fts101, fts114, fts115). DONE
  • Stop the agent daemons (fts110, fts111, fts112, fts113). DONE
  • Stop the multitude of little scripts running on the FTS monitoring node (fts102). DONE Move to /cron.d/
  • Ask DB team (contact Miguel Anjo) to copy the partial schema to the backup account. This should take around 20 minutes. DONE
  • ... [upgrade software] DONE
  • ... [upgrade CDB yaim configuration for FTS2.0]. Backup the old one. DONE
  • BACKOUT 1
  • Upgrade the main schema (this should take around 2 minutes) DONE
  • Upgrade the history schema (this should take around 20 minutes) DONE
  • Load the delegation schema (YAIM will insist anyway). DONE
  • Run the writer account script to build new synonyms and make the appropriate grants: FtsServer20WriterAccount. DONE
  • BACKOUT 2

Cleanup:

  • Restart the web-services (fts114, fts115). DONE
    • Test a few commands. DONE
  • Restart the agent daemons (fts110, fts111, fts112, fts113). DONE
  • Restart the monitoring scripts on fts102.
  • Re-enable jobs:
    • exec fts_history.submit_job; DONE
    • exec fts_stats.submit_job; DONE
    • exec fts_statecount.submit_job. DONE
  • Apply the "Start the DBMS job" procedure from FtsAdminTools15 for both of these jobs, to start them off. DONE
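
Each DBMS job stopped during the migration has to be resubmitted during cleanup, so it can help to keep the stop and restart calls paired in one place. The sketch below only assembles the SQL*Plus statements quoted in the steps above into two scripts; it does not connect to any database. The trailing semicolons on the `fts_statecount` calls are an assumption, since the page ends those lines with a full stop.

```python
# Stop and restart calls as quoted in the migration and cleanup steps.
# Each entry is (stop statement, restart statement).
DBMS_JOBS = {
    "fts_stats": ("exec fts_stats.stop_hourly_job;", "exec fts_stats.submit_job;"),
    "fts_history": ("exec fts_history.stop_job;", "exec fts_history.submit_job;"),
    # Semicolons here are assumed; the page shows a full stop instead.
    "fts_statecount": ("exec fts_statecount.stop_job;", "exec fts_statecount.submit_job;"),
}

# Query from the migration steps; it should return no rows once all jobs are stopped.
VERIFY_STOPPED = "select * from user_jobs;"


def stop_script() -> str:
    """Return the statements that stop every job, followed by the check."""
    lines = [stop for stop, _ in DBMS_JOBS.values()]
    lines.append(VERIFY_STOPPED)
    return "\n".join(lines)


def restart_script() -> str:
    """Return the statements that resubmit every job."""
    return "\n".join(start for _, start in DBMS_JOBS.values())


if __name__ == "__main__":
    print(stop_script())
    print(restart_script())
```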

Test:

  • Check transfers are running on agent nodes. DONE
  • BACKOUT 3.
  • Announce service is back. DONE

BACKOUT 1 - "the software install went horribly wrong"

  • Revert the CDB templates from backup
  • Put back old RPMS and re-run ncm-yaim
  • Go to "Cleanup".

BACKOUT 2 - "the schema upgrade went horribly wrong"

  • Contact Miguel Anjo. Revert partial schema from backup account.

BACKOUT 3 - it doesn't work.

  • Try to fix it :-)
  • Stop all daemons as before.
  • Apply BACKOUT 2 to revert schema.
  • Apply BACKOUT 1 to revert configuration.
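
The three backout procedures nest: BACKOUT 3 stops the daemons and then applies BACKOUT 2 followed by BACKOUT 1. A minimal sketch of that relationship follows, with the step texts paraphrased from the lists above. The function name and the integer checkpoint numbering are illustrative only, and the "try to fix it" step of BACKOUT 3 is left out because it is not a revert action.

```python
from typing import List

BACKOUT_1 = [
    "Revert the CDB templates from backup",
    "Put back old RPMS and re-run ncm-yaim",
    "Go to Cleanup",
]
BACKOUT_2 = [
    "Revert partial schema from backup account (contact the DB team)",
]


def backout_plan(checkpoint: int) -> List[str]:
    """Return the ordered revert actions for the given BACKOUT checkpoint."""
    if checkpoint == 1:
        return list(BACKOUT_1)
    if checkpoint == 2:
        return list(BACKOUT_2)
    if checkpoint == 3:
        # BACKOUT 3: stop the daemons, revert the schema, then the configuration.
        return ["Stop all daemons as before"] + BACKOUT_2 + BACKOUT_1
    raise ValueError(f"unknown backout checkpoint: {checkpoint}")


if __name__ == "__main__":
    for number, action in enumerate(backout_plan(3), start=1):
        print(number, action)
```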

Fallout: Now that the upgrade is complete, some things noticed during the upgrade need tidying up.

  • Finish disabling STAR and T2 channels on prod service. DONE
  • ncm-yaim component needs to support FTS2 and FTA2 target. DONE
  • tiertwo service needs to have log archiving enabled. DONE
  • Test reboot and reinstall of pilot service. DONE, but shutdown of the web service still needs doing.
  • Online rebuild of idx_report_file index. ONLINE index rebuild affects performance really badly DONE
  • Still to restart monitoring daemons on fts102 - do after index build is complete.
  • Understand fragmentation of tables. ONGOING
  • Switch on R-GMA gin again. DONE

Some issues have been noticed on the new FTS 2.0 service. These are tracked in Fts20Tier0ServiceIssues.

Deployment on new hardware and split export service from tier-2 service

Current situation:

  • fts101 - Channel agent for CERN-FNAL, ASCC-CERN, BNL-CERN, CERN-ASCC, CERN-BNL, CERN-DESY, CERN-INFN, CERN-PIC, CERN-TRIUMF, DESY-CERN, FNAL-CERN, INFN-CERN and TRIUMF-CERN.
  • fts102 - Channel agent for CERN-CERN, CERN-GRIDKA, CERN-IN2P3, CERN-NDGF, CERN-RAL, CERN-SARA, GRIDKA-CERN, IN2P3-CERN, NDGF-CERN, RAL-CERN and SARA-CERN.
  • fts105 - VO Agents
  • fts107 - Channel agents for T2->T0 transfers.
  • fts103, fts104, fts108 - Webservice

The migration will achieve the following:

  • Production tier1 (PT1) FTS service for T0->T1. prod-fts-ws.cern.ch
    • There will be no downtime to the production tier1 service.
    • The production tier1 FTS service will be managed by Quattor, which is currently not the case.
    • The production tier1 service will be running on new hardware
    • The T2->T0 service will no longer be part of the production tier1 service and will have its own service.

  • Production tier2 (PT2) FTS service for T2->T0. prod-t2-fts-ws.cern.ch
    • There will be a new FTS endpoint for this service. prod-t2-fts-ws.cern.ch
    • This will no longer be part of the tier1 service.
    • The service will be managed by Quattor.
    • There will be a completely new database account for this service.

Preparation Steps:

  • Quattor deploy with SMS maintenance mode switched on.
    • New PT1 FTS web-servers, fts114, 115.
      • Verify that FTS submission works using these canonical host names.
      • Verify that firewall settings are correct for these.
      • Verify that resource BDIIs are populated.
    • New PT1 VO agents on fts113 with incorrect DB password.
    • New PT1 Channel Agents on fts110, 111 and 112 with incorrect DB password. (fts110 should at first run the T2 channels)
  • Follow the DnsAliases procedures to create a load-balanced alias prod-t2-fts-ws.cern.ch for the PT2 service.
  • Request new database account for the PT2 service.

Migration Steps:

  • Expand PT1 aliases to include new web services. Complete.
    • Enable production mode within SMS for fts114 and fts115. Complete.
  • Remove old web-servers fts103, 104 and 108 from PT1 aliases. Complete.
    • Enable maintenance mode for these old web-services. Complete.
  • Migrate all production agents to new hardware. Complete
    • Drain all vo agents on fts105 and place them on fts113. Complete
    • Drain all channel agents on fts101, fts102, fts107(t2) and place them on fts110, fts111, fts112(t2). Complete
  • Fix Things on Production System that We Forgot to Migrate
    • Archive the transfer-url copy logs in /var/tmp/glite-url-copy-edguser/ Complete
    • Tomcat logrotate needs adding. Complete
    • FTA_WRONG alarm needs to be enabled. Complete
    • FTA_STUCK alarm needs to be enabled. Complete
    • my-proxy config needs to be done. Complete
  • Redeploy fts105 as PT2 tier2 channel agents.
  • Redeploy fts106 as PT2 tier2 vo agents. Complete
  • Redeploy with Quattor fts103 and fts104 as the new PT2 web-service. (once alias has been done). Complete
    • Enable production mode for these web-services. Complete
    • Test that the PT2 service is operational. Complete
  • Advertise the new end point for the T2 transfers. Complete
  • Close down T2 on prod.
    • Kill T2 channel agents services on fts112
    • Kill T2 VO agents services on fts113
  • Spread T1 channel agents from fts110, fts111 to include the now empty fts112.
  • Done
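
The web-server part of the migration avoids downtime by expanding the PT1 alias to include the new hosts before removing the old ones, so the alias never ends up empty. Here is a small sketch of that ordering using plain sets. The host names are those given above; the function itself is illustrative and does not talk to DNS or SMS.

```python
from typing import FrozenSet, Iterable, List


def swap_alias_members(alias: Iterable[str],
                       new_hosts: Iterable[str],
                       old_hosts: Iterable[str]) -> List[FrozenSet[str]]:
    """Return the alias membership after each stage: expand, then contract."""
    members = set(alias)
    stages = []

    # Stage 1: add the new web servers while the old ones keep serving.
    members |= set(new_hosts)
    stages.append(frozenset(members))

    # Stage 2: remove the old web servers; the alias must not end up empty.
    members -= set(old_hosts)
    if not members:
        raise RuntimeError("alias would be empty after removing old hosts")
    stages.append(frozenset(members))
    return stages


stages = swap_alias_members(
    alias={"fts103", "fts104", "fts108"},
    new_hosts={"fts114", "fts115"},
    old_hosts={"fts103", "fts104", "fts108"},
)
for stage in stages:
    print(sorted(stage))
```

Doing the two stages in the opposite order would leave the alias empty in between, which is exactly the downtime the plan is designed to avoid.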

Upgrade servers to patch 912

Rolling intervention to upgrade FTS to patch 912. Pick up new host certificates.

Date planned: Thursday 30th November 2006

GMOD announcement made 29th November.

Status: done.

Steps:

  • Add new RPMS into CDB from PPS repository
  • Generate new Quattor templates from script
  • Update Quattor templates in cdbop
  • Update RPMs on all machines (spma)
  • Take fts103 out of LB alias (set it to sms maintenance) [ wait 5 mins ]
  • Reconfig fts103 with yaim FTS - it will complain about the schema.
  • fts103: Run suggested schema patch.
  • fts103: Rerun yaim FTS (auto restart)
  • Test 103 explicitly from CLI
  • BACKOUT 1
  • Add 103 back into alias (set it to sms default), remove 104 [ wait 5 mins ]
  • Reconfig 104 with yaim FTS (auto restart)
  • Test 104 explicitly from CLI
  • Add 104 back into alias, remove 108 [wait 5 mins]
  • Reconfig 108 with yaim FTS (auto restart)
  • Test 108 from CLI
  • Add 108 back in.
  • On 101, 102, 105, 106
    • Rerun yaim for FTA, and restart FTA services
  • Check 101, 102, 103, 105, 106 are starting new jobs. Check logs for problems.
  • BACKOUT 2
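
The web-server part of the steps above is a rolling upgrade: one node at a time is taken out of the load-balanced alias, reconfigured and tested, then put back before the next is removed. The sketch below models only that ordering. `upgrade_and_test` and `settle` are hypothetical hooks standing in for the yaim reconfiguration, the CLI test and the five-minute wait; none of them correspond to a real tool interface.

```python
from typing import Callable, Iterable, List, Tuple


def rolling_upgrade(nodes: Iterable[str],
                    upgrade_and_test: Callable[[str], bool],
                    settle: Callable[[], None] = lambda: None
                    ) -> List[Tuple[str, List[str]]]:
    """Upgrade nodes one at a time, keeping the others in the alias.

    Returns, for each node, the hosts that remained in the alias while
    that node was being upgraded.
    """
    ordered = list(nodes)
    in_alias = set(ordered)
    history = []
    for node in ordered:
        in_alias.discard(node)   # set the node to SMS maintenance
        settle()                 # the page waits 5 minutes at this point
        history.append((node, sorted(in_alias)))
        if not upgrade_and_test(node):
            # Leave the failed node out of the alias and stop here.
            raise RuntimeError(f"{node} failed its upgrade or test")
        in_alias.add(node)       # set the node back to SMS default
    return history


if __name__ == "__main__":
    for node, serving in rolling_upgrade(["fts103", "fts104", "fts108"],
                                         lambda n: True):
        print(node, "upgraded while", serving, "served")
```

With three web servers, two remain in the alias at every point, which is what lets the intervention proceed without a service outage.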

BACKOUT 1:

  • Replace original templates in cdbop
  • Run spma on all nodes
  • Restart fts103 server node
  • Set fts103 to sms default
  • Schema change is new indices only. Suggest that these be kept regardless (validated on pilot already).

BACKOUT 2:

  • Back out FTA updates in cdbop. Keep FTS templates in cdbop.
  • Run spma on all nodes
  • Restart transfer and VO agents on 101, 102, 105, 106

Topic revision: r5 - 2008-09-15 - GavinMcCance
 