Completed FTS interventions at CERN.
For a list of upcoming or ongoing interventions, see FtsTier0ServerInterventions.
| Date | Actioned By | FTS Instance | Details |
| 2008-01-28 | GavinMcCance | prod, tiertwo, pilot | Updated to patch 1589. |
| 2007-10-31 | SteveTraylen | prod, tiertwo, pilot | 9 host certificates updated. |
| 2007-10-30 | SteveTraylen | tiertwo | CERN-KI and KI-CERN GUC_TIMEOUT increased to 5200 seconds. |
| 2007-10-30 | SteveTraylen | prod, tiertwo | Update of services.xml file. 439-454 srms |
| 2007-10-30 | SteveTraylen | tiertwo | Addition of CERN-KI and CERN-SINP and their inverses. |
| 2007-10-11 | SteveTraylen | prod | Removed httpg://cmssrm.fnal.gov:8443/srm/managerv2 from services.xml. Because of a bug, this must be maintained by hand until the bug is fixed. |
| 2007-10-10 | SteveTraylen | prod, tiertwo, pilot | (FTA_TYPEDEFAULT_SRMCOPY_GUC_MAXTRANSFERS, FTA_TYPEDEFAULT_URLCOPY_GUC_MAXTRANSFERS) increased from (40,100) to (300,300). Should help CERN-FNAL. |
| 2007-10-08 | SteveTraylen | tiertwo | Addition of CERN-JINR, CERN-PROTOVINO and opposites to tiertwo. |
| 2007-09-17 | SteveTraylen | tiertwo and prod | FtsTier0ServerInterventionPlanPatch1232 completed. |
Redistribution of Prod FTS Agents
Objective:
- prod-fts-ws.cern.ch transfer agents are overloaded.
- Will migrate CERN-BNL, CERN-IN2P3, BNL-CERN, SARA-CERN, CERN-RAL, PIC-CERN, CERN-TRIUMF, CERN-ASCC from fts110 and fts111 to fts112
Service: CERN Production FTS export service - prod-fts-ws.cern.ch
StartDate: 08:00 UTC (10:00 CEST), July 20th 2007
Duration: one hour
Impact: Service Degradation
From 10:00 CEST on Friday July 20th the following FTS channels will
be paused while they are transferred to new hardware. During this
time the FTS will continue to accept new jobs and will queue them
for execution after the migration.
CERN-BNL, CERN-IN2P3, BNL-CERN, SARA-CERN
CERN-RAL, PIC-CERN, CERN-TRIUMF and CERN-ASCC
The migration is expected to last one hour. No further broadcast will be sent
upon successful completion of the migration within the hour.
- Set nodes to SMS maintenance status.
- Mark relevant channels as inactive.
- Wait for existing transfers to complete.
- Stop relevant channel agents.
- Reconfigure with CDB.
- Reconfigure channel agents.
- Start migrated channel agents.
- Set nodes to SMS production status.
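The reassignment in this intervention can be pictured as a change to a channel-to-host map. The sketch below is illustrative only: the log does not record which of the eight channels started on fts110 and which on fts111, so the starting split is an assumption, and `move_channels` is a hypothetical helper rather than part of any FTS tool.

```python
# Channels named in the announcement above.
MIGRATED = [
    "CERN-BNL", "CERN-IN2P3", "BNL-CERN", "SARA-CERN",
    "CERN-RAL", "PIC-CERN", "CERN-TRIUMF", "CERN-ASCC",
]


def move_channels(assignment, channels, target):
    """Return a new channel->host map with `channels` placed on `target`.

    Raises KeyError if a channel is missing from the current assignment,
    so a mistyped channel name is caught before anything is changed.
    """
    updated = dict(assignment)
    for channel in channels:
        if channel not in updated:
            raise KeyError(f"unknown channel: {channel}")
        updated[channel] = target
    return updated


# ASSUMPTION: the starting split between fts110 and fts111 is invented
# for the example; only the destination (fts112) comes from the log.
before = {ch: ("fts110" if i % 2 == 0 else "fts111")
          for i, ch in enumerate(MIGRATED)}
after = move_channels(before, MIGRATED, "fts112")

for channel in MIGRATED:
    print(f"{channel}: {before[channel]} -> {after[channel]}")
```

Returning a new map instead of editing the old one keeps the "before" state available, which is what a backout would need.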
Upgrade of production tier-0 export to FTS 2.0
Scope:
- The production T1 export service and production T2<->T1 service
- The tier-2 production service will not be upgraded at this point.
- The pilot service is already running FTS 2.0.
This has been sent as a broadcast to the CERN MOD for the CERN IT board and is also entered in the GOCDB.
Services: prod-fts-ws.cern.ch - CERN T0 export FTS service
tiertwo-fts-ws.cern.ch - CERN T0<->T2 FTS service
Duration: Monday June 18th 09:00 CEST (07:00 UTC) -> 12:30 CEST (10:30 UTC)
Impact: The services will be unavailable for VOs ALICE, ATLAS, CMS, LHCB, DTEAM and OPS.
The CERN T0 export FTS, prod-fts-ws.cern.ch, and the CERN T0<->T2 service, tiertwo-fts-ws.cern.ch,
are to be upgraded to FTS v2.0 on Monday June 18th.
It is anticipated that service should be restored by 12:30 CEST. Any delay in
this will result in another announcement.
During this time both services will be completely unavailable.
The pilot service pilot-fts-ws.cern.ch may also be unavailable at this time during the upgrade.
For questions: fts-support@cern.ch
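As a cross-check of the times quoted in the announcement, the following minimal sketch converts the CEST window to UTC and computes its length. It assumes CEST is a fixed UTC+2 offset (true for June 2007) and takes the year 2007 from the FTS 2.0 upgrade context; a fixed offset is used so that no time zone database is needed.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; modelled as a fixed offset for this example.
CEST = timezone(timedelta(hours=2), "CEST")

start = datetime(2007, 6, 18, 9, 0, tzinfo=CEST)
end = datetime(2007, 6, 18, 12, 30, tzinfo=CEST)

start_utc = start.astimezone(timezone.utc)
end_utc = end.astimezone(timezone.utc)
duration = end - start

print(start_utc.strftime("%H:%M UTC"), "->", end_utc.strftime("%H:%M UTC"))
print("duration:", duration)
```

This gives 07:00 UTC to 10:30 UTC, a window of three and a half hours, matching the announcement.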
Scope:
- Production tier-0 export service
- Production tier-2 service
Preparation steps:
- Verify that the FTA agent actuator is disabled when the nodes are in maintenance. VERIFIED
- Only two CDB templates need updating: pro_system_gridfts and pro_type_gridfts_slc3. These are now in ~straylen/fts-upgrade and have been validated at CDB level.
- The primary schema upgrade script is in the transfer-fts FTS 2.0 RPM: /opt/glite/etc/glite-data-transfer-fts/schema/oracle/oracle-upgrade_2.2.1-3.0.0.sql
- The history schema upgrade script is in /afs/cern.ch/user/m/mccance/public/fts20-upgrade-intervention/fts_history_tables-upgrade_2.2.1-3.0.0.sql
Migration steps:
- Switch all channels to Inactive. DONE
- Go to coffee while they drain currently running transfers. DONE
- Put all production nodes in maintenance. DONE
- There are three DBMS user jobs running: stop them (SQL*Plus on lcg_fts_prod):
  - exec fts_stats.stop_hourly_job; DONE
  - exec fts_history.stop_job; DONE
  - exec fts_statecount.stop_job; DONE
- Verify that select * from user_jobs; returns no rows. DONE
- Stop the web-services (fts101, fts114, fts115). DONE
- Stop the agent daemons (fts110, fts111, fts112, fts113). DONE
- Stop the multitude of little scripts running on the FTS monitoring node (fts102). DONE Move to /cron.d/
- Ask DB team (contact Miguel Anjo) to copy the partial schema to the backup account. This should take around 20 minutes. DONE
- ... [upgrade software] DONE
- ... [upgrade CDB yaim configuration for FTS2.0]. Backup the old one. DONE
- BACKOUT 1
- Upgrade the main schema (this should take around 2 minutes) DONE
- Upgrade the history schema (this should take around 20 minutes) DONE
- Load the delegation schema (YAIM will insist anyway). DONE
- Run the writer account script to build new synonyms and make the appropriate grants: FtsServer20WriterAccount. DONE
- BACKOUT 2
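The "returns no rows" check on user_jobs in the steps above is easy to automate. The sketch below is a hedged illustration, not the procedure that was actually used: `remaining_jobs` is a hypothetical helper that accepts any Python DB-API cursor, and an in-memory sqlite3 table with made-up contents stands in for the Oracle user_jobs view so that the example is self-contained.

```python
import sqlite3


def remaining_jobs(cursor):
    """Return the rows still in user_jobs; an empty list means it is safe to go on."""
    cursor.execute("select * from user_jobs")
    return cursor.fetchall()


# Stand-in for the real schema: sqlite3 instead of Oracle, invented row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table user_jobs (job integer, what text)")
cur.execute("insert into user_jobs values (1, 'hourly statistics job')")

print(remaining_jobs(cur))            # one job still scheduled

cur.execute("delete from user_jobs")  # stands in for the three stop calls
print(remaining_jobs(cur))            # nothing left
```

Against the production database the same function would be handed a cursor from an Oracle driver connected to lcg_fts_prod; that connection code is deliberately left out here.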
Cleanup:
- Restart the web-services (fts114, fts115). DONE
- Test a few commands. DONE
- Restart the agent daemons (fts110, fts111, fts112, fts113). DONE
- Restart the monitoring scripts on fts102.
- Re-enable jobs:
  - exec fts_history.submit_job; DONE
  - exec fts_stats.submit_job; DONE
  - exec fts_statecount.submit_job; DONE
- Apply the "Start the DBMS job" procedure from FtsAdminTools15 for both of these jobs, to start them off. DONE
Test:
- Check transfers are running on agent nodes. DONE
- BACKOUT 3.
- Announce service is back. DONE
BACKOUT 1 - "the software install went horribly wrong"
- Revert the CDB templates from backup
- Put back old RPMS and re-run ncm-yaim
- Go to "Cleanup".
BACKOUT 2 - "the schema upgrade went horribly wrong"
- Contact Miguel Anjo. Revert partial schema from backup account.
BACKOUT 3 - it doesn't work.
- try to fix it
- Stop all daemons as before.
- Apply BACKOUT 2 to revert schema.
- Apply BACKOUT 1 to revert configuration.
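Because BACKOUT 3 refers to BACKOUT 2 and BACKOUT 1, the full sequence of actions for the worst case is only visible once those references are expanded. The following minimal sketch does that expansion; the step texts are taken from the plans above, while the `BACKOUTS` table and the `expand` function are illustrative names introduced here.

```python
BACKOUTS = {
    "BACKOUT 1": [
        "Revert the CDB templates from backup",
        "Put back old RPMS and re-run ncm-yaim",
        "Go to Cleanup",
    ],
    "BACKOUT 2": [
        "Contact Miguel Anjo and revert partial schema from backup account",
    ],
    "BACKOUT 3": [
        "Try to fix it",
        "Stop all daemons as before",
        "BACKOUT 2",   # reference to another plan
        "BACKOUT 1",   # reference to another plan
    ],
}


def expand(name, plans=BACKOUTS):
    """Flatten a backout plan, replacing references to other plans by their steps."""
    steps = []
    for step in plans[name]:
        if step in plans:
            steps.extend(expand(step, plans))
        else:
            steps.append(step)
    return steps


for number, step in enumerate(expand("BACKOUT 3"), start=1):
    print(f"{number}. {step}")
```

The expanded list makes the ordering explicit: the schema is reverted before the configuration, and the procedure ends by rejoining the Cleanup section.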
Fallout: Now that the upgrade is complete, some things noticed during the upgrade need tidying up.
- Finish disabling STAR and T2 channels on prod service. DONE
- ncm-yaim component needs to support FTS2 and FTA2 target. DONE
- tiertwo service needs to have log archiving enabled. DONE
- Test reboot and reinstall of pilot service. DONE but shutdown of web service needs doing
- Online rebuild of idx_report_file index. ONLINE index rebuild affects performance really badly. DONE
- Still to restart monitoring daemons on fts102 - do after index build is complete.
- Understand fragmentation of tables. ONGOING
- Switch of R-GMA gin again. DONE
Some issues have been noticed on the new FTS 2.0 service. These are tracked in
Fts20Tier0ServiceIssues.
Deployment on new hardware and split export service from tier-2 service
Current situation:
- fts101 - Channel agent for CERN-FNAL, ASCC-CERN, BNL-CERN, CERN-ASCC, CERN-BNL, CERN-DESY, CERN-INFN, CERN-PIC, CERN-TRIUMF, DESY-CERN, FNAL-CERN, INFN-CERN and TRIUMF-CERN.
- fts102 - Channel agent for CERN-CERN, CERN-GRIDKA, CERN-IN2P3, CERN-NDGF, CERN-RAL, CERN-SARA, GRIDKA-CERN, IN2P3-CERN, NDGF-CERN, RAL-CERN and SARA-CERN.
- fts105 - VO Agents
- fts107 - Channel agents for T2->T0 transfers.
- fts103, fts104, fts108 - Webservice
The migration will achieve:
- Production tier1 (PT1) FTS service for T0->T1. prod-fts-ws.cern.ch
- There will be no downtime to the production tier1 service.
- The production tier1 FTS service will be managed by Quattor, which is currently not the case.
- The production tier1 service will be running on new hardware
- The T2->T0 service will no longer be part of the production tier1 service and will have its own service.
- Production tier2 (PT2) FTS service for T2->T0. prod-t2-fts-ws.cern.ch
- There will be a new FTS endpoint for this service. prod-t2-fts-ws.cern.ch
- This will no longer be part of the tier1 service.
- The service will be managed by quattor.
- There will be a completely new database account for this service.
Preparation Steps:
- Quattor deploy with SMS maintenance mode switched on.
- New PT1 FTS web-servers, fts114, 115.
- Verify that FTS submission works using these canonical host names.
- Verify that firewall settings are correct for these.
- Verify that resource BDIIs are populated.
- New PT1 VO agents on fts113 with incorrect DB password.
- New PT1 Channel Agents on fts110, 111 and 112 with incorrect DB password. (fts110 should at first run the T2 channels)
- Follow the procedures in DnsAliases to create a load-balanced alias prod-t2-fts-ws.cern.ch for the PT2 service.
- Request new database account for the PT2 service.
Migration Steps:
- Expand PT1 aliases to include new web services. Complete.
- Enable production mode within SMS for fts114 and fts115. Complete.
- Remove old web-servers fts103, 104 and 108 from PT1 aliases. Complete.
- Enable maintenance mode for these old web-services. Complete.
- Migrate all production agents to new hardware. Complete
- Drain all vo agents on fts105 and place them on fts113. Complete
- Drain all channel agents on fts101, fts102, fts107(t2) and place them on fts110, fts111, fts112(t2). Complete
- Fix Things on Production System that We Forgot to Migrate
- Archive the transfer-url copy logs in /var/tmp/glite-url-copy-edguser/ Complete
- Tomcat logrotate needs adding. Complete
- FTA_WRONG alarm needs to be enabled. Complete
- FTA_STUCK alarm needs to be enabled. Complete
- my-proxy config needs to be done. Complete
- Redeploy fts105 as PT2 tier2 channel agents.
- Redeploy fts106 as PT2 tier2 vo agents. Complete
- Redeploy with Quattor fts103 and fts104 as the new PT2 web-service. (once alias has been done). Complete
- Enable production mode for these web-services. Complete
- Test that the PT2 service is operational. Complete
- Advertise the new end point for the T2 transfers. Complete
- Close down T2 on prod.
- Kill T2 channel agents services on fts112
- Kill T2 VO agents services on fts113
- Spread T1 channel agents from fts110, fts111 to include the now empty fts112.
- Done
Upgrade servers to patch 912
Rolling intervention to upgrade FTS to patch 912. Pick up new host certificates.
Date planned: Thursday 30th November 2006
GMOD announcement made 29th November.
Status: done.
Steps:
- Add new RPMS into CDB from PPS repository
- Generate new Quattor templates from script
- Update Quattor templates in cdbop
- Update RPMs on all machines (spma)
- Take fts103 out of LB alias (set it to sms maintenance) [ wait 5 mins ]
- Reconfig fts103 with yaim FTS - it will complain about schema.
- fts103: Run suggested schema patch.
- fts103: Rerun yaim FTS (auto restart)
- Test 103 explicitly from CLI
- BACKOUT 1
- Add 103 back into alias (set it to sms default), remove 104 [ wait 5 mins ]
- Reconfig 104 with yaim FTS (auto restart)
- Test 104 explicitly from CLI
- Add 104 back into alias, remove 108 [wait 5 mins]
- Reconfig 108 with yaim FTS (auto restart)
- Test 108 from CLI
- Add 108 back in.
- On 101, 102, 105, 106
- Rerun yaim for FTA, and restart FTA services
- Check 101, 102, 103, 105, 106 are starting new jobs. Check logs for problems.
- BACKOUT 2
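The web-service part of the steps above follows a rolling pattern: one host at a time leaves the load-balanced alias, is reconfigured and tested, and rejoins before the next host is touched. The sketch below models that pattern under stated assumptions: `rolling_upgrade`, `reconfigure` and `test` are hypothetical names, the callables are placeholders for the yaim run and the CLI test, and the five-minute waits for the alias to update are omitted.

```python
def rolling_upgrade(alias, order, reconfigure, test):
    """Upgrade hosts one at a time, keeping the others in the alias.

    alias       -- set of hosts currently behind the load-balanced alias
    order       -- hosts to upgrade, in sequence
    reconfigure -- callable(host) applying the new configuration
    test        -- callable(host) -> bool, explicit check against that host
    Returns the hosts upgraded successfully. Stops at the first failed
    test and leaves the failed host out of the alias.
    """
    done = []
    for host in order:
        alias.discard(host)      # take the host out of the alias
        reconfigure(host)
        if not test(host):
            break                # operator decides whether to back out
        alias.add(host)          # back in before touching the next host
        done.append(host)
    return done


alias = {"fts103", "fts104", "fts108"}
touched = []
upgraded = rolling_upgrade(
    alias,
    ["fts103", "fts104", "fts108"],
    reconfigure=touched.append,   # placeholder: just records the host
    test=lambda host: True,       # placeholder: always passes
)
print("upgraded:", upgraded)
print("alias:", sorted(alias))
```

With three hosts, at least two remain behind the alias at every point, which is what lets the intervention proceed without a service outage.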
BACKOUT 1:
- Replace original templates in cdbop
- Run spma on all nodes
- Restart fts103 server node
- Set fts103 to sms default
- Schema change is new indices only. Suggest that these should be kept regardless (validated on pilot already).
BACKOUT 2:
- Back out FTA updates in cdbop. Keep FTS templates in cdbop.
- Run spma on all nodes
- Restart transfer and VO agents on 101, 102, 105, 106