Disable puppet agent

This way the upgrade does not happen on a puppet run, and you can do it manually on each node.

This works but affects both production and dev:

[aiadm]$ wassh -l root -c dashboard/elasticsearch 'puppet agent --disable' 

You will need a list of hosts to upgrade them one by one anyway, so get the list from Foreman and run

[aiadm]$ wassh -l root 'node1 node2 ... nodeN' 'puppet agent --disable' 

Configure the upgrade in puppet:

Use repository hg_dashboard. Branch master for production, qa for the development cluster.

Change ES version in hiera

Set elasticsearch::version in data/hostgroup/dashboard/elasticsearch.yaml
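
For example, the change in hiera could look like this (a sketch only; the actual target version depends on the upgrade):

# data/hostgroup/dashboard/elasticsearch.yaml
elasticsearch::version: '1.4.4'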

On a minor (1.X -> 1.Y) version upgrade, also change the yum repo in code/manifests/elasticsearch/base.pp:

-    osrepos::ai121yumrepo { 'elasticsearch-1.3':
-        descr    => 'Elasticsearch repository for 1.3.x packages',
-        baseurl  => 'http://linuxsoft.cern.ch/elasticsearch/elasticsearch-x86_64/RPMS.es-13-el6/',
+    osrepos::ai121yumrepo { 'elasticsearch-1.4':
+        descr    => 'Elasticsearch repository for 1.4.x packages',
+        baseurl  => 'http://linuxsoft.cern.ch/elasticsearch/elasticsearch-x86_64/RPMS.es-14-el6/',

Do a rolling upgrade-restart

ssh root@hostname

If it's a production search node, remove it from the DNS alias

touch /etc/nologin

Wait (5-10 minutes) until nslookup dashb-es no longer lists this host's IP
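
One way to watch for this from the shell (a sketch; substitute this host's actual IP for the placeholder):

watch -n 60 'nslookup dashb-es | grep -the-host-ip-here-'   # no output once the host has dropped out of the alias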

We have only one search node in dev, so doing this there is not useful, but it should work the same way (unless the dashb-es-dev DNS alias is not load-balanced; is http://sls.cern.ch/sls/service.php?id=DNSLOADBALANCING still supported? I don't see it in the list).

If it's a data node, migrate the data away

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_name":"-node-hostname-here-"}}}}}}'

Wait (30-60 min) until the data migrates away. Watch progress at https://dashb-es.cern.ch/_plugin/whatson or https://dashb-es.cern.ch/_plugin/head
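
Or check from the command line (a sketch; relocating_shards should drop back to 0 and the excluded node should end up holding no shards):

curl localhost:9200/_cluster/health?pretty      # watch relocating_shards
curl localhost:9200/_cat/allocation?v           # shard count per data node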

Without this step there is a moment of red cluster health when the node is stopped, as shards that had their primary on this node get a new primary promoted from one of their replicas, and then a long period of yellow health while new replicas are created to replace the data that was on the node.

service elasticsearch stop

I did not notice problems when upgrading a running ES node and then restarting it, but just in case, it does not hurt to stop the service before upgrading.

puppet agent --enable && puppet agent -t

Puppet upgrades the package and starts the service

Wait for the node to start and join the cluster

Watch /var/log/elasticsearch/clustername.log (or /var/elasticsearch/log/clustername.log on the data nodes)

curl localhost:9200
curl localhost:9200/_cluster/health?pretty
curl localhost:9200/_nodes/hostname/info?pretty

Or check https://dashb-es.cern.ch/_plugin/head

Check the node version

curl localhost:9200/_nodes/hostname/info?pretty
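
To pull out just the version field from that output (a sketch):

curl -s localhost:9200/_nodes/hostname/info?pretty | grep '"version"'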

If it's a search node, add it back to the DNS alias

rm /etc/nologin

Wait (5-10 minutes) until nslookup dashb-es shows this host's IP again

If it's a data node, enable storing data on it again

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_name":""}}}}}}'
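
To verify the exclude is really gone (a sketch):

curl localhost:9200/_cluster/settings?pretty    # the transient exclude._name setting should be empty now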

If you did not migrate the data away before restarting, wait (~30 minutes) for new replicas to be created and the cluster status to turn green again.

curl localhost:9200/_cluster/health?pretty

Or check https://dashb-es.cern.ch/_plugin/head or https://dashb-es.cern.ch/_plugin/whatson

Notes

On the previous update I skipped migrating data away before restarting the data node, because I think it was noticeably slower than just restarting the node and waiting for new replicas. I think there is a way to configure more simultaneous shard migrations; if migration is too slow, we should look into it.
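
For reference, raising the recovery limits along these lines should speed migration up (untested here; check the exact setting names against the docs for our ES version):

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_concurrent_recoveries":4,"indices.recovery.max_bytes_per_sec":"100mb"}}'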

It's possible to migrate data away from several nodes at a time and upgrade them at the same time, but I don't think it will lead to a shorter upgrade time.

By default indexes have 2 replicas, so even without migrating data away it should be safe to restart 2 data nodes from a green cluster state, but it does not actually save time as recreating replicas after a double restart takes double the time. It's better to restart one data node at a time.
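
The replica count per index can be double-checked before relying on this (a sketch):

curl -s localhost:9200/_settings?pretty | grep number_of_replicas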

The MIG data writing service does not mind if we restart the search node without updating the DNS alias first (it retries and gets the other IP from the alias). But it's more seamless for everyone to use /etc/nologin, and compared to moving data around it takes very little time.

We have 3 master nodes. Make sure to never restart 2 masters at the same time, as 2 need to be running for safe master election.
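
Before restarting a master it does not hurt to confirm that all three master-eligible nodes are in the cluster (a sketch; in the output the master column shows * for the current master and m for the other eligible masters):

curl localhost:9200/_cat/nodes?v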

It's probably best not to restart masters while the cluster is yellow, but search nodes can be upgraded while waiting for the data nodes to move data.
