MetricClass Migration

This page summarizes the Lemon exceptions triggered more than 20 times in the reference period, grouped by MetricClass. Note that some alarms depend on more than one MetricClass, so the same alarm can appear under several of them.
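As an illustration of how such a summary could be put together, here is a minimal Python sketch. It is not the tooling used to produce this page. The record layout, the `summarize` function and the `example_rare_alarm` entry are assumptions made for the example; the other two alarms and their counts are taken from the HW table.

```python
from collections import defaultdict

THRESHOLD = 20  # an alarm is listed only if it fired more than this many times

# (alarm name, exception count, metric classes the alarm depends on)
# The first two records reuse values from the HW table; the last one is
# hypothetical and only shows the threshold filter at work.
alarms = [
    ("adaptec_raid_controller_not_found", 121,
     ["adaptec.controller", "adaptec.physical_drives",
      "adaptec.bbu", "adaptec.raid_arrays"]),
    ("adaptec_missing_drive", 22, ["adaptec.physical_drives"]),
    ("example_rare_alarm", 5, ["adaptec.bbu"]),  # hypothetical record
]


def summarize(records, threshold=THRESHOLD):
    """Group alarms by metric class, keeping those above the threshold."""
    by_class = defaultdict(list)
    for name, count, metric_classes in records:
        if count <= threshold:
            continue  # too rare to be listed
        # An alarm that depends on several classes is listed under each one.
        for metric_class in metric_classes:
            by_class[metric_class].append((name, count))
    # Within each class, list the most frequent alarms first.
    return {mc: sorted(rows, key=lambda row: -row[1])
            for mc, rows in by_class.items()}


summary = summarize(alarms)
for metric_class, rows in summary.items():
    for name, count in rows:
        print(metric_class, name, count)
```

In this sketch `adaptec_raid_controller_not_found` is listed under four metric classes with the same count, which matches how repeated alarms show up in the tables below.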

You will find more information in the following links:

*Note:* Many Lemon alarms were built up from experience accumulated over decades. The problems they monitored are not necessarily still occurring, so it is worth assessing the usefulness of a metric or alarm before implementing it.

The Monitoring team is currently investigating the best alternatives for the most common Lemon MetricClasses. If you want to discuss any of them or propose your own alternatives, do not hesitate to visit ~collectd in Mattermost.

OS

| *Lemon Metric Class* | *Status* | *Collectd Plugin* | *Lemon Alarm* | *#1Y* | *Responsible* | *Status* |
| system.processCount | In Progress |  | nscd_wrong | 276 | Monitoring | in progress |
|  |  |  | http_wrong | 76 | Monitoring | in progress |
|  |  |  | sendmail_wrong | 45 | Monitoring | in progress |
|  |  |  | lemonforwarder_wrong | 1873 | Monitoring | not needed |
| system.partitionInfo | DONE | DF | root_full | 8780 | Monitoring | DONE |
|  |  |  | nonwriteable_filesystems | 1564 | Monitoring | no direct metric replacement |
|  |  |  | var_full | 469 | Monitoring | DONE |
|  |  |  | nfs_full | 64 | Monitoring | should not be marked as OS |
|  |  |  | boot_full | 60 | Monitoring | DONE |
|  |  |  | tmp_full | 56 | Monitoring | DONE |
| log.Parse | DONE | Tail | YUM_error | 179065 |  |  |
|  |  |  | VM_kill | 37590 | Monitoring | DONE |
| file.filecount |  |  | YUM_Transactions | 7792 |  |  |
| system.swapIO |  |  | swap_io | 7287913 | Monitoring | in progress |
| system.unmountedFilesystems |  |  | unmounted_filesystems | 1380 | Monitoring | no direct metric replacement |
| puppetd.status |  |  | puppetd_wrong | 6607 |  |  |
|  |  |  | puppetd_run_errors | 1453845 |  |  |
| system.loadAvg | DONE | Load | high_load | 22567 | Monitoring | DONE |
| system.Os |  |  | Operating_System | 60948 | Monitoring | no direct metric replacement |
| system.swapUsed | DONE | Swap | swap_full | 6044 | Monitoring | DONE |

HW

| *Lemon Metric Class* | *Status* | *Collectd Plugin* | *Lemon Alarm* | *#1Y* | *Responsible* | *Status* |
| adaptec.controller |  |  | adaptec_unsupported_os_version | 193 |  |  |
|  |  |  | adaptec_unsupported_raid_configuration_for_os_version | 190 |  |  |
|  |  |  | adaptec_raid_controller_not_found | 121 |  |  |
| megaraidsas.controller |  |  | megaraidsas_raid_controller_not_found | 338 |  |  |
| megaraidsas.physical_drives |  |  | megaraidsas_unconfigured_bad_drive | 755 |  |  |
|  |  |  | megaraidsas_raid_controller_not_found | 338 |  |  |
|  |  |  | megaraidsas_unconfigured_good_drive | 32 |  |  |
| blockdevice-drivers.messages |  |  | scsi_blockdevice_driver_error_reported | 16216 |  |  |
|  |  |  | device_mapper_error_reported | 57 |  |  |
| adaptec.physical_drives |  |  | adaptec_raid_controller_not_found | 121 |  |  |
|  |  |  | adaptec_missing_drive | 22 |  |  |
| sasarray.stats |  |  | Sasarray_No_Enclosure_Found | 2538 |  |  |
|  |  |  | Sasarray_Wrong_Number_Drives | 907 |  |  |
|  |  |  | Sasarray_Fan_Problem | 130 |  |  |
|  |  |  | Sasarray_Psu_Problem | 68 |  |  |
| log.Parse |  |  | machine_exception | 2231 |  |  |
|  |  |  | nvm_fail | 75 |  |  |
|  |  |  | io_error | 32 |  |  |
|  |  |  | nmi_received | 28 |  |  |
| adaptec.bbu |  |  | adaptec_raid_controller_not_found | 121 |  |  |
| IPMI.sel |  |  | ipmi_power | 86741 |  |  |
|  |  |  | ipmi_mem | 6137 |  |  |
|  |  |  | ipmi_proc | 5667 |  |  |
| adaptec.raid_arrays |  |  | adaptec_raid_controller_not_found | 121 |  |  |
| megaraidsas.raid_arrays |  |  | megaraidsas_raid_controller_not_found | 338 |  |  |
|  |  |  | megaraidsas_raid_array_not_optimal | 29 |  |  |
| megaraidsas.bbu |  |  | megaraidsas_raid_controller_not_found | 338 |  |  |
| IPMI.avgrmscnt |  |  | ipmi_wrong | 19880 |  |  |
| smart.failing |  |  | smart_failing | 50 |  |  |
| system.partitionInfo |  |  | dma_disabled | 21 |  |  |
| bonding.status |  |  | bonding_wrong | 159 |  |  |
| system.loadAvg |  |  | ipmi_wrong | 19880 |  |  |
| smart.selftest |  |  | smart_selftest | 361 |  |  |
| system.CPUCount |  |  | ipmi_wrong | 19880 |  |  |
| IPMI.ping |  |  | ipmi_no_contact | 609 |  |  |

App

| *Lemon Metric Class* | *Status* | *Collectd Plugin* | *Lemon Alarm* | *#1Y* | *Responsible* | *Status* |
| system.processCount | In Progress |  | limd_wrong | 1725 |  |  |
|  |  |  | etcd_wrong | 1293 |  |  |
|  |  |  | openstack-nova-compute | 991 |  |  |
|  |  |  | afsd_wrong | 822 |  |  |
|  |  |  | origin_node_wrong | 793 |  |  |
|  |  |  | kafka_broker_wrong | 381 |  |  |
|  |  |  | eos_mgm_wrong | 314 |  |  |
|  |  |  | slurmd | 260 |  |  |
|  |  |  | openstack-nova-conductor | 247 |  |  |
|  |  |  | openstack-nova-scheduler | 179 |  |  |
|  |  |  | openstack-nova-api | 176 |  |  |
|  |  |  | openstack-nova-network | 123 |  |  |
|  |  |  | eos_fst_wrong | 120 |  |  |
|  |  |  | sssd_wrong | 115 |  |  |
|  |  |  | tapeserverd_wrong | 112 |  |  |
|  |  |  | c2_xrd_wrong | 103 |  |  |
|  |  |  | puppetdb_wrong | 89 |  |  |
|  |  |  | rabbitmq-server | 84 |  |  |
|  |  |  | master_slave_service_wrong | 82 |  |  |
|  |  |  | dashboard_consumer_wrong | 76 |  |  |
|  |  |  | rmcd_wrong | 70 |  |  |
|  |  |  | kibana_wrong | 57 |  |  |
|  |  |  | eos_gridftpd_wrong | 54 |  |  |
|  |  |  | sbatchd_wrong | 25 |  |  |
|  |  |  | eos_mq_wrong | 22 |  |  |
|  |  |  | fts_server_wrong | 21 |  |  |
| log.Parse | DONE | Tail | TapeDriveDOWN | 4508 |  |  |
|  |  |  | hdfssink_priviledged_action_exception | 4037 |  |  |
|  |  |  | CVMFSProbe | 3336 | Luis / Steve | In Progress |
|  |  |  | teeproxy_error | 1896 |  |  |
|  |  |  | LoadBalancingUpdateFailed | 1566 |  |  |
|  |  |  | sensor_sample | 689 |  |  |
|  |  |  | CASTOR_OraErrors | 214 |  |  |
|  |  |  | cmsweb_reqmgr2_is_not_responding | 169 |  |  |
|  |  |  | cmsweb_das_web_is_not_responding | 159 |  |  |
|  |  |  | cmsweb_dbs_is_down | 157 |  |  |
|  |  |  | cmsweb_crabserver_is_down | 152 |  |  |
|  |  |  | cmsweb_couchdb_is_down | 141 |  |  |
|  |  |  | cmsweb_dbsmigration_is_down | 137 |  |  |
|  |  |  | cmsweb_dmwmmon_is_down | 137 |  |  |
|  |  |  | cmsweb_phedex_web_is_down | 136 |  |  |
|  |  |  | cmsweb_phedex_datasvc_is_down | 136 |  |  |
|  |  |  | cmsweb_reqmgr2_is_down | 135 |  |  |
|  |  |  | cmsweb_phedex_graphs_is_down | 133 |  |  |
|  |  |  | cmsweb_sitedb_is_down | 132 |  |  |
|  |  |  | cmsweb_t0wmadatasvc_is_down | 132 |  |  |
|  |  |  | cmsweb_reqmon_is_down | 132 |  |  |
|  |  |  | cmsweb_confdb_is_down | 125 |  |  |
|  |  |  | cmsweb_crabcache_is_down | 124 |  |  |
|  |  |  | cmsweb_das_web_is_down | 122 |  |  |
|  |  |  | cmsweb_mongodb_is_down | 119 |  |  |
|  |  |  | cmsweb_popdb_web_is_down | 105 |  |  |
|  |  |  | cmsweb_phedex_webapp_is_down | 104 |  |  |
|  |  |  | cmsweb_t0reqmon_is_down | 104 |  |  |
|  |  |  | cmsweb_victor_web_is_down | 104 |  |  |
|  |  |  | dsm_error | 91 |  |  |
|  |  |  | cmsweb_das_client_is_down | 85 |  |  |
|  |  |  | cmsweb_crabserver_is_not_responding | 75 |  |  |
|  |  |  | cmsweb_phedex_graphs_is_not_responding | 72 |  |  |
|  |  |  | database_on_demand_dbod_sensor_exceed_restartmax | 56 |  |  |
|  |  |  | cmsweb_dqm_dev_agents_is_down | 51 |  |  |
|  |  |  | cmsweb_phedex_web_is_not_responding | 49 |  |  |
|  |  |  | cmsweb_dqm_dev_web_is_down | 48 |  |  |
|  |  |  | cmsweb_phedex_datasvc_is_not_responding | 43 |  |  |
|  |  |  | cmsweb_phedex_webapp_is_not_responding | 42 |  |  |
|  |  |  | cmsweb_couchdb_is_not_responding | 36 |  |  |
|  |  |  | cmsweb_victor_web_is_not_responding | 31 |  |  |
|  |  |  | riversink_peerdisconnected | 25 |  |  |
|  |  |  | cmsweb_sitedb_is_not_responding | 22 |  |  |
| xsls.availability |  |  | castor_alice_xsls_not_available | 88 |  |  |
|  |  |  | XRDFED_CMS-EU | 81 |  |  |
|  |  |  | cfg_unavailable | 47 |  |  |
| flume_agent.flumefch |  |  | flume_zero_sink_rate | 146274 |  |  |
|  |  |  | flume_channel_full | 70247 |  |  |
|  |  |  | flume_zero_sink_rate_in_hdfssink | 203 |  |  |
|  |  |  | flume_zero_sink_rate_in_essink | 118 |  |  |
|  |  |  | flume_zero_sink_rate_in_gw | 118 |  |  |
|  |  |  | flume_cert_es_channel_full | 63 |  |  |
|  |  |  | flume_cert_es_zero_sink_rate | 39 |  |  |
|  |  |  | flume_cert_hdfs_channel_full | 22 |  |  |
| system.partitionInfo | DONE | DF | data_full | 81 |  |  |
|  |  |  | pool_full | 57 | Batch Service | In Progress |
|  |  |  | opt_partition_full_err | 32 |  |  |
|  |  |  | cvmfs_transaction_full | 30 |  |  |
| flume_agent.flumesinkrate |  |  | flume_zero_sink_rate | 146274 |  |  |
|  |  |  | flume_zero_sink_rate_in_hdfssink | 203 |  |  |
|  |  |  | flume_zero_sink_rate_in_essink | 118 |  |  |
|  |  |  | flume_zero_sink_rate_in_gw | 118 |  |  |
|  |  |  | flume_cert_es_zero_sink_rate | 39 |  |  |
| ProcessInfo |  |  | diskmanagerd_wrong | 67 |  |  |
| file.sslmtime |  |  | racmon_log_age | 45 |  |  |
|  |  |  | iss_nologin_age_too_old | 28 |  |  |
| system.exitCode |  |  | eos-server_version_check_fail | 1841 |  |  |
|  |  |  | dashb_http_log | 1355 |  |  |
|  |  |  | mesos_slave_wrong | 185 |  |  |
| file.filecount |  |  | flume_dirq_full | 2629 |  |  |
|  |  |  | eos_mgm_no_recent_mdlog | 34 |  |  |
|  |  |  | kernel_crashdumps | 109 |  |  |
| flume_agent.agent |  |  | flume_agent_wrong | 4682 |  |  |
| kafka.broker |  |  | kafka_under_replicated | 3096 |  |  |
|  |  |  | kafka_no_messages | 606 |  |  |
| alarm.exception |  |  | CVMFSProbe | 3336 |  |  |
| dbod.monitoringAgg |  |  | database_on_demand_ping_timestamp | 1580 |  |  |
| db.iptables |  |  | iptables_not_running | 321 |  |  |
| log.ParseExtract1 |  |  | EOS_critical-log-catchall_mail | 826 |  |  |
| system.threadCount |  |  | eos_fst_toomanythreads | 35 |  |  |
| url.httpcode |  |  | url_down_apache | 167 |  |  |
| file.size |  |  | es_huge_logfile | 37 |  |  |
| dbod.slavePingAgg |  |  | database_on_demand_replication_process | 210 |  |  |
| rabbitmq.message | In Progress |  | rabbitmq-server-messages | 209 | Cloud/Luis Pigueiras | In Progress |
| oracle.StandByFlashRecoveryAreaSpaceReclaimableAgg |  |  | oracle_standby_flash_recovery_area_space_reclaimable_agg | 315 |  |  |
| WhiteExpire |  |  | CVMFSWhiteExpire | 72 |  |  |
| oracle.TablespacesQuotasAgg |  |  | oracle_ts_quotas_agg | 58 |  |  |
| es.health |  |  | es_cluster_wrong | 46 |  |  |
| log.ParseExtract4 |  |  | afs_fileserver_rescheduled_debug | 132 |  |  |
| oracle.StandByMRPAgg |  |  | oracle_standby_mrp_agg | 72 |  |  |
| rabbitmq.partition | In Progress |  | rabbitmq-server-partition | 22 | Cloud/Luis Pigueiras | In Progress |
| es.heap_used |  |  | es_heap_size_large | 23 |  |  |
| openshift.etcd_members_healthy |  |  | etcd_not_enough_members | 909 |  |  |
| rpm_process_count.all |  |  | rpm_stuck | 1078 |  |  |
| log.ParseExtract2 |  |  | too_many_SELinux_AVC_denied | 4367 |  |  |
| infiniband.ports |  |  | infiniband_port | 291 |  |  |
| oracle.TNSServiceConnectivityAgg |  |  | oracle_tns_service_connectivity_agg | 73 |  |  |
| service-state.status |  |  | eosd_service_error | 1132 |  |  |
| openshift.node_ready |  |  | openshift_node_not_ready | 330 |  |  |
| oracle.SQLResponseTimeAgg |  |  | oracle_sql_response_time_agg | 670 |  |  |
| oracle.ClusterResourceStateAgg |  |  | oracle_cluster_resource_state_agg | 135 |  |  |
| openshift.etcd_cluster_healthy |  |  | etcd_not_healthy | 234 |  |  |
| oracle.AverageActiveSessionsAgg |  |  | oracle_average_active_sessions_agg | 511 |  |  |
| oracle.StandByApplyLagAgg |  |  | oracle_standby_apply_lag_agg | 108 |  |  |
| eosdisk.fsck |  |  | eos_offline_files | 130 |  |  |
| CVMFS.Probe | In Progress | collectd-cvmfs | CVMFSProbe | 3336 | Luis / Steve | In Progress |
| yumstatus.all |  |  | yum_broken | 1588 |  |  |
| db.ip6tables |  |  | ip6tables_not_running | 202 |  |  |
| oracle.InstanceTablespacesAgg |  |  | oracle_tablespaces_agg | 100 |  |  |
| drain.all | In Progress |  | condor_upgrade | 7697 | Batch Service | In Progress |
| oracle.DatafileStatusAgg |  |  | oracle_datafile_status_agg | 23 |  |  |
| system.networkInterfaceDropped |  |  | packetsDropped | 19331 |  |  |
| puppetd.status |  |  | puppetd_run_errors | 1453845 |  |  |
| oracle.PGAMemoryAbove3GBAgg |  |  | oracle_PGA_memory_above_3gb_agg | 365 |  |  |
| es.nodes_process |  |  | es_nodes_process_ok | 283 |  |  |
| zfs.zrep_age |  |  | zrep_age | 121 |  |  |
| system.swapIO |  |  | dbod_swap_io | 7160 |  |  |
| rpmdb.verify_db |  |  | rpmdb_verify | 88 |  |  |
| sssdfunc.id |  |  | sssd_id_test | 552 |  |  |
| zfs.zpool |  |  | zpool | 94 |  |  |
Topic revision: r11 - 2018-05-28 - LuisFernandezAlvarez1
 