Identify dashboards and monitoring tools that can be used to resolve cluster issues

Advanced Pages

Page address: [CVM IP or DNS name]:[Port]

  • 2009: Stargate page to monitor backend storage
    • Check QoS Queue and OpLog QoS queue (admitted/outstanding I/Os)
    • Cache hit rates should be 80-90%+ for best possible performance if the workload is read-heavy
    • Check Avg Latency, Avg Op Size, Avg. Outstanding
  • 2009/latency: Stargate page backend latency
  • 2009/vdisk_stats: Stargate page histograms of I/O, latency, writes
    • If high read latency, check read source for vDisk and look where I/Os are served from (more than likely reads coming from HDD)
    • Random/smaller I/O (<64K) is written directly to the OpLog. Larger/sequential I/O bypasses the OpLog and is written to the extent store (Estore)
    • Ops and Randomness section shows if IO is random or sequential
    • Working Set Size shows last 2 minutes and 1 hour
    • Read Source shows where I/Os are being served from
    • Write Destination shows where I/Os are going
    • Extent Group Up-Migration shows data up-migrated in last 300, 3600, and 86400 seconds
[Figure: Nutanix Advanced Pages. Image credit: https://nutanixbible.com]
  • 2009/h/traces: Stargate page to monitor activity traces for operations
  • 2009/h/vars: Stargate page to monitor various counters
  • 2010: Curator page for Curator runs
    • Need to be on the Curator master
    • Partial scan every 60 minutes
    • Full scan every 6 hours
    • Also triggered by:
      • Periodic (normal state)
      • Disk/Node/Block failure
      • ILM Imbalance
      • Disk/Tier Imbalance
    • Partial scans have a single MapReduce job
    • Full scans have four MapReduce jobs
  • 2010/master/control: Curator page for manually starting jobs
  • 2011: Chronos page monitors jobs/tasks scheduled by Curator
  • 2020: Cerebro page monitors PDs/rep status/DR
  • 2020/h/traces: Cerebro page to monitor activity traces for PD/rep
  • 2030: Main Acropolis page for details about hosts/tasks/networking
  • 2030/sched: Acropolis page with info about VM/resource scheduling/placement decisions
  • 2030/tasks: Acropolis page shows info about Acropolis tasks/state
  • 2030/vms: Information about Acropolis VMs
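The page addresses listed above all follow the same host, port, and path pattern. As a minimal sketch, the helper below assembles such an address. The name page_url is hypothetical (it is not a Nutanix command), and the http:// scheme is an assumption based on how these pages are normally opened in a browser.

```shell
#!/usr/bin/env bash
# page_url: build the address of a CVM service page.
# Hypothetical helper for illustration; not part of any Nutanix tooling.
#   $1 = CVM IP or DNS name
#   $2 = service port (2009 Stargate, 2010 Curator, 2020 Cerebro, ...)
#   $3 = optional sub-page path, with or without a leading slash
page_url() {
  local host="$1" port="$2" path="${3:-}"
  # ${path#/} strips one leading slash; ${path:+...} adds "/<path>"
  # only when a path was supplied.
  printf 'http://%s:%s%s\n' "$host" "$port" "${path:+/${path#/}}"
}

# Example (not executed here):
#   page_url 10.0.0.5 2009 vdisk_stats
```

The result can be opened in a browser or passed to a text-mode client. Whether the port is reachable from outside the CVM depends on the cluster's firewall settings.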

Cluster Commands

Cluster Status:

cluster status

Local CVM Status:

genesis status

Check Upgrade Status:

upgrade_status

Stop Cluster Service:

cluster stop [Service Name]

Start Stopped Cluster Services:

cluster start #NOTE: This will start all stopped services

Start Single Service:

cluster start [Service Name]

Restart Local Service:

genesis stop [Service Name]; cluster start
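A restart of a local service amounts to stopping it with genesis and then starting whatever is stopped. The sketch below wraps those two steps. The names run and restart_local_service and the DRY_RUN variable are hypothetical, and the sketch assumes that cluster start brings the stopped service back up, as noted under "Start Stopped Cluster Services".

```shell
#!/usr/bin/env bash
# run: execute a command, or only print it when DRY_RUN is set.
# Hypothetical helper that lets the sequence be previewed safely.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# restart_local_service: stop one service on the local CVM, then start
# all stopped services. Hypothetical wrapper, not a Nutanix command.
restart_local_service() {
  local service="${1:-}"
  if [ -z "$service" ]; then
    echo "usage: restart_local_service <service>" >&2
    return 1
  fi
  # Only run "cluster start" if the stop step succeeded.
  run genesis stop "$service" && run cluster start
}

# Preview without touching the cluster (not executed here):
#   DRY_RUN=1 restart_local_service stargate
```

Running with DRY_RUN set first is a cheap way to confirm the service name before anything is actually stopped.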

Find Cluster ID:

zeus_config_printer | grep cluster_id

Find AOS Version:

allssh "cat /etc/nutanix/release_version"

Find CVM Version:

allssh "cat /etc/nutanix/svm-version"

Run NCC Health Checks:

ncc health_checks run_all
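On a large cluster the output of cluster status is long, so it can help to show only the services that are not running. The sketch below filters for them. The name down_services is hypothetical, and it assumes each service line carries the word UP or DOWN as its state, so the filter should be checked against the real output on your AOS version.

```shell
#!/usr/bin/env bash
# down_services: read "cluster status" output on stdin and print only
# the lines that report a DOWN state.
# Hypothetical helper; assumes the state is printed as the word DOWN.
down_services() {
  # -w matches DOWN as a whole word; "|| true" keeps the exit status
  # at 0 when nothing is down (grep returns 1 on no match).
  grep -w 'DOWN' || true
}

# Example (not executed here):
#   cluster status | down_services
```

Empty output means no line matched, which under the stated assumption means every listed service is up.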

Logs

Acropolis logs from all CVMs:

allssh "cat ~/data/logs/Acropolis.log"

Error logs:

allssh "cat ~/data/logs/[COMPONENT].ERROR"

Fatal logs:

allssh "cat ~/data/logs/[COMPONENT].FATAL"
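Printing whole log files with cat is noisy when only the most recent failures matter. As a minimal sketch for the local CVM, the helper below prints the tail of every FATAL log in a directory. The name tail_fatals and its arguments are hypothetical, and the default directory assumes the ~/data/logs path used in the commands above.

```shell
#!/usr/bin/env bash
# tail_fatals: print the last N lines of every *.FATAL file in a log
# directory. Hypothetical helper, not a Nutanix command.
#   $1 = log directory (default: ~/data/logs)
#   $2 = number of lines per file (default: 20)
tail_fatals() {
  local dir="${1:-$HOME/data/logs}" lines="${2:-20}" f
  for f in "$dir"/*.FATAL; do
    [ -e "$f" ] || continue          # glob matched nothing
    printf '== %s ==\n' "${f##*/}"   # header with the bare file name
    tail -n "$lines" "$f"
  done
}

# Example (not executed here):
#   tail_fatals ~/data/logs 50
```

The same loop works for ERROR logs by changing the glob pattern.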

Storage Layers and Monitoring

VM Layer

  • Metrics reported by hypervisor for VM
  • Represent performance VM is seeing
  • Indicative of I/O
  • Usage: troubleshooting VM level

Hypervisor Layer

  • Metrics reported by hypervisor
  • Represent performance Hypervisor is seeing
  • Usage: detailed/valuable metrics

Controller Layer

  • Metrics from Nutanix Controllers
  • From CVMs
  • What front-end is seeing from NFS/SMB/iSCSI or backend operations (balancing, ILM).
  • Should normally match hypervisor layer
  • Usage: show backend operations

Disk Layer

  • Metrics reported by disk devices
  • Pulled from physical disks.
  • Data hitting OpLog or Extent Store.
  • Usage: See how many ops served from cache or disk
