Agave Watchtower
Agave Watchtower is a monitoring tool that regularly checks the health of your validator. It can watch your validator for delinquency and then notify you on your application of choice: Slack, Discord, Telegram or Twilio. Additionally, agave-watchtower can monitor the health of the entire cluster so that you are aware of any cluster-wide problems.
Getting Started
To get started with Agave Watchtower, run agave-watchtower with your validator's identity public key. The command will monitor your validator, but you will not get notifications unless you have set the environment variables mentioned in agave-watchtower --help.
Best Practices
If you run agave-watchtower on the same computer as your agave-validator process, then during a catastrophic event such as a power outage you will not be aware of the issue, because your agave-watchtower process will stop at the same time as your agave-validator process.
Additionally, while running the agave-watchtower process manually with environment variables set in the terminal is a good way to test out the command, it is not operationally sound because the process will not be restarted when the terminal closes or during a system restart.
Setup Telegram Notifications
To send validator health notifications to your Telegram account, create a bot, add it to a group, and pass the bot token and the group's chat id to agave-watchtower through environment variables.
In Telegram, ask @BotFather to create a new bot and come up with a name for it. The only requirement is that the name cannot have dashes or spaces, and it must end in the word bot. Many names have already been taken, so you may have to try a few.
Once you find an available name, you will get a response from @BotFather that includes a link to chat with the bot as well as a token for the bot. Take note of the token.
Find the bot in Telegram and send it the following message: /start. Messaging the bot will help you later when looking for the bot chatroom id.
In Telegram, click on the new message icon and then select new group. Find your newly created bot and add the bot to the group. Next, name the group whatever you’d like.
Recall the HTTP API token from @BotFather; agave-watchtower reads it from the TELEGRAM_BOT_TOKEN environment variable. The token will have this format: 389178471:MMTKMrnZB4ErUzJmuFIXTKE6DupLSgoa7h4o.
Next, you need the chat id for your group. First, send a message to your bot in the chat group that you created. Something like @newvalidatorbot hello.
Next, in your browser, go to https://api.telegram.org/bot<HTTP API Token>/getUpdates. Make sure to replace <HTTP API Token> with your API token. Also make sure that you include the word bot in the URL before the API token.
The response should be in JSON. Search for the string "chat": in the JSON. The id value of that chat is your TELEGRAM_CHAT_ID. It will be a negative number like: -781559558. Remember to include the negative sign!
If you cannot find "chat": in the JSON, then you may have to remove the bot from your chat group and add it again.
Once your environment variables are set, restart agave-watchtower. You should see output about your validator.
To test that your Telegram configuration is working properly, you could stop your validator briefly until it is labeled as delinquent. Up to a minute after the validator is delinquent, you should receive a message in the Telegram group from your bot. Start the validator again and verify that you get another message saying all clear.
Key Metrics to Monitor
Check Gossip
Confirm that the IP address and identity pubkey of your validator are visible in the gossip network by running solana gossip.
Check Balance
Your account balance should decrease by the transaction fee amount as your validator submits votes, and increase after serving as the leader. Check it with solana balance.
Check Vote Activity
The solana vote-account command displays the recent voting activity from your validator.
Check Validator Status
View all validators and find yours with the solana validators command. The output shows:
- Active stake
- Vote credits earned
- Commission
- Last vote
- Root slot
- Skip rate
Monitor Catchup Status
The solana catchup command is useful for seeing how quickly your validator is processing blocks.
solana gossip and solana validators.
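The solana catchup command reports this progress itself. As a rough illustration of the arithmetic behind it, the following sketch estimates how fast a validator is closing the gap to the cluster from two observations. The SlotSample type and the function names are hypothetical, and the slot numbers are assumed to come from whatever source you prefer (for example the getSlot JSON-RPC method queried on your validator and on a reference endpoint).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotSample:
    """One observation (hypothetical type, not part of any Solana tooling)."""
    timestamp: float   # wall-clock seconds, e.g. from time.time()
    local_slot: int    # slot reported by your validator
    cluster_slot: int  # slot reported by a reference cluster endpoint


def slots_behind(sample: SlotSample) -> int:
    """Distance between the cluster tip and the local validator."""
    return sample.cluster_slot - sample.local_slot


def catchup_rate(first: SlotSample, second: SlotSample) -> float:
    """Slots of distance closed per second; negative means falling behind."""
    elapsed = second.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (slots_behind(first) - slots_behind(second)) / elapsed


def seconds_to_catch_up(first: SlotSample, second: SlotSample) -> Optional[float]:
    """Estimated seconds until the distance reaches zero.

    Returns 0.0 if already caught up and None if the gap is not closing.
    """
    remaining = slots_behind(second)
    if remaining <= 0:
        return 0.0
    rate = catchup_rate(first, second)
    if rate <= 0:
        return None
    return remaining / rate
```

A linear estimate like this is only a guide: the real rate varies with cluster load and with how much of the ledger the validator still has to replay.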
Check Leader Schedule
To see when your validator is scheduled to be leader, use solana leader-schedule and look for your identity pubkey.
Using JSON-RPC Endpoints
There are several useful JSON-RPC endpoints for monitoring your validator.
Get Cluster Nodes
Get Vote Accounts
If the validator is voting, it should appear in the list of current vote accounts. If staked, stake should be greater than 0.
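The check described above can be sketched in Python with only the standard library. This is an illustrative sketch, not the documented procedure: rpc_request, vote_status, activated_stake and SAMPLE_RESULT are hypothetical names, and the sample pubkeys are placeholders. It assumes the getVoteAccounts result has current and delinquent lists whose entries carry nodePubkey and activatedStake (in lamports) fields.

```python
import json
import urllib.request


def rpc_request(url: str, method: str, params=None) -> dict:
    """POST one JSON-RPC 2.0 request and return the decoded response."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def vote_status(result: dict, node_pubkey: str) -> str:
    """Return 'current', 'delinquent' or 'missing' for the given identity."""
    for group in ("current", "delinquent"):
        for account in result.get(group, []):
            if account.get("nodePubkey") == node_pubkey:
                return group
    return "missing"


def activated_stake(result: dict, node_pubkey: str) -> int:
    """Sum of activated stake (lamports) across the identity's vote accounts."""
    return sum(
        account.get("activatedStake", 0)
        for group in ("current", "delinquent")
        for account in result.get(group, [])
        if account.get("nodePubkey") == node_pubkey
    )


# Placeholder data shaped like a getVoteAccounts result (pubkeys are made up).
SAMPLE_RESULT = {
    "current": [
        {"nodePubkey": "ExampleIdentity1111", "votePubkey": "ExampleVote1111",
         "activatedStake": 42_000_000_000},
    ],
    "delinquent": [
        {"nodePubkey": "ExampleIdentity2222", "votePubkey": "ExampleVote2222",
         "activatedStake": 0},
    ],
}
```

Against a live node, the result would come from something like rpc_request("http://localhost:8899", "getVoteAccounts")["result"], assuming the RPC service listens on its default port 8899.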
Get Leader Schedule
Get Epoch Info
getEpochInfo returns information about the current epoch. slotIndex should progress on subsequent calls.
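One way to automate the slotIndex check is to compare two getEpochInfo results taken some seconds apart. The sketch below assumes the result carries epoch, slotIndex and slotsInEpoch fields; the function names and the sample values are illustrative only.

```python
# Request body for the getEpochInfo JSON-RPC method.
REQUEST = {"jsonrpc": "2.0", "id": 1, "method": "getEpochInfo"}


def has_progressed(previous: dict, current: dict) -> bool:
    """True if the node advanced between two getEpochInfo results.

    Comparing (epoch, slotIndex) pairs handles the epoch boundary, where
    slotIndex resets to a small value while epoch increases by one.
    """
    return (current["epoch"], current["slotIndex"]) > (
        previous["epoch"],
        previous["slotIndex"],
    )


def epoch_progress(info: dict) -> float:
    """Fraction of the current epoch that has elapsed, from 0.0 to 1.0."""
    return info["slotIndex"] / info["slotsInEpoch"]


# Illustrative values shaped like getEpochInfo results.
EARLIER = {"epoch": 27, "slotIndex": 2048, "slotsInEpoch": 8192}
LATER = {"epoch": 27, "slotIndex": 2060, "slotsInEpoch": 8192}
NEXT_EPOCH = {"epoch": 28, "slotIndex": 5, "slotsInEpoch": 8192}
```

If has_progressed stays False across calls spaced well apart, the node is likely stalled and the logs are the next place to look.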
Collecting Metrics
It is important to collect metrics: it helps diagnose existing problems and allows you to anticipate future ones.
metrics.solana.com
There are several public dashboards available; one of them is hosted at metrics.solana.com. Reporting to the solana.com public dashboard is required if you participate in the Solana Foundation Delegation Program. To use it, set the $SOLANA_METRICS_CONFIG variable in your validator’s environment (e.g. at the beginning of your validator.sh script).
Refer to the available Solana clusters documentation to get the appropriate value of $SOLANA_METRICS_CONFIG for your validator.
Prometheus and Grafana
Many operators set up their own Prometheus and Grafana stack to collect and visualize metrics. Port 8899 is the validator's default JSON-RPC port, so Prometheus usually scrapes an exporter that queries that port rather than the validator itself.
Log Analysis
Viewing Logs
If running as a systemd service, view the logs with journalctl.
Log Output Tuning
The messages that a validator emits to the log can be controlled by the RUST_LOG environment variable. Details can be found in the documentation for the env_logger Rust crate.
Common Log Messages to Monitor
Error Messages
Grep for errors in your logs.
Leader Slot Messages
Look for messages indicating your next leader slot.
Version Information
Verify the validator version from logs.
Performance Optimization
Monitor System Resources
CPU Usage
Find the agave-validator process and check CPU usage. It should be using multiple cores effectively.
Memory Usage
Disk I/O
High %util or await times can indicate bottlenecks.
Network Usage
CPU Performance Tuning
If the PoH hashes/second rate is slower than the cluster target, set the CPU frequency governor to performance.
System Clock
Large system clock drift can prevent a node from properly participating in Solana’s gossip protocol. Ensure that your system clock is accurate.
Alerting Best Practices
Critical Alerts
Set up alerts for:
- Validator delinquency - Most critical, indicates your validator has fallen behind
- Low identity account balance - Prevent running out of voting funds
- High skip rate - Indicates performance issues
- Validator offline - Process crashed or machine is down
- Disk space low - Prevent running out of space
Warning Alerts
Set up warnings for:
- Higher than normal skip rate
- Slower catchup speed
- High CPU or memory usage
- High disk I/O wait times
- Network bandwidth saturation
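The critical and warning conditions listed above could be encoded as a small rule table. The following Python sketch is only an illustration: the Snapshot fields, the evaluate function and every threshold value are assumptions chosen for the example, not recommended settings, and collecting the underlying measurements is left to your monitoring stack.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    """One reading of validator health (hypothetical structure)."""
    process_running: bool
    delinquent: bool
    identity_balance_sol: float
    skip_rate_pct: float
    disk_free_pct: float
    cpu_pct: float


# Hypothetical thresholds; tune them to your own hardware and cluster.
MIN_BALANCE_SOL = 1.0
CRITICAL_SKIP_RATE_PCT = 20.0
WARNING_SKIP_RATE_PCT = 10.0
MIN_DISK_FREE_PCT = 10.0
WARNING_CPU_PCT = 90.0


def evaluate(s: Snapshot) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for every rule the snapshot trips."""
    alerts: List[Tuple[str, str]] = []
    if not s.process_running:
        alerts.append(("critical", "validator offline"))
    if s.delinquent:
        alerts.append(("critical", "validator delinquent"))
    if s.identity_balance_sol < MIN_BALANCE_SOL:
        alerts.append(("critical", "low identity account balance"))
    if s.skip_rate_pct >= CRITICAL_SKIP_RATE_PCT:
        alerts.append(("critical", "high skip rate"))
    elif s.skip_rate_pct >= WARNING_SKIP_RATE_PCT:
        alerts.append(("warning", "skip rate above normal"))
    if s.disk_free_pct < MIN_DISK_FREE_PCT:
        alerts.append(("critical", "disk space low"))
    if s.cpu_pct >= WARNING_CPU_PCT:
        alerts.append(("warning", "high CPU usage"))
    return alerts
```

Keeping the thresholds as named constants makes it easy to review them alongside the list of alerts and to adjust them as the validator's normal behavior becomes clearer.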
Response Procedures
Document your response procedures for common alerts:
- Validator delinquent - Check logs, verify network, restart if needed
- Low balance - Transfer SOL to identity account
- Out of disk space - Clean up old snapshots or ledger data
- High skip rate - Check system resources, network connection