Zivame alert runbook
Below are the alerts we receive regularly.
1) 5xx alarms:-
a) Check the logs in 24/7.
b) In a browser, open https://go.zivame.org/apilogs
c) Adjust the time range as needed; the time selector is at the top right.
d) The tool is query-driven. Write a query in the box in the middle of the page and run it to fetch the logs. Sample queries are below; modify them as needed.
Examples of queries:
a) logtype="Java API Access Logs" and responsecode=500 and requesturl NOTCONTAINS "marketplace"
b) logtype="Java API Access Logs" and responsecode=500 and requesturl NOTCONTAINS "marketplace" groupby machineip1
c) logtype="Node FE Access Logs" and requesturi NOTCONTAINS "market place"
d) logtype="Node FE Access Logs" and requesturi NOTCONTAINS "market place" groupby remotehost
Note:- Replace the values accordingly.
If this is occurring frequently, you can tag the entire team.
2) Cron alerts:-
When a cron fails, first check why it failed. Go to the failed step, click on it, and you will see the reason. Take a screenshot or copy the error and paste it in the group, informing the team that the job is failing due to that error. If the error makes the solution obvious, you can also suggest the fix.
For example:- if the job is still running when the next cron schedule triggers and the new run fails, you can report that the failure is due to a schedule conflict.
Inform the team. Please check the below image for understanding.
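As a side note, the schedule-conflict failure in the example above is often prevented with a lock: each run takes an exclusive lock and exits early if the previous run still holds it. A minimal sketch in Python (the lock path is hypothetical, not taken from our crons):

```python
import fcntl

LOCK_PATH = "/tmp/zivame_cron.lock"  # hypothetical path, adjust per job

def acquire_lock(path=LOCK_PATH):
    """Try to take an exclusive, non-blocking lock for this cron run.

    Returns an open file handle (keep it open for the whole run), or
    None if a previous run still holds the lock."""
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fh
    except BlockingIOError:
        fh.close()
        return None

# Usage inside the job: skip cleanly instead of failing with a conflict.
# lock = acquire_lock()
# if lock is None:
#     print("previous run still in progress, skipping this schedule")
```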
3) DB Connections Alarm:-
If you get this alarm, click on it and check the graph of the connection count. If the count stays high for a long time and keeps increasing, inform the team in the group and tag the relevant person.
Some background information:
The DatabaseConnections metric determines the number of database connections in use. For an optimal workload, the number of current connections should not exceed approximately 80% of your maximum connections.
The max_connections parameter determines the maximum number of connections permitted in Amazon RDS.
The default value of this parameter depends on the total RAM of the instance. For example, for Amazon RDS MySQL instances, the default value is derived by the formula {DBInstanceClassMemory/12582880}.
For reference, the link below lists the max_connections limit per instance class. We are currently on the db.m5.2xlarge instance class.
https://sysadminxpert.com/aws-rds-max-connections-limit/
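To make the formula above concrete, here is a small sketch deriving the default max_connections and the ~80% alarm threshold. The 32 GiB figure for db.m5.2xlarge is an assumption from the EC2 instance specs, and DBInstanceClassMemory in RDS is slightly below the raw instance RAM, so the real default can be a bit lower:

```python
def rds_mysql_default_max_connections(instance_memory_bytes):
    """RDS MySQL default max_connections = {DBInstanceClassMemory/12582880}."""
    return instance_memory_bytes // 12582880

GIB = 1024 ** 3
# db.m5.2xlarge: 32 GiB RAM (assumption; DBInstanceClassMemory is slightly less)
max_conn = rds_mysql_default_max_connections(32 * GIB)
threshold = int(max_conn * 0.8)  # the ~80% guideline from above
```

Under these assumptions the default works out to roughly 2730 connections, so sustained counts above ~2184 are worth escalating.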
If you have access to the resources:
a) Log in to the replica or master with the below command
mysql -h hostname -P 3306 -u username -ppassword
Note:- Please replace the hostname, username and password accordingly.
b) Check the connections with the below query.
show status where `variable_name` = 'Threads_connected';
c) Also check the queries which are running.
show processlist;
show full processlist;
4) Down alert:-
You will get the down alert from 24/7. When you get the alert, check which instance is down and inform the team.
Note:- FYI, a few instances are managed through spot.io with scheduling enabled; these are spot instances. For example, if scheduling is set to stop at 9:00 PM and start at 8:00 AM, the instances stop and start accordingly. If a monitor is added in 24/7 and one of these instances stops, you will get a down alert in 24/7. If you have access to AWS EC2, copy the instance ID and check whether the instance is still available. If it is not, you can suspend the monitor in 24/7; the next day the instance will start with a new instance ID.
Whether the instance is in spot.io or plain AWS, inform the team whenever you get a down alert.
5) Write IOPS alarms:-
When you get this alarm, inform the group about it.
Some background information:
Storage type and size govern IOPS allocation in Amazon RDS SQL Server, Oracle, MySQL, MariaDB, and PostgreSQL instances. With General Purpose SSD storage, baseline IOPS are calculated as three times the amount of storage in GiB. For optimal instance performance, the sum of ReadIOPS and WriteIOPS should be less than the allocated IOPS.
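As a rough illustration of the 3-IOPS-per-GiB rule above, here is a sketch. The 100-IOPS floor and 16,000 ceiling are the published gp2 limits; treat the result as an approximation, not an exact alarm threshold:

```python
def gp2_baseline_iops(storage_gib):
    """General Purpose SSD (gp2) baseline: 3 IOPS per GiB of storage,
    floored at 100 and capped at 16,000 per the AWS documentation."""
    return min(max(3 * storage_gib, 100), 16000)

def within_allocated_iops(read_iops, write_iops, storage_gib):
    """True while ReadIOPS + WriteIOPS stays below the allocated baseline."""
    return (read_iops + write_iops) < gp2_baseline_iops(storage_gib)

# e.g. a 500 GiB gp2 volume has a 1500 IOPS baseline
```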
If you have access to the resources, follow the below steps.
a) Log in to the resource that is raising the alarm with the below command
mysql -h hostname -P 3306 -u username -ppassword
Note:- Please replace the hostname, username and password accordingly.
b) Run the below queries and check for write queries that are taking a long time. Inform the team about those queries.
show processlist;
show full processlist;
6) Read IOPS alarms:-
When you get this alarm, inform the group about it.
If you have access to the resources, follow the below steps.
a) Log in to the resource that is raising the alarm with the below command
mysql -h hostname -P 3306 -u username -ppassword
Note:- Please replace the hostname, username and password accordingly.
b) Run the below queries and check for read queries that are taking a long time. Inform the team about those queries.
show processlist;
show full processlist;
7) Replication Lag:-
When you get this alarm, inform the team about it.
Some background information:
Sometimes the replica cannot keep up with the primary DB instance; the result is replication lag.
When you use an Amazon RDS for MySQL read replica with binary log file position-based replication, you can monitor replication lag.
The ReplicaLag metric reports the value of the Seconds_Behind_Master field of the SHOW SLAVE STATUS command.
The Seconds_Behind_Master field shows the difference between the current timestamp on the replica DB instance and the original timestamp logged on the primary DB instance for the event being processed on the replica.
a) If you have the access to the resources, login to the resources with the below command
mysql -h hostname -P 3306 -u username -ppassword
b) Identify which replication thread is lagging. Run the SHOW MASTER STATUS command on the primary DB instance and review the output. The sample below is taken from the AWS documentation.
Note: In the example output, the source or primary DB instance is writing the binary logs to the file mysql-bin.066552.
c) Run the SHOW SLAVE STATUS command on the replica DB instance and review the output:
In Example 1, the Master_Log_File: mysql-bin.066548 indicates that the replica IO_THREAD is reading from the binary log file mysql-bin.066548. The primary DB instance is writing the binary logs to the mysql-bin.066552 file. This output shows that the replica IO_THREAD is behind by 4 binlogs. However, the Relay_Master_Log_File is mysql-bin.066548, which indicates that the replica SQL_THREAD is reading from the same file as the IO_THREAD. This means that the replica SQL_THREAD is keeping up, but the replica IO_THREAD is lagging behind.
Example 2 shows that the primary instance's log file is mysql-bin-changelog.066552. The output shows that IO_THREAD is keeping up with the primary DB instance.
In the replica output, the SQL thread is performing Relay_Master_Log_File: mysql-bin-changelog.066530. As a result, SQL_THREAD is lagging behind by 22 binary logs.
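The binlog arithmetic in the two examples above can be captured in a small helper: compare the primary's File (from SHOW MASTER STATUS) against Master_Log_File and Relay_Master_Log_File (from SHOW SLAVE STATUS). This is only a sketch for file-name patterns like the ones shown above:

```python
import re

def binlog_number(filename):
    """Extract the numeric suffix from a binlog name such as 'mysql-bin.066548'."""
    match = re.search(r"(\d+)$", filename)
    if match is None:
        raise ValueError(f"unexpected binlog file name: {filename}")
    return int(match.group(1))

def replication_lag_in_binlogs(master_file, io_thread_file, sql_thread_file):
    """Return (io_thread_lag, sql_thread_lag) counted in binlog files.

    master_file:     'File' from SHOW MASTER STATUS on the primary
    io_thread_file:  'Master_Log_File' from SHOW SLAVE STATUS
    sql_thread_file: 'Relay_Master_Log_File' from SHOW SLAVE STATUS"""
    io_lag = binlog_number(master_file) - binlog_number(io_thread_file)
    sql_lag = binlog_number(io_thread_file) - binlog_number(sql_thread_file)
    return io_lag, sql_lag

# Example 1 above: IO_THREAD is 4 binlogs behind, SQL_THREAD keeps up with it
# Example 2 above: IO_THREAD is caught up, SQL_THREAD is 22 binlogs behind
```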
Normally, IO_THREAD doesn't cause large replication delays, because the IO_THREAD only reads the binary logs from the primary or source instance. However, network connectivity and network latency can affect the speed of the reads between the servers. The IO_THREAD replica could be performing slower because of high bandwidth usage.
If the replica SQL_THREAD is the source of replication delays, then those delays could be caused by the following:
Long-running queries on the primary DB instance
Insufficient DB instance class size or storage
Parallel queries run on the primary DB instance
Binary logs synced to the disk on the replica DB instance
Binlog_format on the replica is set to ROW
Replica creation lag
d) Check for queries that are taking a long time with the below commands.
show processlist;
show full processlist;
Long-running queries on the primary DB instance that take an equal amount of time to run on the replica DB instance can increase seconds_behind_master. For example, if you initiate a change on the primary instance and it takes an hour to run, then the lag is one hour. Because the change might also take one hour to complete on the replica, by the time the change is complete, the total lag is approximately two hours. This is an expected delay, but you can minimize this lag by monitoring the slow query log on the primary instance. You can also identify long-running statements to reduce lag.
After checking all the above, inform the team about the findings.
8) Unhealthy host count:-
If you find any unhealthy host count, report it immediately in the group.
Inform the DevOps team regarding the issue.
If you have access to the ELB, check which instances are not in service.
a) Go to the Load Balancers section and select the load balancer.
b) Select the Instances tab and see which instances are not in service.
c) Remove them from the load balancer.
d) If you have access to the servers, check the application service status, application logs, and Nginx logs. If you do not have access, a DevOps person, the backend team, or the FE team will troubleshoot further.
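If you prefer scripting the console steps above, a classic-ELB health check can be fetched with boto3's elb client. Below is a sketch of the filtering step; the boto3 call is commented out because it needs AWS credentials, and the load balancer name in it is hypothetical:

```python
def out_of_service(instance_states):
    """Return the instance IDs that are not InService.

    instance_states follows the shape of boto3
    elb.describe_instance_health()['InstanceStates']."""
    return [s["InstanceId"] for s in instance_states if s["State"] != "InService"]

# Live usage sketch (requires boto3 and AWS credentials):
# import boto3
# elb = boto3.client("elb")
# resp = elb.describe_instance_health(LoadBalancerName="prod-web-elb")  # hypothetical name
# print(out_of_service(resp["InstanceStates"]))
```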
9) High CPU utilization:-
When you get this alert, inform the group, and if you have access to the instances, follow the below steps.
a) Run the command top, htop, or sar.
b) Check the CPU utilization and process statistics. Inform the team which process is causing the load.
c) Monitor for some time. If the process needs to be killed, kill it. If you see this issue frequently, upgrade the instance type to the next size.
10) High Memory utilization:-
When you get this alert, inform the group, and if you have access to the instances, follow the below steps.
a) Run the command free -m or free -h and check the memory usage.
b) Run the top or htop command and check the memory utilization and process statistics. Inform the team which process is causing the usage.
c) Monitor for some time. If the process needs to be killed, kill it. If you see this issue frequently, upgrade the instance type to the next size.
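The free output above is derived from /proc/meminfo. Here is a small sketch of the same check, written as a pure function so it can be reused in monitoring scripts; the 90% threshold in the usage note is an illustrative assumption, not a team standard:

```python
def mem_used_percent(meminfo_text):
    """Compute used-memory percent from /proc/meminfo contents,
    using MemTotal and MemAvailable (both reported in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "MemAvailable"):
            fields[key] = int(rest.split()[0])
    return 100.0 * (fields["MemTotal"] - fields["MemAvailable"]) / fields["MemTotal"]

# Live usage sketch on a Linux instance:
# with open("/proc/meminfo") as f:
#     pct = mem_used_percent(f.read())
# if pct > 90:  # illustrative threshold
#     print(f"memory usage high: {pct:.1f}%")
```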
11) Latency Alerts:-
Check the graph. If the latency is on the FE, inform the FE team; otherwise inform the backend team. Monitor and follow up with the team.
12) SQS alerts:-
When you get this alert, inform the team. The AWS consumer service needs to be restarted on the "return cron prod" server. After the restart the metric should come down. Sometimes it takes time to come down; wait, or try restarting again.
