Story of an Organization
Learning Thoughts (a fictitious company) has an application, lt-hrms, a human resource management system used by multiple organizations
Learning Thoughts hosts lt-hrms for these organizations and has the following team
Customers of LT are
The architecture of the application is as follows
Learning Thoughts ships a new release every two weeks
The following issues are often observed to occur randomly
- Some functionality stops working
- Some server disks fill up until no space is left, so the servers do not respond
- On some servers, CPU utilization stays above 95% all the time and users experience slow or unresponsive behavior
- And many more…
During peak times, some users face errors such as request timeouts.
Hardware/Network failures happen randomly.
Now we need to find a solution
- to stop as many failures as possible from occurring
- to resolve failures as early as possible when they do occur
We need to have monitoring in place to
- monitor systems (whether they are up or not)
- monitor the health of the application
- monitor system resources
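To make the three monitoring goals above concrete, here is a minimal sketch in plain Python using only the standard library. It is an illustration, not how the Elastic Stack collects metrics. The function names, the metric names ("cpu", "disk"), and the disk limit of 90% are assumptions made for this example; only the 95% CPU figure comes from the scenario above.

```python
import shutil
import socket


def disk_used_percent(path="/"):
    """Return the percentage of disk space in use on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (a crude 'is it up' check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_thresholds(metrics, limits):
    """Return the names of metrics whose value is above the configured limit."""
    return [
        name
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    ]


# Hypothetical limits: 95% CPU mirrors the scenario, 90% disk is an assumed value.
LIMITS = {"cpu": 95.0, "disk": 90.0}
```

A scheduler could call `disk_used_percent()` and `is_port_open()` every minute, pass the collected values to `check_thresholds(metrics, LIMITS)`, and raise an alert for each name returned. A real monitoring system does the same thing at scale, with storage, dashboards, and alert routing added.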
Applications generally write logs that do not follow any standard format. Reading free text is tricky, and extracting meaningful information from it is quite difficult, so in the majority of cases humans find issues by going through the logs manually
So Learning Thoughts has decided to use a monitoring system that can not only read metrics but also parse log files and help find error patterns in the logs.
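As a rough idea of what "parsing logs and finding error patterns" means, the sketch below turns free-text log lines into structured fields and groups similar errors. It is plain Python, not the Elastic Stack's own mechanism, and the log layout (date, time, level, message) and sample lines are assumptions invented for this example.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def error_patterns(lines):
    """Count ERROR messages, treating messages that differ only in numbers as one pattern."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group("level") != "ERROR":
            continue  # skip unparseable lines and non-error levels
        # Replace every run of digits so "user 17" and "user 99" group together.
        pattern = re.sub(r"\d+", "<n>", match.group("msg"))
        counts[pattern] += 1
    return counts


# Hypothetical sample lines, made up for illustration.
SAMPLE = [
    "2024-01-01 10:00:00 ERROR Timeout after 30 s for user 17",
    "2024-01-01 10:00:05 INFO Login ok for user 18",
    "2024-01-01 10:01:00 ERROR Timeout after 45 s for user 99",
    "garbage line",
]
```

Calling `error_patterns(SAMPLE)` groups the two timeout lines under a single pattern, which is the kind of summary a person would otherwise build by reading the logs line by line.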
We will look at how to identify, troubleshoot, and resolve failures with the Elastic Stack (which can do monitoring, APM, log parsing, alerting…)