DevOps Classroom Series – 14/Sept/2021

Story of an Organization

  • LearningThoughts (a fictitious company) has an application, lt-hrms, a human resource management system used by multiple organizations

  • Learning Thoughts hosts the lt-hrms application itself and has the following team: (image: team structure)

  • Customers of LT are: (image: customers)

  • The architecture of the application is as follows: (image: application architecture)

  • Learning Thoughts ships a new release every two weeks

  • The following issues are often observed to occur at random

    • Functionality stops working
    • Some server disks fill up, leaving no free space, so the servers do not respond
    • On some servers, CPU utilization stays above 95% all the time and users experience slow or unresponsive behavior
    • And many more…
  • During peak times, some users face request timed out errors, etc.

  • Hardware/Network failures happen randomly.

  • Now we need to find a solution

    • to prevent as many failures as possible from occurring
    • when failures do occur, to resolve them as early as possible
  • We need to have monitoring in place to

    • monitor systems (whether they are up or not)
    • monitor the health of the application
    • monitor system resources
      • CPU
      • Memory
      • Storage
      • Network
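
To make the resource checks above concrete, here is a minimal sketch in Python using only the standard library. It is not how the Elastic Stack collects metrics. The function names and the 90% disk threshold are assumptions made for illustration; the 95% CPU figure echoes the symptom described earlier. The 1-minute load average is only a rough proxy for CPU utilization and is available on Unix-like systems only.

```python
import os
import shutil

# Hypothetical thresholds; the 95% CPU figure echoes the symptom described above.
DISK_LIMIT_PCT = 90.0
CPU_LIMIT_PCT = 95.0


def disk_used_percent(path="/"):
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cpu_load_percent():
    """Rough CPU load estimate from the 1-minute load average (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return 100.0 * load_1min / cores


def find_alerts(disk_pct, cpu_pct):
    """Compare readings against the thresholds and return alert messages."""
    alerts = []
    if disk_pct >= DISK_LIMIT_PCT:
        alerts.append(f"disk usage at {disk_pct:.1f}%")
    if cpu_pct >= CPU_LIMIT_PCT:
        alerts.append(f"CPU load at {cpu_pct:.1f}%")
    return alerts


if __name__ == "__main__":
    for message in find_alerts(disk_used_percent("/"), cpu_load_percent()):
        print("ALERT:", message)
```

A script like this only checks one server at one moment. A monitoring system runs such checks continuously across all servers and keeps the history, which is what makes trends such as a slowly filling disk visible before the server stops responding.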
  • Applications generally write logs that do not follow any standard format. Reading free text is tricky and extracting meaningful information from it is difficult, so in the majority of cases people find issues by going through the logs manually

  • So Learning Thoughts has decided to use a monitoring system that can not only read metrics but also parse log files and help find error patterns in the logs.

  • We will try to understand how to identify, troubleshoot, and resolve failures with the Elastic Stack (which can do monitoring, APM, log parsing, alerting…)
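
To illustrate what "parsing log files and finding error patterns" means, here is a small sketch in plain Python. It is not Elastic Stack configuration, only the underlying idea. The log line shape (`<date> <time> <LEVEL> <message>`), the function names, and the sample lines are all assumptions made for illustration; real lt-hrms logs may look quite different.

```python
import re
from collections import Counter

# Assumed line shape: "<date> <time> <LEVEL> <message>". Real logs will differ.
LINE_PATTERN = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def normalize(message):
    """Replace digit runs with a placeholder so similar errors group together."""
    return re.sub(r"\d+", "<N>", message)


def count_error_patterns(lines):
    """Count normalized ERROR messages; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[normalize(match.group("message"))] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    "2021-09-14 10:00:01 INFO user 42 logged in",
    "2021-09-14 10:00:05 ERROR request 1001 timed out after 30 s",
    "2021-09-14 10:00:09 ERROR request 1002 timed out after 30 s",
    "2021-09-14 10:01:12 ERROR no space left on device",
    "a line that does not follow the assumed format",
]

if __name__ == "__main__":
    for pattern, count in count_error_patterns(SAMPLE_LINES).most_common():
        print(count, pattern)
```

Turning free text into named fields (date, level, message) is the step that makes logs searchable, and grouping normalized messages is one simple way to surface recurring errors. Every application with a different log format needs its own pattern, which is part of why a dedicated log-parsing tool is attractive.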
