Uptimey

By Steven Goodstein, DevOps Engineer

Uptime is a popular word among system administrators; their jobs in large part revolve around it. For those not familiar, uptime refers to the amount or percentage of time that systems remain up. But how do we know if any of our web applications are down in the first place? There are many creative ways to find out, but a common and reasonable approach is HTTP monitoring. HTTP monitoring is the process of periodically pinging a website over HTTP or HTTPS and checking the response that comes back.
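To make that concrete, here's a minimal sketch in Python of what a basic ping-style check might look like. This is purely illustrative rather than Uptimey's actual code, and the URL is just an example:

```python
import requests

def basic_check(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        # Connection errors, timeouts, bad DNS, etc. all count as "down".
        return False

print(basic_check("https://example.com"))
```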

Knowing when our services are down is critical, and we need to know in a timely manner. Every second that a system is down could be costing the business money. That said, not all HTTP monitoring is the same. Most ping monitors send a request and simply check that a 200 (OK) status code came back. However, we may want to check our services by sending custom headers and payloads, and by expecting custom responses back. There are many web monitors out there, but not all offer this flexibility. This is what led me to build Uptimey.
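Here's a rough sketch (again just illustrative) of the kind of advanced check I mean: a POST with custom headers and a payload, where the check only passes if the status code and part of the body match what we expect. The endpoint, token, and expected values below are made up for the example:

```python
import requests

def advanced_check(url, method="POST", headers=None, payload=None,
                   expected_status=200, expected_body=None, timeout=10.0):
    """Send a custom request and verify the response matches expectations."""
    try:
        response = requests.request(method, url, headers=headers,
                                    json=payload, timeout=timeout)
    except requests.RequestException:
        return False
    if response.status_code != expected_status:
        return False
    if expected_body is not None and expected_body not in response.text:
        return False
    return True

# Example: pass only if the API answers 201 and the body mentions "created".
ok = advanced_check("https://api.example.com/orders",
                    headers={"Authorization": "Bearer test-token"},
                    payload={"probe": True},
                    expected_status=201,
                    expected_body="created")
```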

Uptimey is a flexible HTTP monitoring system that lets you build advanced checks: custom headers and payloads, sent via multiple request methods, with custom expectations on the response. In this post I'll talk about how I developed the system and the technologies I used.

Building such an application raises a lot of logistical issues when it comes to scaling. What if the site is monitoring hundreds of thousands of checks? It takes real time to make a request and wait for the response, so how are we going to handle that load? To build a system like this I chose to use a series of worker apps that can be scaled independently. Each one does one small thing and consumes from a queue.
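Every worker follows the same basic pattern: pull a message off a queue, do one small piece of work, repeat. As a minimal sketch (the queue client and its pop method are placeholders, not the actual library Uptimey uses), each worker boils down to a loop like this:

```python
import json
import time

def run_worker(queue, handle_message):
    """Generic worker loop: pull one message at a time and process it."""
    while True:
        message = queue.pop()       # placeholder for the real queue client call
        if message is None:
            time.sleep(0.1)         # queue is empty, back off briefly
            continue
        handle_message(json.loads(message))
```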

Scaling the apps the traditional way would be difficult: provisioning new servers, deploying the apps onto them, and then load balancing between the servers wouldn't be ideal. Instead, I decided to build the whole system on cloud infrastructure. With the advances in containers and orchestration tools like Kubernetes, it can be configured, managed, and scaled with ease.

I started by building the user-facing side of the application. I built it on Rails with a Google Cloud SQL database, since everything would be living in Google's cloud. This app handles all the basic user tasks: it lets users sign up, log in, add a credit card for a subscription, change plans, and of course add and define checks. A built-in dashboard also helps users understand their uptime totals. From there I built the billing system. Since this application is a SaaS product, we decided to bill only on a monthly subscription. The billing system is a separate application that manages each user's balance and charges the customer on their payment date.
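As an illustration only (the database and payment-gateway helpers here are hypothetical, and the real billing code also has to deal with retries and failed cards), the heart of the billing application is a daily job along these lines:

```python
import datetime

def run_daily_billing(db, payment_gateway):
    """Charge every subscriber whose payment date is today (hypothetical helpers)."""
    today = datetime.date.today()
    for subscription in db.subscriptions_due_on(today):      # hypothetical query
        result = payment_gateway.charge(subscription.customer_id,
                                        subscription.plan_monthly_price)  # hypothetical
        if result.succeeded:
            db.mark_paid(subscription.id, today)
        else:
            db.flag_payment_failure(subscription.id)
```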

Now onto the actual service. This was the most interesting part to me. I built four small worker applications that handle different tasks.

The first application's job is to query the database for all checks that meet the time criterion. We offer checks with a frequency as quick as 15 seconds! So every 15 seconds the app queries for all of those checks, plus any others that fall on the same interval, such as a 30-second or 1-minute check. It then puts all of those checks onto a queue for processing.

The second application actually makes the HTTP requests. It loops constantly, pulling work from the queue. Since this application has to make a request at least every 15 seconds for every check in the database, it does a lot of work. We need these checks completed in a timely manner, and because of that load and timeliness it's the most important worker to scale. By adding replicas of this app, all pulling off the queue constantly, we should be able to handle any load. If, after analyzing the response, a check is deemed down, it's sent to another queue.

The third application pulls from that queue and double-checks that the service truly is down. If it is, it passes the check on to the final queue. The final queue is consumed by an application that handles alerts: if a check makes it this far, it's time to notify the customer. Users can receive both email and SMS alerts. This final application updates the check in the database to down or up and then sends alerts based on the user's preferences.
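To tie a couple of those pieces together, here's a simplified sketch of the first two workers. The database and queue clients are placeholders, and the real services handle batching, retries, and error reporting:

```python
import time
import requests

def scheduler_worker(db, check_queue):
    """Worker 1: every 15 seconds, enqueue every check whose interval is due."""
    while True:
        for check in db.checks_due_now():        # placeholder query
            check_queue.push(check)
        time.sleep(15)

def checker_worker(check_queue, down_queue):
    """Worker 2: pull checks, make the HTTP request, escalate failures."""
    while True:
        check = check_queue.pop()
        if check is None:
            time.sleep(0.1)                      # nothing to do right now
            continue
        try:
            response = requests.request(check["method"], check["url"],
                                        headers=check.get("headers"),
                                        json=check.get("payload"),
                                        timeout=10)
            healthy = response.status_code == check.get("expected_status", 200)
        except requests.RequestException:
            healthy = False
        if not healthy:
            down_queue.push(check)               # worker 3 will re-verify it
```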

That's pretty much it. With a series of six applications I've built Uptimey, an HTTP monitoring system. Keeping the applications small was a great approach for me because they can all be scaled independently. The worker making the requests needs to run at high scale, but something like the alert worker doesn't, since it only does work when something goes down or comes back up. Using Kubernetes and Google Cloud makes scaling simple and even allows for autoscaling parameters. Hopefully you liked the post, and feel free to sign up for Uptimey yourself (uptimey.com).