DDoS incident report August 18th, 2011

Posted by pieter on August 19th, 2011

Summary

On August 18, 2011, our network was the target of a distributed denial-of-service (DDoS) attack from a large number of hosts in Pakistan and India. The attack started around 18:30 UTC, and our monitoring ran with degraded performance between 19:00 and 20:20 UTC. After we intentionally brought down our portal to restore the check frequency to normal levels, service returned to normal and the messages queued for delivery were sent out via the remote gateways.

With the help of our hosting provider RackSpace, our team was able to mitigate the attack using blacklists and to identify the IPs being targeted, which allowed us to bring back the portal pages. At the time of writing, the attack is still ongoing, with traffic at 3 to 6 times our usual level. We continue to watch the situation closely so that we can respond quickly if it changes.

What we have learned so far

DDoS attacks are difficult to control in general, but we’ve learned a lot from these events. The biggest issue was that our fail-over location did not activate, because the core services were still running. We will investigate how we can improve this without causing unnecessary duplicate probes or alarms to be sent out.

Second, we learned that our main portal services are located too close to the core monitoring services in our network, so one can affect the other. We now plan to physically separate these services, so that we do not have to bring down our portal in the future in order to free bandwidth for the monitoring services.

That said, I want to give a huge thanks to the stand-by team (Kalina, Dimi and Stratos), who greatly helped reduce the impact of the attack so far by working as a team on several different tracks in parallel. I also want to thank RackSpace for the help of their knowledgeable and fanatical support team.

Timeline

  • 18:34 UTC Response team was first alerted about reduced connectivity to our systems (30-60% packet loss).
  • 18:46 UTC Contacted RackSpace support.
  • 18:59 UTC RackSpace identified the issue as a DDoS attack from the Pakistan/India region and added an initial set of /16s to our blacklist in an attempt to mitigate the attack.
  • 19:20 UTC Continuously adding /24 subnets to our blacklist.
  • 20:01 UTC Discussed placing an additional protection layer with RackSpace to fence off the attack, but these measures would take up to 3 hours to set up.
  • 20:20 UTC Intentionally brought down the portal website to free up resources for core monitoring services.
  • 21:03 UTC Identified the target IP addresses and brought those down.
  • 21:10 UTC Rerouted all services on the identified IPs elsewhere.
  • 21:10 UTC Verified pending alerts from the last 30 minutes were now being sent out correctly.
  • 21:30 UTC Brought back the web services excluding the targeted IPs.
  • 22:56 UTC Brought back affected Jabber services and verified XMPP alerts being sent out.
  • 09:15 UTC (August 19) Fixed a redirect problem on the watchmouse.com domain.
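
The blacklisting steps in the timeline (coarse /16 ranges first, then finer /24 subnets) come down to checking each source address against a list of networks. Below is a minimal sketch of that check in Python, using only the standard `ipaddress` module. The networks are placeholder ranges chosen for illustration, not the ranges involved in this incident, and the function names are hypothetical. It does not describe how the blacklist was actually implemented at our hosting provider.

```python
import ipaddress

# Hypothetical blacklist: placeholder ranges, NOT the ones from the incident.
BLACKLIST = [
    ipaddress.ip_network("10.20.0.0/16"),     # coarse /16 block
    ipaddress.ip_network("198.51.100.0/24"),  # finer /24 block
    ipaddress.ip_network("203.0.113.0/24"),   # finer /24 block
]


def is_blacklisted(address, blacklist=BLACKLIST):
    """Return True if the address falls inside any blacklisted network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in blacklist)


def collapse(blacklist):
    """Merge overlapping or adjacent networks to keep the list short."""
    return list(ipaddress.collapse_addresses(blacklist))
```

Collapsing matters when /24 entries are added continuously: any /24 that is already covered by an existing /16 disappears from the list instead of growing it.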

Thank you for your understanding. We will update this post as noteworthy events occur.

 
Pieter Ennes
Senior Director of Engineering Artificial Monitoring
Nimsoft / CA Technologies (formerly WatchMouse)

Widget Lets Joomla Users Easily Publish Information about Availability and Performance of Critical Services

Posted by admin on June 21st, 2011

Performance transparency is critical for small and large companies alike, which is why we’re pleased to announce a new feature of our WatchMouse monitoring services today: the WatchMouse Joomla widget!

The new widget enables Joomla users to easily publish their WatchMouse Public Status Pages within the Joomla CMS by simply installing an open source component and module. The Joomla component uses the WatchMouse API to download the monitoring results and push them directly to a Joomla website, letting users display live availability and performance information on their Joomla-built website.

The new WatchMouse Joomla widget allows Joomla users, developers and site designers to:

  • Publish hourly, daily or weekly availability and performance data
  • Display data using a range of maps, charts and graphs
  • Adjust the look and feel using CSS or use a selection of pre-existing styles which can be tweaked

View a live sample

Providing a simple way for Joomla users to display the status of their critical services can give a company of any size immediate transparency with its users. We aim to create and introduce more Public Status Page widgets for platforms that, like Joomla, are the backbone of millions of websites, including Tumblr, WordPress, Blogger and more.

A WatchMouse Public Status Page (free to WatchMouse subscribers) is a web page that informs customers about the status of a website or service. It can reduce costly customer service interactions and create goodwill with end users. A Public Status Page shows the current status of a specified selection of online services and can display updates and public announcements for customers. The pages are hosted on the Amazon cloud infrastructure, ensuring that a company’s status pages are highly scalable. It also ensures that status pages continue to be publicly available even if a company’s main site or service is not.

To get started:

  • Sign up for a free 30-day trial or log into your existing WatchMouse account
  • Set up a Public Status Page following the instructions published at the bottom of this page
  • Download the widget from our Joomla page
  • Log in to your Joomla site and navigate to Extensions -> Install/Uninstall
  • Click Browse, locate the component’s zip file and click the Upload File & Install button
  • Click Browse again and locate the module’s zip file and click the Upload File & Install button
  • Your installation is complete. Navigate to Components -> Watchmouse PSP Widget and check our tutorial

WatchMouse Weekly #2: Tweaking Performance Indicators In Public Status Pages

Posted by simone on March 1st, 2011

Setting up a WatchMouse Public Status Page is a simple task performed from the WatchMouse website. A few articles that walk through the whole procedure can be found at http://www.watchmouse.com/en/feature/public-status-page.html, and the User Manual can be downloaded from http://www.watchmouse.com/assets/docs/WatchMouse_PSP_Guide.pdf.

What might not be obvious is the logic behind the Public Status Page that indicates performance issues or a service disruption. In this post, I will reveal this little secret and show you how to tweak the algorithm.

Two parameters are mainly taken into account when measuring the performance of a monitor: “first limit” and “second limit”. Both parameters can be configured in the monitor setup pages under the “monitoring” dashboard, after switching to “expert mode”.

If the total time of a public monitor stays below the first limit, the server is performing well. If it falls between the first and second limits, the server is considered to perform poorly. Above the second limit, the performance is considered bad.
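
The rule above can be sketched as a small function. This is only an illustration of the logic as described in this post, not the actual WatchMouse code: the function name and the status labels are made up, and the handling of a total time that lands exactly on a limit is an assumption.

```python
def classify_performance(total_time_ms, first_limit_ms, second_limit_ms):
    """Map a monitor's total time to a status using the two limits.

    Labels and exact-boundary behaviour are assumptions for illustration.
    """
    if first_limit_ms > second_limit_ms:
        raise ValueError("first limit must not exceed second limit")
    if total_time_ms < first_limit_ms:
        return "good"   # below the first limit: performing well
    if total_time_ms <= second_limit_ms:
        return "poor"   # between the two limits: performing poorly
    return "bad"        # above the second limit: bad performance
```

For instance, with a first limit of 5000 ms and a second limit of 8000 ms, a total time of 6000 ms would count as poor performance.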

A WatchMouse Public Status Page uses both these parameters to identify performance issues and service disruptions.
For the history, it compares the average total time of each day with those parameters. The current performance measurement is based on an exponentially weighted average of the most recent check results.
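
To give a feel for what an exponentially weighted average does, here is a minimal sketch. The smoothing factor `alpha` and the function name are assumptions chosen for illustration; the post does not state which weighting the Public Status Page uses internally.

```python
def exponential_weighted_average(total_times_ms, alpha=0.3):
    """Exponentially weighted average of check results, oldest first.

    `alpha` (0 < alpha <= 1) is an assumed smoothing factor: the higher
    it is, the more weight the most recent checks get.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the range (0, 1]")
    if not total_times_ms:
        raise ValueError("at least one check result is required")
    average = float(total_times_ms[0])
    for total_time in total_times_ms[1:]:
        # Blend the newest result with the running average.
        average = alpha * total_time + (1 - alpha) * average
    return average
```

Because recent checks weigh more, a sudden slowdown shows up in the current status quickly, while a single slow check among many fast ones fades out after a few more results.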

Setting up these parameters correctly is very important for your Public Status Page. Having them too low will result in a Public Status Page that continuously indicates performance issues, whereas having them too high will hide performance issues from your visitors, which they will eventually find out about anyway.
If you haven’t already tuned these parameters, I’d strongly recommend that you do so after considering the following tips:

  • Get to know your monitors; check the performance charts under the “reports” dashboard.
  • Set the first limit slightly higher than the average total time of your monitors.
  • Set the second limit close to the total time it takes to load during a high traffic period.

For example: if you see that the average page load for a specific monitor is 4 seconds, set the first performance limit to 5000ms and the second limit to 8000ms. You can always check your Public Status Page to ensure the performance icons reflect what you had in mind. If not, you now know how to fix it!
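
If you prefer to derive starting values from your own measurements, the tips above can be turned into a rough calculation. The sketch below is an assumption-laden illustration: the function name is hypothetical, the 25% margin is simply what the 4 second to 5000ms example implies, and using the mean of peak-period samples for the second limit is one possible reading of "close to the total time during a high traffic period".

```python
from statistics import mean


def suggest_limits(normal_times_ms, peak_times_ms, margin=1.25):
    """Suggest (first_limit, second_limit) in milliseconds.

    first limit  = average normal total time plus an assumed margin
    second limit = average total time seen during high-traffic periods
    """
    first_limit = round(mean(normal_times_ms) * margin)
    second_limit = round(mean(peak_times_ms))
    # Keep the limits ordered even if the peak samples are unusually fast.
    return first_limit, max(second_limit, first_limit)
```

Treat the result as a starting point only, and check your Public Status Page afterwards to see whether the icons match your expectations.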

For any questions or assistance just leave a comment or contact us through the help desk.

Post by Dimitris Balaouras. I’m the Lead Programmer at WatchMouse. I joined this great team of nerds back in 2006 and I have remained a true fan of WatchMouse ever since. Passionate about software engineering, I enjoy programming more than anything. I’m based in Greece and recently moved from the crowded Athens to Larisa, a small town in Northern Greece where I can code in peace :-)

WatchMouse Public Status Pages improved

Posted by mark on February 7th, 2011

Public Status Health Dashboard 4.0 released

Over the weekend we had a major release of our Public Status Pages. I’m very excited about the improvements, both on the back end and in functionality for our customers.

In this article I’d like to walk you through the improvements and invite you to share your suggestions for the next release.

The WatchMouse Public Status Pages

For those of you not yet familiar with our Public Status Pages, I have included a short summary of the what, why, and who.

What is a Public Status Page?

A WatchMouse Public Status Page enables your organisation to display information about the availability and performance of your critical services. You can post announcements, annotate current issues, and optionally set up a special host name (CNAME) so people can access the status page on your domain, e.g. status.yourdomain.com. It is an easy control channel through which you can transparently inform visitors about the status of your sites and web services.

All WatchMouse Public Status Pages are hosted on Amazon’s Cloud infrastructure so they are available even if your site or service is not. Read more here.

Why are Public Status Pages important?

The single most important reason to have a Public Status Page or Health Dashboard is to have communication channels in place well before a ‘crisis’ strikes. Find out more about why you need a status page in another article on this blog, “Transparency is Critical When Sites #FAIL”.

Who is using Public Status Pages?

Here is a list of some of our better-known customers using the WatchMouse Public Status Pages:

More status dashboards (powered by WatchMouse and others too) can be found here.

Improvements in release 4.0

So what is improved in this new release?

  • New powerful architecture and storage engines, based on MongoDB
  • Highly available and even more scalable (still hosted in the AWS cloud)
  • Always up to date with the latest check results, instead of being updated only when a monitor’s status changes
  • ‘Moving’ uptime figures over last 24h instead of today’s uptime
  • Better “per country” indication, now averages over the last N checks
  • Interactive charts, powered by the Google Visualization API
  • Zoom-able world map for more details in Europe
  • Clear daily uptime charts
  • Improved console
  • HTML support for public notes, including an HTML editor
  • SSL support
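
Two of the items above, the ‘moving’ uptime over the last 24h and the average over the last N checks, are easy to picture with a small sketch. This is not the code running behind the Public Status Pages: the function names are hypothetical, and counting uptime as the share of successful checks (rather than weighting by time) is an assumption made for illustration.

```python
from datetime import datetime, timedelta


def moving_uptime(checks, now, window=timedelta(hours=24)):
    """Percentage of successful checks in the window ending at `now`.

    `checks` is an iterable of (timestamp, ok) pairs. Returns None when
    no check falls inside the window.
    """
    recent = [ok for timestamp, ok in checks if now - window <= timestamp <= now]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)


def average_last_n(total_times_ms, n):
    """Average total time over the last `n` checks (oldest first)."""
    if n <= 0:
        raise ValueError("n must be positive")
    recent = total_times_ms[-n:]
    return sum(recent) / len(recent) if recent else None
```

Unlike a figure for "today", the moving window never resets at midnight, so an outage late in the evening stays visible in the uptime figure for a full 24 hours.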

What’s next? Your opinion counts!

Some ideas we already have and are working on for the next release of the WatchMouse Public Status Pages are:

  • Easier (self service) customization directly through the Public Status Pages console
  • Browsing back in time (for all charts and the history section). The back-end system is ready; we are now working on the front end
  • Long term (monthly) charts
  • Private Status Pages (only accessible by authenticated users)
  • Real Time Status Pages (Comet/WebSockets support)
  • Public Status Widgets for easy integration into many popular blogging engines.

So what would you like to see in our next release? Please let us know in the comments, or contact us by creating a helpdesk ticket.

Mark Pors
CTO & co-founder