A history of a DDoS attack – How my server died

This blog post is a postmortem of my infrastructure, which was attacked on Sunday by an attacker from Argentina and went down because of a DDoS. I will share all the actions I took to bring the services back to a stable state.

Summary

Attack started: 19 June 2016 at 3:20 PM UTC

Attack ended: 19 June 2016 at 4:10 PM UTC

Users affected: 30-40 users

Extra cost due to the attack: less than $2

Existing Infrastructure

Let me give you a brief overview of the existing infrastructure of Helbreath Poland.

[Image: cloudcraft - Helbreath Poland]

In the image above you can see the two components that make up my infrastructure: Route53 and one small t2 EC2 instance.

Let’s agree on something: it is not a complicated or “big” infrastructure, but for this purpose it works perfectly. Right now the server has 30-40 people playing at any given moment, and with this VM we could possibly take 100 or 150 more. For now it is fine!

DDoS, what is that?!

In the simplest terms, a DDoS is a type of attack that sends a lot of data from lots of places (computers). It is called a distributed attack because computers in different parts of the world can be used to attack somebody's infrastructure.

Aggressors send so much data that your infrastructure can’t handle all the incoming packets, and eventually it stops working or access becomes very limited. This type of attack doesn’t require much knowledge; almost anyone with access to the internet can prepare one. You can find a deeper explanation on Wikipedia.

In our case, they were attacking two ports: 321, 1 and 3007 over TCP.

Full story

The calm before the storm

I knew that something would happen because a player logged in to our server and threatened that he was going to destroy it. Well, about 5 minutes later people started getting lag, then more lag, and then a lot of them got disconnected from the server.

So it begins, my actions

The server was basically killed, the VM didn’t respond.

As I said earlier, people got lag and were disconnected. I started investigating and tried to log into my VM but… yeah, I couldn’t even do that. RDP wasn’t responding.

Decided to switch off all incoming traffic and allow only my IP.

I decided to switch off all incoming traffic, which means the VM was taken out of public availability. That way I cut off all incoming connections, both good and “dirty”.

I did that by changing a rule in the security group, as in the image below. The first rule (All TCP, Anywhere) has been removed.

[Image: security-group-changed]

With all traffic disallowed except from my computer, I could log in and maintain the VM: do backups, check the logs for what happened, and look for the attacker's IP address.
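To make the effect of that rule change concrete, here is a minimal sketch of how an allow-list like a security group decides what gets through. It is a plain Python model, not an AWS API call; the rule dictionaries, the `is_allowed` helper and all IP addresses (taken from documentation ranges) are assumptions for illustration.

```python
import ipaddress

# Hypothetical rule sets: before the attack (open to everyone) and after the lockdown.
OPEN_RULES = [{"protocol": "tcp", "ports": (0, 65535), "cidr": "0.0.0.0/0"}]
LOCKED_RULES = [{"protocol": "tcp", "ports": (0, 65535), "cidr": "198.51.100.7/32"}]


def is_allowed(rules, source_ip, port, protocol="tcp"):
    """Allow-list semantics: traffic passes only if at least one rule matches."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        low, high = rule["ports"]
        if (rule["protocol"] == protocol
                and low <= port <= high
                and address in ipaddress.ip_network(rule["cidr"])):
            return True
    return False  # nothing matched, so the packet is dropped


print(is_allowed(OPEN_RULES, "203.0.113.50", 3007))    # True: anyone could connect
print(is_allowed(LOCKED_RULES, "203.0.113.50", 3007))  # False: players and attackers are both cut off
print(is_allowed(LOCKED_RULES, "198.51.100.7", 3389))  # True: only the admin address gets in
```

The second line of output is the price of this step: with a single rule for one address, legitimate players are locked out together with the attacker.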

Gradually open traffic, but it seems there is still an issue.

The next decision I took was to start gradually allowing external traffic into my infrastructure.

But as you can see below, there was a second hit, even greater. Between 15:30 and 16:00 it was quite calm, but when I allowed traffic again around 16:00 there was a big bang.

[Image: network-in-attack]

Again, I repeated the first step: I switched off everything, and let’s wait…

Check IP addresses of attackers

In the meantime, I was looking for the IP address(es) of the attackers, and I found that the attacker was from Argentina.

Add an entry to the ACL with the IP address, decided to block the attacker's entire subnet

To block any dirty connections, I updated the network ACL that filters incoming and outgoing traffic. I decided that the safest option was to block their whole subnet.

[Image: acl-blocked-entire-subnet]
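Going from a single attacker address to "their whole subnet" can be sketched with Python's standard `ipaddress` module. The attacker address below is hypothetical (a documentation range), and the /24 prefix length is an assumption: the real block should use whatever range the address actually belongs to.

```python
import ipaddress


def subnet_of(ip, prefix_length=24):
    """Return the network that contains `ip` (the /24 default is an assumption)."""
    # strict=False accepts a host address and masks off the host bits.
    return ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)


def is_denied(source_ip, denied_networks):
    """Deny-list semantics: drop the packet if it comes from any blocked network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in denied_networks)


attacker_ip = "203.0.113.77"  # hypothetical, not the real attacker
denied = [subnet_of(attacker_ip)]

print(denied[0])                           # 203.0.113.0/24
print(is_denied("203.0.113.200", denied))  # True: same subnet, different host
print(is_denied("198.51.100.7", denied))   # False: unrelated address
```

Blocking the whole range also catches the attacker when they switch to a neighbouring address, at the cost of locking out any legitimate player who happens to share that subnet.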

Again, start letting incoming traffic into the infrastructure.

Finally, around 16:05, I once again started letting people into the server.

You can see the incoming network traffic in the image below. From about 16:05 to 16:10 the amount of data sent is at a fairly OK level compared with what it was 20 minutes before.

People can log in and they don’t have any problems with the game.

[Image: healthly-situation]

Problem solved, what next?

That was my quick story of what happened to me on Sunday afternoon. The problem is solved, but what about further actions to prevent it, or maybe a failover plan that would at least allow people to keep playing?

Introduce Load Balancer

First of all, I have to introduce a load balancer (ELB), even for this one VM. In the future, if I notice that an attack is incoming, I can immediately spin up a fresh VM and redirect every player to that box. In the meantime, I get some extra time to deal with the attackers.

[Image: cloudcraft - Helbreath Poland V2]

Let’s imagine that an attack is incoming and the middle box is affected, so I immediately spin up two VMs and fire up the services on these boxes.

Because players connect via DNS, which will point at the ELB, they will be automatically redirected to healthy instances. Of course, this won’t help if the attack is really, really serious and they attack DNS directly.
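The failover idea can be sketched as a tiny round-robin balancer that skips unhealthy instances. This is only a toy model of the behaviour described above, not how ELB is implemented; the class name and instance names are hypothetical.

```python
class TinyBalancer:
    """Toy round-robin balancer that only hands out healthy instances."""

    def __init__(self):
        self.instances = []   # instance names in registration order
        self.healthy = set()  # names currently passing health checks
        self._next = 0        # round-robin cursor

    def register(self, name):
        self.instances.append(name)
        self.healthy.add(name)

    def mark_unhealthy(self, name):
        self.healthy.discard(name)

    def pick(self):
        # Try each registered instance at most once per call.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances registered")


balancer = TinyBalancer()
balancer.register("game-1")
first = balancer.pick()             # the only box takes all players

balancer.mark_unhealthy("game-1")   # the attack takes the middle box down
balancer.register("game-2")         # two fresh VMs are spun up
balancer.register("game-3")
picks = [balancer.pick() for _ in range(3)]
print(first, picks)
```

Players keep using the same DNS name the whole time; only the set of boxes behind it changes.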

Monitor incoming connections to get a better overview.

This is very important! My infrastructure didn’t have this at the time of the attack. If the attacker hadn’t logged into the game and threatened us, I wouldn’t have known his IP address explicitly. That could have complicated things, and it probably would have taken me much more time to solve this issue.

This is where Flow Logs for your VPC in AWS come to help. They monitor and log the IP traffic going to and from your infrastructure. That way, if they attack again from a different subnet, I can get their IP addresses from the logs and then block that subnet's traffic to my infrastructure.

[Image: flow-logs-vpc]
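Once flow logs exist, finding the noisiest sources is mostly a matter of parsing and summing. The sketch below assumes the default space-separated flow log record format (version 2); the sample records, account ID, interface ID and addresses are made up for illustration.

```python
from collections import Counter

# Field order of the default flow log record format (version 2).
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def parse_record(line):
    """Turn one log line into a dict, or None for malformed/NODATA records."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, values))
    if not record["bytes"].isdigit():  # NODATA/SKIPDATA records carry "-" here
        return None
    return record


def top_talkers(lines, limit=3):
    """Sum bytes per source address and return the biggest senders first."""
    totals = Counter()
    for line in lines:
        record = parse_record(line)
        if record:
            totals[record["srcaddr"]] += int(record["bytes"])
    return totals.most_common(limit)


sample = [  # hypothetical records
    "2 123456789010 eni-0a1b2c3d 203.0.113.77 10.0.0.5 40000 3007 6 9000 5400000 1466349600 1466349660 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d 198.51.100.7 10.0.0.5 51000 3007 6 40 5200 1466349600 1466349660 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d 203.0.113.77 10.0.0.5 40001 3007 6 8000 4800000 1466349660 1466349720 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d - - - - - - - 1466349720 1466349780 - NODATA",
]

print(top_talkers(sample))
# [('203.0.113.77', 10200000), ('198.51.100.7', 5200)]
```

The address at the top of that list is the natural candidate for the subnet block described earlier.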

Set alarms on the usage of VM resources.

This part is also very important and plays nicely with the previous steps: on AWS you can set up alarms that fire when a specific resource goes beyond a certain threshold.

[Image: ddos-detector-alarm]

In my case, I created an alarm on the data sent to my VM by the external world. The alarm goes off when there is a spike of incoming data greater than 1GB a minute, and then it sends me an email notification. That way I can be aware of a possible attack, or of the big popularity of my server 😀, and jump into action.
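The logic behind such an alarm is a simple threshold comparison. Here is a minimal sketch, assuming per-minute datapoints of incoming bytes; the sample numbers and the `breaching_minutes` helper are hypothetical, and the real alarm is evaluated by AWS rather than by code like this.

```python
GIGABYTE = 1024 ** 3
MEGABYTE = 1024 ** 2


def breaching_minutes(datapoints, threshold=GIGABYTE):
    """Return the minutes whose incoming bytes are strictly above the threshold."""
    return [minute for minute, received in datapoints if received > threshold]


# Hypothetical per-minute incoming traffic, in bytes.
network_in = [
    ("15:58", 120 * MEGABYTE),
    ("15:59", 300 * MEGABYTE),
    ("16:00", 5 * GIGABYTE // 2),
    ("16:01", 3 * GIGABYTE),
]

spikes = breaching_minutes(network_in)
if spikes:
    # In the real setup this is where the email notification would be sent.
    print("ALARM: incoming traffic above 1GB/min at", ", ".join(spikes))
```

Picking the threshold is the hard part: too low and normal evening peaks wake you up, too high and the alarm fires only once the box is already unreachable.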

Closing words

To sum up, that was a really amazing experience, even if some players were affected and I was really pissed off. I treat this as a lesson, because I learnt about a lot of additional functionality on AWS and about the general ops approach to this kind of problem.

Refactoring a legacy GUI application to CLI

If you have ever wondered what it is like to work with 15-year-old legacy C++ code, and how to refactor it, this blog post is perfect for you 🙂

As you may remember, I promised to show you the work that I do for Helbreath. When I decided to work on it, the first decision I made was to try to get rid of this horrible GUI; that was an aperitif before I do more serious work.

Before we dive into the C++ code, let’s discuss why a GUI is evil in your backend server applications, shall we?

If backend service, CLI only!

Be clean

The first argument is that your code is much cleaner, because the program doesn’t contain unnecessary code responsible for drawing and handling your GUI. Additionally, you don’t mix the context of the GUI with the context of your service, which means you don’t have noise in your code.

Fewer resources and dependencies

Another very important argument: your server will need fewer resources to run your application. You can even run your OS without a GUI, in headless mode.

But wait, what about fewer dependencies? If you don’t have a GUI, your code immediately has fewer dependencies on external libraries, pure profit! That way you don’t have to manage additional packages and worry that something won’t work on a “very special” environment or OS setting. Moreover, developers who want to work on the project are less likely to run into problems with the project setup.

Automation and ops work

The ultimate argument, which you have to read, applies to any software in production used by more than one person.

Having your service as a CLI helps a lot with ops work. With a CLI service you can automate everything from deployment and templating to the startup of your application, whereas with a GUI application you can’t do that very easily because of the manual steps involved.

Next important argument – remote access.

Ideally, you don’t need access to the whole server/VM/machine to maintain your application; instead, you can connect to the application remotely and manage it from your computer. This approach is more secure: we avoid direct access to the server, and we can also whitelist an IP address with a specific port.

Refactoring time

Old Way

Let’s move on to the services that keep the Helbreath server running.

[Image: old_way_hb_services]

Above you can see what the two services looked like before the refactoring: a horrible GUI and a lot of manual steps, such as providing the username/password for the database. Each time you restart the application you have to manually enter the database credentials. Now imagine that I do 20-30+ releases a day; it means I would have to waste my keystrokes each time.
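One common way to remove that manual step is to read the credentials from the environment at startup. The sketch below is in Python rather than the project's C++, and the variable names (`HB_DB_USER`, `HB_DB_PASSWORD`, `HB_DB_HOST`) are hypothetical, not taken from the Helbreath sources.

```python
import os

REQUIRED = ("HB_DB_USER", "HB_DB_PASSWORD")  # hypothetical variable names


def database_settings(environ=None):
    """Read database credentials from the environment instead of a dialog box."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        # Fail fast with a clear message; nobody is there to click "OK".
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "user": environ["HB_DB_USER"],
        "password": environ["HB_DB_PASSWORD"],
        "host": environ.get("HB_DB_HOST", "localhost"),
    }


# A plain dict stands in for the real environment so the example runs anywhere.
settings = database_settings({"HB_DB_USER": "hb", "HB_DB_PASSWORD": "secret"})
print(settings["user"], settings["host"])  # hb localhost
```

With something like this in place, a deployment script can export the variables once and restart the service as often as needed without anyone typing a password.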

New Way

[Image: new_way_hb_services]

On the other hand, the image above shows the current state of both services. It is much more beautiful, isn’t it?!

Pure console, with some output information and nothing else!

No dialogue boxes, no fucking buttons, no weird messages. Just pure console.

But HOLA! Wait! How can you see what is going on with your services?

Easy answer!

Logs, Luke, log everything!

In this case, I log everything to a file and then use nxlog to send it to Papertrail.

[Image: papertrail_log]
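The "log everything to a file" half of that setup can be sketched with Python's standard `logging` module; a shipper such as nxlog would then tail the resulting file. The logger name, log format and messages below are assumptions, and the actual services are written in C++.

```python
import logging
import os
import tempfile


def build_logger(path):
    """Create a logger that appends one timestamped line per event to `path`."""
    logger = logging.getLogger("hb-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()                   # avoid duplicate handlers on re-run
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False                  # keep the console clean
    return logger


# A temporary directory keeps the example self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "service.log")
logger = build_logger(log_path)
logger.info("client connected from %s", "198.51.100.7")
logger.warning("client disconnected: %s", "timeout")

with open(log_path, encoding="utf-8") as log_file:
    print(log_file.read())
```

One line per event with a timestamp and a level is usually all a log shipper needs to forward and search the output later.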

Now let’s check some code!

At the beginning I created a story on Github, just to have a place where I can track my work, and then a pull request.

This is very important: every refactoring in legacy code (this code is almost 15 years old) is big and difficult. With this example, I will try to share some of my strategies.

Do your research!

Spend a good amount of time analysing the code and its dependencies. For this specific problem, it took me 1-2 days to understand the problem and come up with a solution. This is very important for junior developers! You are a problem solver, not a code monkey; research is part of your job, so don’t worry if you spend one or two days researching something.

Since I wanted to get rid of the GUI, I had to check which parts of the code depend on GUI code or libs. As you can see here and here, I listed all the main places where the GUI sits.

Cheat, Wrap, Hide!

My next advice is, cheat if you can, don’t refactor everything at once!

Do small bits until your old code is so granular that you can understand the domain and rewrite it. In this commit, I wrapped all the things into a new class.

I cheated because it still has a GUI dependency (on the HWND class), but it’s hidden. At least it no longer has code for dialog boxes, buttons, etc.
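The "wrap and hide" move can be illustrated with a small sketch. It is written in Python rather than the project's C++, and every class here is a hypothetical stand-in: the point is only that the awkward legacy dependency lives in one wrapper, while the rest of the service talks to a clean interface.

```python
class LegacyWindowHandle:
    """Stand-in for the GUI handle that the old messaging code still needs."""

    def __init__(self):
        self.posted = []

    def post_message(self, message):
        self.posted.append(message)


class MessagePump:
    """Wrapper: the only class that knows the legacy handle exists."""

    def __init__(self, handle=None):
        self._handle = LegacyWindowHandle() if handle is None else handle

    def notify_socket_event(self, socket_id, event):
        # The ugly dependency is used here and nowhere else.
        self._handle.post_message((socket_id, event))


class GameServer:
    """Service code depends on the wrapper, not on anything GUI-shaped."""

    def __init__(self, pump):
        self._pump = pump

    def on_client_connected(self, socket_id):
        self._pump.notify_socket_event(socket_id, "connected")


handle = LegacyWindowHandle()
server = GameServer(MessagePump(handle))
server.on_client_connected(7)
print(handle.posted)  # [(7, 'connected')]
```

Later, the wrapper's internals can be swapped for something that needs no window handle at all, and the service code does not have to change.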

Remove, Remove, Remove!

The most enjoyable part is removing unused code. Here, as you can see, I removed a lot of graphics-drawing-specific stuff that is not in use anymore.

At the end of this refactoring, I still have some dependencies on the GUI, mostly on the HWND class, but they are necessary to run the service because the old Windows messaging uses this library to make async calls over TCP/UDP. Yeah, you read that correctly: messaging requires a GUI dependency, total madness. I ended up with fairly OK refactored code: I don’t need any manual steps before running the server, everything is automated. I am OK with that for now.

Helbreath Poland project

[Image: hb_blogpost]

What is Helbreath?

Helbreath is an MMO game created back in 1999 by a Korean studio, Siementech; it seems that they are dead 🙁

At the same time there was a group of people who created open source code for this game, both server and client side.

It is very important to understand that those sources were developed in 1999/2000, so some approaches were really good at that time. Now some of these approaches may be obsolete.

What am I doing?

I took the sources developed ages ago by the community, put them on Github, and started fixing issues, refactoring some code and adjusting it to standards. You can expect a series of blog posts on this topic.

Why am I doing this?

Well, it is quite personal, because I started my programming journey with this game. Back when I was 13 or 14, it was my first MMO game, and then I decided that I wanted to run my own server (this is an archived website of the server back in 2006).

Well, at that time I didn’t know that I would have to learn programming to even start my own server. I downloaded the sources and yeah… I had to learn C++.

I read a few C++ tutorials and it was a painful journey, like really painful. As far as I can remember, the most painful part was understanding classes, objects, references and pointers.

It took me about two weeks to set up Visual C++ 6.0 (yeah, something like that exists), and then I immersed myself in C++, even to the point that I stopped studying at school (I almost didn’t pass to the next class), because every second I was thinking about programming and my “server”.

What is my goal?

First of all, I want to clean up and refactor the current sources and fix all critical and major issues/bugs.

This series of blog posts can be somewhat of a guide for junior developers, because I will show you a few things that you shouldn’t do when writing your applications.

Then, I want to run a server for people to play on, check its performance, and give something back to the HB community. And of course I am a big fan of this game, so I will play 🙂

My very, very end goal is to have at least one component of the server rewritten in any language so that it can run on Linux. Moreover, the domain of the whole game is not clear, and I want to write down documentation and get more readable code.

Conclusion

Stay tuned because a lot of content is coming! I spent the last month making it happen, and I did a few snap-storms on my Snapchat about it. It’s going to be amazing to see this transition.