Foreword

This post is inspired by my last two years of work, my first time in an environment that was both Agile-ish and Devops. I’m not a WebOps guy, but I was involved in some minor jobs related to it.

What is Devops?

Devops is a culture, not a person, position, framework or application. It is a way for Developers and Operations to work together. Developers don’t throw their executables over the fence and yell “it’s in your yard now!”. In Devops both teams work as one. Since operations and operations automation are a different kind of work from classic software development, it’s rare to have a fully cross-functional team as Scrum sees it, but both sides should still have low-latency communication with each other, ideally by sharing a workspace.

Tech involved

The project I worked on was for a major British agency. We wanted to adopt the philosophy of micro-services, which is the current flavor of the month.

Since it’s a hybrid Java/Scala stack, we went for Dropwizard: even though it doesn’t have 1.0 on its banner, it’s better battle-tested than most such frameworks out there. It has sound ideas behind it and a really nice config library which understands both YAML and JSON files. With the addition of health checks and metrics, it helps you monitor production applications.

For the customer-facing parts we are using Play! Framework 2 in the Scala variant.

The rest of this post revolves around these specific APIs.

Micro-services

Micro-services are about making little, single-responsibility, horizontally scalable and stateless applications. They mostly come in the HTTP API flavor, through which they communicate with each other. They are easily deployable, and scaling is just a matter of spawning a new VM and putting it behind the load balancer. Since they are small, new developers should easily pick up what each service’s responsibilities are. You can read more on micro-services here.

Distributed failure

Some of the flows we had took 10 calls between services in various ways. This takes a lot of effort to debug, maintain and deploy to production.

Problems start when we have a flow of services A -> B -> C -> D -> E -> F (where -> is an HTTP request) and suddenly A gets a 500 from B. We jump on the B box to check what is happening and see logs filled with stack traces. Hey, I know the code: SearchController line 40 is throwing a ConnectionException. Probably because B has no connection to C, or maybe C has no connection to D and throws a 500 because of that? You get the drift.

As the developer who has worked with these services for 2 years, of course you know what is happening. But the people who deploy your services to a given environment might not. So the least you can do is catch these exceptions, give them meaningful descriptions of what might have happened and log the stack trace.

You probably think that’s old news, common knowledge, the basics. Well, I still see code like this in the wild:

Try(service.call(request)) match {
  case Success(result) => /* do your thing */
  case Failure(error) => throw new InternalApplicationException
}

Even though the error is supplied by the Failure case, it’s just ignored and we throw a totally new, uninformative InternalApplicationException, without an explanation or anything. It’s perfectly fine to fail, but at least give us the reason why, some feedback.

Try(service.call(request)) match {
  case Success(result) => /* do your thing */
  case Failure(connectionError: ConnectionException) =>
    logger.error(connectionError)
    throw new InternalApplicationException(
      s"Could not connect to the service under this url: ${service.url}",
      connectionError)

  case Failure(applicationError: InternalApplicationException) =>
    logger.error(applicationError)
    throw new InternalApplicationException(
      "The service experienced a failure",
      applicationError)

  // Any other failure would otherwise surface as a MatchError and hide the cause
  case Failure(otherError) =>
    logger.error(otherError)
    throw new InternalApplicationException(
      "Unexpected failure while calling the service",
      otherError)
}

Give some context on why the application failed. It is a favor to the Ops people, to support and to future you.

This is obviously a solution on a micro scale. In the grand scheme of things it would be good to have a way to gather failures immediately. You should aim for a centralized log store. It’s not ideal for services which were just deployed, but it will help identify issues in the scope of the whole solution. It’s an already solved problem and there are tools available for it.

Foolproof

Our services are Dropwizard based, and it does a great job of telling you when you have misconfigured the application.

Whenever you miss or misspell a configuration key, it tells you during startup which key is missing, and for a misspelled key it shows you the most likely key names.

Since it uses the javax.validation API for mapping the configuration onto the config object and validating it, you can create complex validation logic via code and annotations.

As for the Play application, it’s a different kettle of fish.

Play uses the Typesafe Config library. It’s really nice and I use it for my side projects, since it uses the HOCON format, which I find readable and easy to understand (arrays in YAML confuse me).

Accessing the configuration from the application code level is also easy.

play.api.Play.current.configuration.getString("foo.bar.baz")

To be honest, it’s cumbersome. This returns an Option[String] (FP heads will know). But many people from outside of the FP world will just call

val fooBarBaz: String = play.api.Play.current.configuration.getString("foo.bar.baz").get

Which will yield a NoSuchElementException at line XX if no value is found for this key. Of course, as the developer you know that somebody forgot to set the address of the RabbitMQ cluster or whatever auxiliary dependency the app requires.

But the guys priming your app in the production environment will not, and you will have to turn into their personal stack trace analyzer.

For this app, it’s best to wrap the configuration access and throw a meaningful error:

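object MyConfigurationWrapper {
  case class ConfigurationKeyMissingException(message: String) extends Exception(message)

  // Scala API: getString returns Option[String]
  private lazy val config = play.api.Play.current.configuration

  // Returns the value, or fails with an exception that names the missing key
  def getString(key: String): String =
    config.getString(key)
      .getOrElse(throw ConfigurationKeyMissingException(s"Missing configuration key: $key"))
}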

This way, whenever other people deploy your application and something goes wrong, they can actually see that it might be a config issue and that something is missing. You can even add some basic config validation.
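The basic config validation mentioned above could look roughly like this. It is only a sketch: StartupConfigCheck, MissingConfigurationException and the key names are hypothetical, and the lookup is passed in as a plain function so the check does not depend on any particular config library.

```scala
object StartupConfigCheck {
  // Hypothetical exception type; the message lists every missing key at once.
  final case class MissingConfigurationException(missingKeys: Seq[String])
    extends Exception(s"Missing configuration keys: ${missingKeys.mkString(", ")}")

  /** Returns the required keys for which `lookup` yields no value. */
  def missingKeys(required: Seq[String], lookup: String => Option[String]): Seq[String] =
    required.filter(key => lookup(key).isEmpty)

  /** Fails fast with a single exception that names all missing keys. */
  def requireAll(required: Seq[String], lookup: String => Option[String]): Unit = {
    val missing = missingKeys(required, lookup)
    if (missing.nonEmpty) throw MissingConfigurationException(missing)
  }
}
```

Called once at startup with something like `key => config.getString(key)` as the lookup, this reports every missing key in one go instead of failing on the first one the code happens to touch.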

Modularized configuration

Another thing about configuration is to make parts of it easily interchangeable. If you are working on a distributed solution and know that a centralized configuration store such as ZooKeeper will not be part of the MVP, you had better prepare your application to support multiple config files.

On the bright side here is Typesafe’s config library. It lets you do substitutions and include files within each other, which we abused to separate application logic configuration from environment configuration.
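To give a rough idea of that split, here is a minimal sketch using the Typesafe Config API (it needs the library on the classpath, which a Play application already has). The keys and values are made up for illustration, and the two parts are parsed from strings only to keep the example self-contained.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object LayeredConfigSketch {
  // Environment-specific values, the part Ops would own (hypothetical keys).
  val environment: Config = ConfigFactory.parseString(
    """
      |rabbitmq.host = "mq.staging.example"
      |""".stripMargin)

  // Application logic settings; the substitution points at the environment part.
  val application: Config = ConfigFactory.parseString(
    """
      |search.queue = "search-requests"
      |search.broker = "amqp://"${rabbitmq.host}
      |""".stripMargin)

  // Substitutions are looked up when resolve() is called on the merged config.
  val merged: Config = application.withFallback(environment).resolve()
}
```

With real files, the same layering can be expressed with an `include` line in the application file that pulls in the environment file, so Ops only ever has to touch the latter.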

As for Dropwizard… well, the guys behind Dropwizard are pretty adamant about having a single configuration file. They won’t even let you use a logback.xml file. So the Ops team needs to generate a whole config file with bits of application logic in it, which isn’t ideal.

Location, location, location

We once had an issue where one of our applications didn’t work in the CI environments. It just stopped and did nothing.

That’s because we develop on powerful i7 Macs, while all VMs in the target environments are single core. The app in question was based on Akka, and a hard-coded misconfiguration led to thread starvation: Actor A wouldn’t give up its thread to the other actors.
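The effect is easy to reproduce outside Akka. The sketch below is not our actual code: it uses a plain JDK thread pool as a stand-in for a dispatcher, and StarvationDemo is a made-up name. A pool of one thread plays the role of the single-core VM, and a task that blocks plays the role of Actor A.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object StarvationDemo {
  /** Returns true if the second task got to run while the first one was still blocking. */
  def secondTaskRuns(poolSize: Int): Boolean = {
    val pool = Executors.newFixedThreadPool(poolSize)
    val release = new CountDownLatch(1)
    val secondRan = new CountDownLatch(1)
    try {
      // First task holds on to its thread until it is released.
      pool.execute(new Runnable { def run(): Unit = release.await() })
      // Second task is trivial, but it needs a free thread to run on.
      pool.execute(new Runnable { def run(): Unit = secondRan.countDown() })
      // Wait briefly to see whether the second task was scheduled at all.
      secondRan.await(500, TimeUnit.MILLISECONDS)
    } finally {
      release.countDown() // unblock the first task so the pool can drain
      pool.shutdown()
    }
  }
}
```

With `secondTaskRuns(1)` the second task never gets a thread within the timeout, while `secondTaskRuns(2)` lets it through. On a multi-core laptop a pool sized from the core count hides exactly this kind of problem.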

That’s why, as soon as possible, you need to set up your personal VMs with environments as close as possible to those in production. You don’t need to create 1600 machines to emulate production 1:1.

One machine should suffice to see how the service likes it there; a second, third and fourth machine for load balancing would be handy.

This will help you see how services living on different VMs react when they lose connection, and whether the services really behave the same behind the load balancer. Hard-coded state and the like can make your service behave differently on 1 in 3 requests, which is not fun to troubleshoot when you don’t have the security clearance to access said environment.
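As a toy illustration of the 1 in 3 effect, here is a small simulation. Everything in it is hypothetical: RoundRobinSketch, the Instance type and the greeting field simply stand in for three copies of a service behind a round-robin load balancer, one of which carries different baked-in state.

```scala
object RoundRobinSketch {
  // The greeting stands in for any state that is baked into a single instance.
  final case class Instance(name: String, greeting: String) {
    def handle(user: String): String = s"$greeting, $user"
  }

  /** Sends `requests` identical requests through a round-robin balancer and collects the responses. */
  def responses(instances: Vector[Instance], requests: Int): Vector[String] =
    Vector.tabulate(requests)(i => instances(i % instances.size).handle("alice"))
}
```

With two instances saying "Hello" and a third saying "Hi", every third response differs even though the requests are identical, which is the pattern you would see from the outside.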

Don’t bother playing with VirtualBox and manually installing OSes on VMs. Use Vagrant.

In a perfect world, each developer should be able to download the whole stack and run it on their own machine in a virtualised environment.

Package delivered

Since the target systems use YUM for package management, we incorporated building RPM packages into our build process, using the Maven RPM Plugin. Following the advice from the previous section, you can test the end artifact (.rpm) in an environment you just set up on your machine.

This way you rapidly get feedback on how your package behaves in a target environment. Now you have full control over how your application gets to production and how it behaves there.

Conclusion

You should think of Operations as your first customers. Being deployable is a key feature of every piece of software. The way you run the application, what gets logged, the configurability of the log format and the appenders available in the application are all features which Operations will use to deploy and maintain it.