Coffee Space


Git Page

TL;DR

I created a simple open-source Git server that’s live and is implemented in a handful of Java class files.

Problem

We want an open source Git server that’s cross-platform, simple (handful of source files) and fast. We want to offer some basic functionality to share and work on one or more projects, but not have to rely on some external service. We should put extra emphasis on simplicity - you should be able to understand the whole project quite quickly. It’s one thing to have an open source project, it’s quite another to have an understandable open source project.

Existing Solutions

Here are a few existing solutions and why I didn’t consider them useful:

Initial Implementation

Rapid prototype

This is the initial implementation coded over ~2 days, just as a proof of concept. I really didn’t start this with any real kind of plan in mind - and wasn’t entirely sure what issues may occur in the project. The following are some implementation points.

Basic Java Server

The web server is very simple and is based on the Java ServerSocket and Socket classes. A connection comes in and is offloaded onto its own new Thread as quickly as possible. A thread pool would be an option, but this version aims to be as simple as possible. We assume for now that we’re not going to be handling more than 1000 requests a second (more likely one request per second).

On this new Thread the socket connection is read up to a reasonable maximum number of bytes and the request URL is pulled out. For now, we don’t care about any other part of the header and only support GET requests (everything is treated as a GET request). We then send this request URL off to the PageBuilder class.
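As a rough sketch of the thread-per-connection idea described above (this is not the project's actual code; the class name `TinyServer`, the 4096-byte cap and the plain-text response are assumptions made for illustration):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class TinyServer {
    // Assumed "reasonable maximum" for one request; the real limit is not stated.
    static final int MAX_REQUEST_BYTES = 4096;

    /** Pulls the URL out of a request such as "GET /repo/log HTTP/1.1". */
    static String requestUrl(String request) {
        int lineEnd = request.indexOf('\n');
        String firstLine = (lineEnd < 0 ? request : request.substring(0, lineEnd)).trim();
        String[] parts = firstLine.split("\\s+");
        // The method (parts[0]) is ignored: everything is treated as a GET.
        return parts.length >= 2 ? parts[1] : "/";
    }

    /** Handles one connection: bounded read, extract URL, write a response. */
    static void handle(Socket socket) {
        try (Socket s = socket) {
            byte[] buffer = new byte[MAX_REQUEST_BYTES];
            InputStream in = s.getInputStream();
            int read = in.read(buffer); // a single bounded read, no header parsing
            String request = read > 0
                    ? new String(buffer, 0, read, StandardCharsets.ISO_8859_1)
                    : "";
            // A real server would hand the URL to a page builder here.
            byte[] body = ("Requested: " + requestUrl(request))
                    .getBytes(StandardCharsets.UTF_8);
            String head = "HTTP/1.1 200 OK\r\n"
                    + "Content-Type: text/plain; charset=utf-8\r\n"
                    + "Content-Length: " + body.length + "\r\n"
                    + "Connection: close\r\n\r\n";
            OutputStream out = s.getOutputStream();
            out.write(head.getBytes(StandardCharsets.ISO_8859_1));
            out.write(body);
            out.flush();
        } catch (IOException e) {
            System.err.println("Request failed: " + e.getMessage());
        }
    }

    /** Accepts connections forever, giving each one its own new Thread. */
    static void serve(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                Socket socket = server.accept();
                new Thread(() -> handle(socket)).start();
            }
        }
    }
}
```

Calling `TinyServer.serve(8080)` would block and answer every request with the URL it saw. Swapping `new Thread(...)` for an executor is the obvious next step if the load ever justified it.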

Page Builder

The responsibility of PageBuilder is - well - to build the page. It tries to figure out what the user is requesting (as quickly as it can) and return the required data. This often means sending the request off to the Git class, wrapping the result in some HTML and then sending it back to the user.

This was quite slow, with some requests taking between ~50ms and ~250ms. Clearly this won’t be very scalable for a simple potato web server.

Git CLI Wrapper

So now to discuss how it accessed Git. This was really quite simple, as it executes a Process and reads the standard output. It then breaks this String down and gives it back to the PageBuilder to format and send back to the user.

Whilst this method works, it’s incredibly slow and accounted for a lot of time being wasted. To give an idea, approximately the following would be done per request:

  • Create and start a Process handler in Java.
  • Find and load the Git binary into RAM.
  • Git binary reads the repository from disk, decompressing it where needed and then executing the command on it.
  • The Process ends, and only then is its standard output read.
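The steps above can be sketched roughly as follows. This is a minimal illustration rather than the original wrapper; the class name `GitCli` and the choice of `git log --oneline` as the example command are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class GitCli {
    /** Runs a command in workDir (null = current directory) and returns its output. */
    static String run(File workDir, List<String> command) {
        try {
            // Every call pays for a brand new operating system process.
            Process process = new ProcessBuilder(command)
                    .directory(workDir)
                    .redirectErrorStream(true) // fold stderr into stdout
                    .start();
            ByteArrayOutputStream captured = new ByteArrayOutputStream();
            try (InputStream out = process.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = out.read(chunk)) != -1) {
                    captured.write(chunk, 0, n);
                }
            }
            process.waitFor();
            return new String(captured.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for process", e);
        }
    }

    /** Argument list for a one-line-per-commit log (example command only). */
    static List<String> logCommand(int maxCommits) {
        return Arrays.asList("git", "log", "--oneline", "-n", String.valueOf(maxCommits));
    }

    /** Breaks the raw output down into one String per commit. */
    static String[] log(File repository, int maxCommits) {
        return run(repository, logCommand(maxCommits)).split("\n");
    }
}
```

Something like `GitCli.log(new File("/path/to/repo"), 20)` would return the last twenty commits, at the cost of a full process launch for every page view.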

There should be a better way to do this…

Current Implementation

Alternative title: A better way to do this.

Web-ready version

This is the current implementation, and as you can see it’s a lot more refined. Despite having more features, in some use-cases the pages can be served more than 1000 times faster due to the various implementations underneath. In the following I will go over some of the larger changes made over the course of a few more days.

JSON Parser

First things first, a JSON parser written from scratch. Why? Because it was a great learning experience and it seemed crazy to use somebody’s project just for a one-off parse of some JSON data. I’ve never written a parser before, but the implementation was reasonably simple (minus a few hiccups). It’s not fully tested as of yet, but it does allow for some crazy things like this:

{
  /* Comment here ignored */
  "test": "123"
}

Every value is treated as a String (which makes no real difference, since a JSON number is just text until something converts it anyway).

It’s usable for this project, but is a long way off from being an industry standard parser. One thing I hate about other Java JSON parsers is that they like to die all the time during parsing if there’s even one small thing out of place. Generally I like my parsers to log/flag a warning and try to limp on. I’ll likely put this class in its own repository at some point as it’s a very interesting implementation if nothing else.
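To make the "warn and limp on" idea concrete, here is a small sketch of a forgiving parser for a flat JSON object. It is not the project's parser: the class name `LenientJson`, the flat (non-nested) scope and the warning messages are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LenientJson {
    private final String src;
    private final List<String> warnings;
    private int pos = 0;

    private LenientJson(String src, List<String> warnings) {
        this.src = src;
        this.warnings = warnings;
    }

    /** Parses a flat object; every value is a String, problems become warnings. */
    static Map<String, String> parseFlat(String src, List<String> warnings) {
        return new LenientJson(src, warnings).parse();
    }

    private Map<String, String> parse() {
        Map<String, String> result = new LinkedHashMap<>();
        String pendingKey = null;
        while (true) {
            skipWhitespaceAndComments();
            if (pos >= src.length()) break;
            char c = src.charAt(pos);
            if (c == '{' || c == ':') {
                pos++; // structure is tolerated rather than enforced
            } else if (c == ',' || c == '}') {
                if (pendingKey != null) {
                    warnings.add("Key without value: " + pendingKey);
                    pendingKey = null;
                }
                pos++;
            } else {
                // Tokens simply alternate: key, value, key, value, ...
                String token = (c == '"') ? readQuoted() : readBare();
                if (pendingKey == null) {
                    pendingKey = token;
                } else {
                    result.put(pendingKey, token);
                    pendingKey = null;
                }
            }
        }
        if (pendingKey != null) warnings.add("Key without value: " + pendingKey);
        return result;
    }

    private void skipWhitespaceAndComments() {
        while (pos < src.length()) {
            if (Character.isWhitespace(src.charAt(pos))) {
                pos++;
            } else if (src.startsWith("/*", pos)) {
                int end = src.indexOf("*/", pos + 2);
                if (end < 0) {
                    warnings.add("Unterminated comment at " + pos);
                    pos = src.length();
                } else {
                    pos = end + 2;
                }
            } else {
                return;
            }
        }
    }

    private String readQuoted() {
        StringBuilder sb = new StringBuilder();
        pos++; // skip the opening quote
        while (pos < src.length() && src.charAt(pos) != '"') {
            char c = src.charAt(pos++);
            // A backslash keeps the next character literally (no \n, \u handling).
            if (c == '\\' && pos < src.length()) c = src.charAt(pos++);
            sb.append(c);
        }
        if (pos >= src.length()) warnings.add("Unterminated string");
        else pos++; // skip the closing quote
        return sb.toString();
    }

    private String readBare() {
        int start = pos;
        while (pos < src.length() && " \t\r\n,:{}".indexOf(src.charAt(pos)) < 0) pos++;
        return src.substring(start, pos);
    }
}
```

Feeding it the commented example above yields a single `test` entry with the value `123` and no warnings, while an input with a dangling key still returns whatever could be salvaged plus a warning.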

Maintenance Thread

The Maintenance class is a thread that simply checks for changes in the remote repository and pulls them in if needed. It then invalidates the repository update time, which in turn invalidates the cache for any items related to that repository. This part works quite well.
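A minimal sketch of such a loop might look like the following. The `pullIfChanged` callback is a hypothetical stand-in for however the remote is actually checked and pulled, and the shared `AtomicLong` timestamp is an assumption about how the update time is stored.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

final class Maintenance implements Runnable {
    // Hypothetical hook: pulls from the remote and returns true if anything changed.
    private final BooleanSupplier pullIfChanged;
    // Shared "repository last updated" stamp that cached pages are compared against.
    private final AtomicLong lastUpdated;
    private final long intervalMillis;

    Maintenance(BooleanSupplier pullIfChanged, AtomicLong lastUpdated, long intervalMillis) {
        this.pullIfChanged = pullIfChanged;
        this.lastUpdated = lastUpdated;
        this.intervalMillis = intervalMillis;
    }

    /** One maintenance pass; returns true if the repository was refreshed. */
    boolean checkOnce() {
        if (!pullIfChanged.getAsBoolean()) {
            return false;
        }
        // Bumping the stamp is what makes older cache entries stale.
        lastUpdated.set(System.currentTimeMillis());
        return true;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                checkOnce();
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop cleanly when asked
            }
        }
    }
}
```

It would typically be started once as a daemon thread, for example `Thread t = new Thread(new Maintenance(check, stamp, 60_000)); t.setDaemon(true); t.start();`.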

Page Builder Improvements

As you can see, it now has a snazzy little logo (SVG) and some better CSS. All of this is also loaded nicely from the configuration file. It really makes the project look more modern (if I don’t blow my own horn too much).

Pages are now built faster. There were some major changes to make this possible:

  • The StringBuilder is very cool and fast, much faster than +’ing Strings together by quite some margin.
  • Escaping the Strings was a lot faster once we made a single pass over the String and inspected it character by character (rather than making three separate passes for &, < and >).
  • Converting a String to bytes and writing to the OutputStream turns out to be really expensive, so we try to minimize how much we do it.
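The single-pass escaping point can be illustrated with a short sketch (the class and method names here are made up for the example, not taken from the project):

```java
final class HtmlEscape {
    /** Escapes &, < and > in one pass instead of three replace calls. */
    static String escape(String text) {
        // Pre-size the builder with a little slack for the longer entities.
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because each character is looked at exactly once, an ampersand produced by escaping `<` can never be escaped a second time, which is a classic bug with chained replacements done in the wrong order.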

On top of this, we also now cache the content after it’s generated. This reduces content generation time massively. If a page is being shared and requested a lot, or somebody is flicking between a few pages, the time to serve it drops from ~50ms down to ~0ms. In theory it should be possible to avoid the hug of death from popular link sharing sites.
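One way such a cache could work is sketched below, under the assumption that each entry remembers the repository update stamp it was built against. `PageCache` and its method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class PageCache {
    private static final class Entry {
        final byte[] body;     // stored as bytes so the conversion happens once
        final long repoStamp;  // repository update stamp the page was built against

        Entry(byte[] body, long repoStamp) {
            this.body = body;
            this.repoStamp = repoStamp;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /**
     * Returns the cached page for url, rebuilding it only when there is no
     * entry yet or the repository has been updated since it was built.
     */
    byte[] get(String url, long repoUpdatedAt, Supplier<String> builder) {
        Entry entry = entries.get(url);
        if (entry == null || entry.repoStamp != repoUpdatedAt) {
            // Two threads may occasionally build the same page; that is harmless.
            byte[] body = builder.get().getBytes(StandardCharsets.UTF_8);
            entry = new Entry(body, repoUpdatedAt);
            entries.put(url, entry);
        }
        return entry.body;
    }
}
```

A cache hit is then just a map lookup and a comparison, and the bytes can be written straight to the OutputStream without touching Git or rebuilding any HTML.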

Git Repository Parser

I initially looked to use JavaGit, but the project is abandoned, it’s unclear if it still works and it is effectively just a wrapper around the command line anyway. We can do better, and have. There is JGit, but like the Eclipse IDE, it’s an absolutely massive project. The whole idea for us is to have something small and light.

Now this was a headache. It turns out that there is a lot to a repository. I won’t bore you with the details, but thank goodness we’re not trying to add or remove objects. For example, there is not much information out there about packing and unpacking objects - and even less on references.
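To give a flavour of the simplest case: a loose (unpacked) object lives at `.git/objects/<first two hex digits>/<remaining digits>` and is a zlib stream containing a header of the form `type size`, a NUL byte, and then the content. The sketch below decodes one from bytes already read off disk. It is an illustration only, not the project's parser, and it does not touch pack files or references. The `deflate` helper exists purely to fabricate sample data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class LooseObject {
    final String type;    // "blob", "tree", "commit" or "tag"
    final byte[] content; // bytes after the header

    private LooseObject(String type, byte[] content) {
        this.type = type;
        this.content = content;
    }

    /** Decodes the raw bytes of a loose object file. */
    static LooseObject decode(byte[] compressed) {
        byte[] raw = inflate(compressed);
        int nul = 0;
        while (nul < raw.length && raw[nul] != 0) nul++;
        if (nul == raw.length) throw new IllegalArgumentException("Missing header terminator");
        String header = new String(raw, 0, nul, StandardCharsets.US_ASCII);
        int space = header.indexOf(' ');
        if (space < 0) throw new IllegalArgumentException("Malformed header: " + header);
        int size = Integer.parseInt(header.substring(space + 1));
        if (size != raw.length - nul - 1) throw new IllegalArgumentException("Size mismatch");
        return new LooseObject(header.substring(0, space),
                Arrays.copyOfRange(raw, nul + 1, raw.length));
    }

    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or invalid zlib stream");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper for building sample input; not needed for reading a repository. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Reading a real object would amount to passing `Files.readAllBytes(...)` of the object file to `LooseObject.decode`. Packed objects and references need considerably more work, which is where the headache starts.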

Future Work

There are still a few issues at the time of writing:

Here is an overview of some of the basic features in the pipeline:

In general, I want to push towards having the ability to clone, pull, fetch, push, etc. (i.e. basic Git server functionality). This may require accounts, which is nice and scary! (Yaay cyber security.)