I created a simple open-source Git
server that’s live and is
implemented in a handful of Java class files.
Problem
We want an open source Git server that’s cross-platform, simple
(handful of source files) and fast. We want to offer some basic
functionality to share and work on one or more projects, but not have to
rely on some external service. We should put extra emphasis on
simplicity - you should be able to understand the whole project quite
quickly. It’s one thing to have an open source project, it’s quite
another to have an understandable open source project.
Existing Solutions
Here are a few existing solutions and why I didn’t consider them
useful:
GitHub & BitBucket - These are obviously
industry standard black boxes, not quite what we’re going for here.
GitLab - Open source, but an
absolute mammoth of a project at this point.
sourcehut - Now we’re cooking!
Except this project is also absolutely massive at this point. (You can
tell because Drew DeVault can change for it.) It’s certainly something
to be looked up to though.
Initial Implementation
Rapid prototype
This is the initial implementation coded over ~2 days, just as a
proof of concept. I really didn’t start this with any real kind of plan
in mind - and wasn’t entirely sure what issues may occur in the project.
The following are some implementation points.
Basic Java Server
The web server is very simple and is based on the Java ServerSocket
and Socket.
A connection comes in and is offloaded onto it’s own new Thread
as quickly as “possible”. It is possible to have a Thread
Pool, but at least this version aims to be as simple as possible. We
assume for now that we’re not going to be handling more than a 1000
requests a second (more likely one request per second).
On this new Thread the socket connection is read up to a reasonable
maximum number of bytes and the request URL is pulled out. For now, we
don’t care about any other part of the header and only support GET
requests (everything is treated as a GET request). We then send this
request URL off to the PageBuilder class.
Page Builder
The responsibility of PageBuilder is - well - to build
the page. This means that it tries to figure out what the user is
requesting (as quickly as it can) and return the required data. This
means it often sends the requests off to the Git class,
wrapping it in some HTML and then sending it back to the user.
This was quite slow, with some requests taking between ~50ms and
~250ms. Clearly this won’t be very scalable for a simple potato web
server.
Git CLI Wrapper
So now to discuss how it accessed Git. This was really quite simple,
as it executes a Process
and reads the standard output. It then breaks this String down and gives
it back to the PageBuilder to format and send back to the
user.
Whilst this method works, it’s incredibly slow and accounted for a
lot of time being wasted. To give an idea, approximately the following
would be done per request:
Create and start a Process handler in Java.
Find and load the Git binary into RAM.
Git binary reads the repository from disk, decompressing it where
needed and then executing the command on it.
Process ends and the standard output is read after execution
ends.
There should be a better way to do this…
Current Implementation
Alternative title: A better way to do this.
Web-ready version
This is the current implementation, and as you can see it’s a lot
more refined. Despite having more features, in some use-cases the pages
can be served more than 1000 times faster due to the various
implementations underneath. In the following I will go over some of the
larger changes made over the course of a few more days.
JSON Parser
First thing’s first, a JSON parser written from scratch. Why? Because
it was a great learning experience and it seemed crazy to use somebody’s
project just for a one-off parse of some JSON data. I’ve never written a
parser before, but the implementation was reasonably simple (minus a few
hiccups). It’s not fully tested as of yet, but it does allow for some
crazy things like this:
Everything is considered a String (which makes no real difference to
how JSON parses numbers anyway.)
It’s usable for this project, but is a long way off from being an
industry standard parser. One thing I hate about other Java JSON parsers
is that they like to die all the time during parsing if there’s even one
small thing out of place. Generally I like my parsers to log/flag a
warning and try to limp on. I’ll likely put this class in it’s own
repository at some point as it’s a very interesting implementation if
nothing else.
Maintenance Thread
The Maintenance class is a thread that simply checks for
changes in the remote repository and pulls them in if needed. It then
invalidates the repository update time, which in turn invalidates the
cache for any related items to that repository. This parts works quite
well.
Page Builder Improvements
As you can see, it now has a snazzy little logo (SVG) and some better
CSS. All of this is also loaded nicely from the configuration file. It
really makes the project look more modern (if I don’t blow my own horn
too much).
Pages are now built faster. There were some major changes to make
this possible:
The StringBuilder
is very cool and fast, much faster than +’ing Strings
together by quite some margin.
Escaping the Strings was a lot faster if we simply ran a single
parse of the String and inspected it character by character (as we have
to do it three times for &, < and
>).
Converting a String to bytes and writing to the OutputStream
turns out to be really expensive, so we try to minimize how much we do
it.
On top of this, we also now cache the content after it’s generated.
This reduces content generation time massively. If a page is being
shared and requested alot or somebody is flicking between a few pages,
this massively reduces the server load from ~50ms down to ~0ms. In
theory it should be possible to avoid the hug of death from popular link
sharing sites.
Git Repository Parser
I initially looked to use JavaGit, but the project is
abandoned, it’s unclear if it still works and it is effectively just a
wrapper around the command line anyway. We can and have done better.
There is JGit, but like the
eclipse IDE, it’s an absolutely massive project. The whole idea for us
is to have something small and light.
Now this was a headache. It turns out that there is a lot to a
repository. I won’t bore you with the details, but thank goodness we’re
not trying to add or remove objects. For example, there is not much
information out there about packing and unpacking objects - and even
less on references.
Future Work
There are still a few issues at the time of writing:
Diffs can take a long time to execute and generally don’t display so
well
Cache seems to have a bug for cache timeout - it should
refresh the content if it’s been around too long, but currently
doesn’t
Here is an overview of some of the basic features in the
pipeline:
Support branches/tags (references) in both the commit overview and
in the commits overview
File diff generated internally (instead of using the slow CLI
implementation)
Display file overviews for a specific commit
Coloured diffs
Download snapshots of the repository (by reference)
Proper logging to disk
Admin page with admin power and statistics about the server
In general, I want to push towards having the ability to clone, pull,
fetch, push, etc (i.e. basic Git server functionality). This may
required the need for accounts which is nice and scary! (Yaay cyber
security.)