Coffee Space


Code Highlighting

Back in a really early version of this website, there was a JavaScript-based code highlighting system. One of the major benefits of that syntax highlighter was that it was language agnostic, meaning that it had zero idea of the language it was parsing. This meant that it was fast and generic.

This generally worked well, but I was not happy that it relied on the client’s browser to process the web page. Back then, the entire web page was rendered using JavaScript - meaning that on some browsers, it simply didn’t load at all.

Pandoc Highlighting

I had some success in the past using the Pandoc syntax highlighter, where you can run something like:

pandoc input.md --highlight-style style.theme -o output.html

The problems with this are:

  1. Styles not available in all versions - The server is a potato and is running an old version of Debian (but it’s cheap). This means that I have to generate a theme file.
  2. Themes not handled reliably - Theme files are not passed reliably in older versions. You have to do some work to get something working in older and newer versions of Pandoc.
  3. Difference between local and server - When you do get a theme actually working, there is a massive visual difference between the server version and the locally rendered version.

Whilst the built-in syntax highlighting is great, it’s just not reliable enough.

Pandoc Filter

Of all the options for writing a Pandoc filter, Python seemed like the best of the options I didn’t like. I did initially try parsing the JSON manually, but there appears to be something special going on with the JSON itself and I cannot be bothered to figure it out.

from pandocfilters import toJSONFilter, RawBlock
import re

Here we use the pandocfilters module, and specifically use the toJSONFilter (parse each part of the JSON markup) and RawBlock (used to create the end element).
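To give a feel for what toJSONFilter takes off our hands, here is a rough sketch of that kind of tree walk using only the standard library. It is a simplified stand-in rather than the pandocfilters implementation; the names walk and shout and the tiny hand-written document are made up for illustration.

```python
import json

def walk(node, action, fmt, meta):
    """Simplified stand-in for the traversal a filter helper performs."""
    if isinstance(node, list):
        out = []
        for item in node:
            if isinstance(item, dict) and "t" in item:
                res = action(item["t"], item.get("c"), fmt, meta)
                if res is None:
                    # No replacement returned: keep the element, look inside it.
                    out.append(walk(item, action, fmt, meta))
                else:
                    # Replacement returned: use it as-is.
                    out.append(res)
            else:
                out.append(walk(item, action, fmt, meta))
        return out
    if isinstance(node, dict):
        return {k: walk(v, action, fmt, meta) for k, v in node.items()}
    return node

def shout(key, value, fmt, meta):
    # Hypothetical action: upper-case every plain string element.
    if key == "Str":
        return {"t": "Str", "c": value.upper()}

blocks = [{"t": "Para", "c": [{"t": "Str", "c": "hello"}]}]
result = walk(blocks, shout, "html", {})
print(json.dumps(result))
```

The action has the same four-argument shape as the highlight() function later on: return nothing to leave an element alone, or return a new element to replace it.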

Foreshadowing: We also see the regex engine imported there; this was a later speed-up.

symbolother_col  = "#FF920D"
symbolmath_col   = "#197BCE"
symbolnumber_col = "#E32929"
symbolpairs_col  = "#AF18DB"
keyword_col      = "#48BA1C"

These are the colours I chose after typing “vibrant HTML colours” into a search engine. It’s not yet the best colour scheme to match the light and dark modes of this website.

fontstart = '<b><font color="'
fontend = '</font></b>'

symbols = {
  # symbolother
  "&": fontstart + symbolother_col + '">&amp;' + fontend,
  ".": fontstart + symbolother_col + '">.' + fontend,
  ",": fontstart + symbolother_col + '">,' + fontend,
  "?": fontstart + symbolother_col + '">?' + fontend,
  "!": fontstart + symbolother_col + '">!' + fontend,
  "£": fontstart + symbolother_col + '">&#163;' + fontend,
  "$": fontstart + symbolother_col + '">$' + fontend,
  "%": fontstart + symbolother_col + '">%' + fontend,
  "@": fontstart + symbolother_col + '">@' + fontend,
  "~": fontstart + symbolother_col + '">~' + fontend,
  "|": fontstart + symbolother_col + '">|' + fontend,
  "#": fontstart + symbolother_col + '">#' + fontend,
  ":": fontstart + symbolother_col + '">:' + fontend,
  ";": fontstart + symbolother_col + '">;' + fontend,
  "=": fontstart + symbolother_col + '">=' + fontend,
  "_": fontstart + symbolother_col + '">_' + fontend,
  "\"": fontstart + symbolother_col + '">&#34;' + fontend,
  "\'": fontstart + symbolother_col + '">\'' + fontend,
  # symbolpairs
  "[": fontstart + symbolpairs_col + '">[' + fontend,
  "]": fontstart + symbolpairs_col + '">]' + fontend,
  "{": fontstart + symbolpairs_col + '">{' + fontend,
  "}": fontstart + symbolpairs_col + '">}' + fontend,
  "(": fontstart + symbolpairs_col + '">(' + fontend,
  ")": fontstart + symbolpairs_col + '">)' + fontend,
  "<": fontstart + symbolpairs_col + '">&lt;' + fontend,
  ">": fontstart + symbolpairs_col + '">&gt;' + fontend,
  # symbolmath
  "+": fontstart + symbolmath_col + '">+' + fontend,
  "-": fontstart + symbolmath_col + '">-' + fontend,
  "*": fontstart + symbolmath_col + '">*' + fontend,
  "^": fontstart + symbolmath_col + '">^' + fontend,
  "/": fontstart + symbolmath_col + '">/' + fontend,
  "\\": fontstart + symbolmath_col + '">\\' + fontend,
  # symbolnumber
  "0": fontstart + symbolnumber_col + '">0' + fontend,
  "1": fontstart + symbolnumber_col + '">1' + fontend,
  "2": fontstart + symbolnumber_col + '">2' + fontend,
  "3": fontstart + symbolnumber_col + '">3' + fontend,
  "4": fontstart + symbolnumber_col + '">4' + fontend,
  "5": fontstart + symbolnumber_col + '">5' + fontend,
  "6": fontstart + symbolnumber_col + '">6' + fontend,
  "7": fontstart + symbolnumber_col + '">7' + fontend,
  "8": fontstart + symbolnumber_col + '">8' + fontend,
  "9": fontstart + symbolnumber_col + '">9' + fontend
}

I pre-build a basic set of symbols in a dictionary. Later on we will pass this on to the regex engine.
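As a possible tidy-up of the same idea, the table could be generated rather than typed out. This is only a sketch: wrap, build and sketch_symbols are hypothetical names, only three of the four symbol groups are shown, and it leans on html.escape for the HTML specials instead of hand-written entities.

```python
import html

FONT_START = '<b><font color="'
FONT_END = "</font></b>"

def wrap(colour, text):
    """Wrap text in bold, coloured font tags, escaping &, < and >."""
    return FONT_START + colour + '">' + html.escape(text, quote=False) + FONT_END

def build(colour, chars):
    """Map each character to its pre-rendered HTML replacement."""
    return {ch: wrap(colour, ch) for ch in chars}

sketch_symbols = {}
sketch_symbols.update(build("#AF18DB", "[]{}()<>"))    # pairs
sketch_symbols.update(build("#197BCE", "+-*^/\\"))     # maths
sketch_symbols.update(build("#E32929", "0123456789"))  # numbers

print(sketch_symbols["<"])
```

The trade-off is that characters needing a specific numeric entity would still have to be listed by hand.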

c_symbols = {
  # Keywords: https://en.cppreference.com/w/cpp/keyword
  "alignas":                  fontstart + keyword_col + '">' + "alignas"                  +'' + fontend,
  "alignof":                  fontstart + keyword_col + '">' + "alignof"                  +'' + fontend,
  "and":                      fontstart + keyword_col + '">' + "and"                      +'' + fontend,
  "and_eq":                   fontstart + keyword_col + '">' + "and_eq"                   +'' + fontend,
  "asm":                      fontstart + keyword_col + '">' + "asm"                      +'' + fontend,
  "atomic_cancel":            fontstart + keyword_col + '">' + "atomic_cancel"            +'' + fontend,
  "atomic_commit":            fontstart + keyword_col + '">' + "atomic_commit"            +'' + fontend,
  "atomic_noexcept":          fontstart + keyword_col + '">' + "atomic_noexcept"          +'' + fontend,
  "auto":                     fontstart + keyword_col + '">' + "auto"                     +'' + fontend,
  "bitand":                   fontstart + keyword_col + '">' + "bitand"                   +'' + fontend,
  "bitor":                    fontstart + keyword_col + '">' + "bitor"                    +'' + fontend,
  "bool":                     fontstart + keyword_col + '">' + "bool"                     +'' + fontend,
  "break":                    fontstart + keyword_col + '">' + "break"                    +'' + fontend,
  "case":                     fontstart + keyword_col + '">' + "case"                     +'' + fontend,
  "catch":                    fontstart + keyword_col + '">' + "catch"                    +'' + fontend,
  "char":                     fontstart + keyword_col + '">' + "char"                     +'' + fontend,
  "char8_t":                  fontstart + keyword_col + '">' + "char8_t"                  +'' + fontend,
  "char16_t":                 fontstart + keyword_col + '">' + "char16_t"                 +'' + fontend,
  "char32_t":                 fontstart + keyword_col + '">' + "char32_t"                 +'' + fontend,
  "class":                    fontstart + keyword_col + '">' + "class"                    +'' + fontend,
  "compl":                    fontstart + keyword_col + '">' + "compl"                    +'' + fontend,
  "concept":                  fontstart + keyword_col + '">' + "concept"                  +'' + fontend,
  "const":                    fontstart + keyword_col + '">' + "const"                    +'' + fontend,
  "consteval":                fontstart + keyword_col + '">' + "consteval"                +'' + fontend,
  "constexpr":                fontstart + keyword_col + '">' + "constexpr"                +'' + fontend,
  "constinit":                fontstart + keyword_col + '">' + "constinit"                +'' + fontend,
  "const_cast":               fontstart + keyword_col + '">' + "const_cast"               +'' + fontend,
  "continue":                 fontstart + keyword_col + '">' + "continue"                 +'' + fontend,
  "co_await":                 fontstart + keyword_col + '">' + "co_await"                 +'' + fontend,
  "co_return":                fontstart + keyword_col + '">' + "co_return"                +'' + fontend,
  "co_yield":                 fontstart + keyword_col + '">' + "co_yield"                 +'' + fontend,
  "decltype":                 fontstart + keyword_col + '">' + "decltype"                 +'' + fontend,
  "default":                  fontstart + keyword_col + '">' + "default"                  +'' + fontend,
  "delete":                   fontstart + keyword_col + '">' + "delete"                   +'' + fontend,
  "do":                       fontstart + keyword_col + '">' + "do"                       +'' + fontend,
  "double":                   fontstart + keyword_col + '">' + "double"                   +'' + fontend,
  "dynamic_cast":             fontstart + keyword_col + '">' + "dynamic_cast"             +'' + fontend,
  "else":                     fontstart + keyword_col + '">' + "else"                     +'' + fontend,
  "enum":                     fontstart + keyword_col + '">' + "enum"                     +'' + fontend,
  "explicit":                 fontstart + keyword_col + '">' + "explicit"                 +'' + fontend,
  "export":                   fontstart + keyword_col + '">' + "export"                   +'' + fontend,
  "extern":                   fontstart + keyword_col + '">' + "extern"                   +'' + fontend,
  "false":                    fontstart + keyword_col + '">' + "false"                    +'' + fontend,
  "float":                    fontstart + keyword_col + '">' + "float"                    +'' + fontend,
  "for":                      fontstart + keyword_col + '">' + "for"                      +'' + fontend,
  "friend":                   fontstart + keyword_col + '">' + "friend"                   +'' + fontend,
  "goto":                     fontstart + keyword_col + '">' + "goto"                     +'' + fontend,
  "if":                       fontstart + keyword_col + '">' + "if"                       +'' + fontend,
  "inline":                   fontstart + keyword_col + '">' + "inline"                   +'' + fontend,
  "int":                      fontstart + keyword_col + '">' + "int"                      +'' + fontend,
  "long":                     fontstart + keyword_col + '">' + "long"                     +'' + fontend,
  "mutable":                  fontstart + keyword_col + '">' + "mutable"                  +'' + fontend,
  "namespace":                fontstart + keyword_col + '">' + "namespace"                +'' + fontend,
  "new":                      fontstart + keyword_col + '">' + "new"                      +'' + fontend,
  "noexcept":                 fontstart + keyword_col + '">' + "noexcept"                 +'' + fontend,
  "not":                      fontstart + keyword_col + '">' + "not"                      +'' + fontend,
  "not_eq":                   fontstart + keyword_col + '">' + "not_eq"                   +'' + fontend,
  "nullptr":                  fontstart + keyword_col + '">' + "nullptr"                  +'' + fontend,
  "operator":                 fontstart + keyword_col + '">' + "operator"                 +'' + fontend,
  "or":                       fontstart + keyword_col + '">' + "or"                       +'' + fontend,
  "or_eq":                    fontstart + keyword_col + '">' + "or_eq"                    +'' + fontend,
  "private":                  fontstart + keyword_col + '">' + "private"                  +'' + fontend,
  "protected":                fontstart + keyword_col + '">' + "protected"                +'' + fontend,
  "public":                   fontstart + keyword_col + '">' + "public"                   +'' + fontend,
  "reflexpr":                 fontstart + keyword_col + '">' + "reflexpr"                 +'' + fontend,
  "register":                 fontstart + keyword_col + '">' + "register"                 +'' + fontend,
  "reinterpret_cast":         fontstart + keyword_col + '">' + "reinterpret_cast"         +'' + fontend,
  "requires":                 fontstart + keyword_col + '">' + "requires"                 +'' + fontend,
  "return":                   fontstart + keyword_col + '">' + "return"                   +'' + fontend,
  "short":                    fontstart + keyword_col + '">' + "short"                    +'' + fontend,
  "signed":                   fontstart + keyword_col + '">' + "signed"                   +'' + fontend,
  "sizeof":                   fontstart + keyword_col + '">' + "sizeof"                   +'' + fontend,
  "static":                   fontstart + keyword_col + '">' + "static"                   +'' + fontend,
  "static_assert":            fontstart + keyword_col + '">' + "static_assert"            +'' + fontend,
  "static_cast":              fontstart + keyword_col + '">' + "static_cast"              +'' + fontend,
  "struct":                   fontstart + keyword_col + '">' + "struct"                   +'' + fontend,
  "switch":                   fontstart + keyword_col + '">' + "switch"                   +'' + fontend,
  "synchronized":             fontstart + keyword_col + '">' + "synchronized"             +'' + fontend,
  "template":                 fontstart + keyword_col + '">' + "template"                 +'' + fontend,
  "this":                     fontstart + keyword_col + '">' + "this"                     +'' + fontend,
  "thread_local":             fontstart + keyword_col + '">' + "thread_local"             +'' + fontend,
  "throw":                    fontstart + keyword_col + '">' + "throw"                    +'' + fontend,
  "true":                     fontstart + keyword_col + '">' + "true"                     +'' + fontend,
  "try":                      fontstart + keyword_col + '">' + "try"                      +'' + fontend,
  "typedef":                  fontstart + keyword_col + '">' + "typedef"                  +'' + fontend,
  "typeid":                   fontstart + keyword_col + '">' + "typeid"                   +'' + fontend,
  "typename":                 fontstart + keyword_col + '">' + "typename"                 +'' + fontend,
  "union":                    fontstart + keyword_col + '">' + "union"                    +'' + fontend,
  "unsigned":                 fontstart + keyword_col + '">' + "unsigned"                 +'' + fontend,
  "using":                    fontstart + keyword_col + '">' + "using"                    +'' + fontend,
  "virtual":                  fontstart + keyword_col + '">' + "virtual"                  +'' + fontend,
  "void":                     fontstart + keyword_col + '">' + "void"                     +'' + fontend,
  "volatile":                 fontstart + keyword_col + '">' + "volatile"                 +'' + fontend,
  "wchar_t":                  fontstart + keyword_col + '">' + "wchar_t"                  +'' + fontend,
  "while":                    fontstart + keyword_col + '">' + "while"                    +'' + fontend,
  "xor":                      fontstart + keyword_col + '">' + "xor"                      +'' + fontend,
  "xor_eq":                   fontstart + keyword_col + '">' + "xor_eq"                   +'' + fontend,
  "final":                    fontstart + keyword_col + '">' + "final"                    +'' + fontend,
  "override":                 fontstart + keyword_col + '">' + "override"                 +'' + fontend,
  "transaction_safe":         fontstart + keyword_col + '">' + "transaction_safe"         +'' + fontend,
  "transaction_safe_dynamic": fontstart + keyword_col + '">' + "transaction_safe_dynamic" +'' + fontend,
  "import":                   fontstart + keyword_col + '">' + "import"                   +'' + fontend,
  "module":                   fontstart + keyword_col + '">' + "module"                   +'' + fontend,
  "#if":                      fontstart + keyword_col + '">' + "#if"                      +'' + fontend,
  "#elif":                    fontstart + keyword_col + '">' + "#elif"                    +'' + fontend,
  "#else":                    fontstart + keyword_col + '">' + "#else"                    +'' + fontend,
  "#endif":                   fontstart + keyword_col + '">' + "#endif"                   +'' + fontend,
  "#ifdef":                   fontstart + keyword_col + '">' + "#ifdef"                   +'' + fontend,
  "#ifndef":                  fontstart + keyword_col + '">' + "#ifndef"                  +'' + fontend,
  "#define":                  fontstart + keyword_col + '">' + "#define"                  +'' + fontend,
  "#undef":                   fontstart + keyword_col + '">' + "#undef"                   +'' + fontend,
  "#include":                 fontstart + keyword_col + '">' + "#include"                 +'' + fontend,
  "#line":                    fontstart + keyword_col + '">' + "#line"                    +'' + fontend,
  "#error":                   fontstart + keyword_col + '">' + "#error"                   +'' + fontend,
  "#pragma":                  fontstart + keyword_col + '">' + "#pragma"                  +'' + fontend,
  "#defined":                 fontstart + keyword_col + '">' + "#defined"                 +'' + fontend,
  "#__has_include":           fontstart + keyword_col + '">' + "#__has_include"           +'' + fontend,
  "#__has_cpp_attribute":     fontstart + keyword_col + '">' + "#__has_cpp_attribute"     +'' + fontend,
  "#export":                  fontstart + keyword_col + '">' + "#export"                  +'' + fontend,
  "#import":                  fontstart + keyword_col + '">' + "#import"                  +'' + fontend,
  "#module":                  fontstart + keyword_col + '">' + "#module"                  +'' + fontend,
}

This mammoth is a manually curated list of C++ keywords. I’m still not convinced it works as well as I had hoped, either. It may get pulled out at a later date.
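One way the list might be slimmed down, sketched here under the assumption that every keyword gets the same treatment, is to keep only the names and generate the HTML. The names keyword_table, sample_keywords and sketch_c_symbols are hypothetical, and the sample is a small hand-picked subset.

```python
KEYWORD_COL = "#48BA1C"

def keyword_table(words, colour=KEYWORD_COL):
    """Build a keyword -> highlighted HTML mapping from plain names."""
    return {
        w: '<b><font color="' + colour + '">' + w + "</font></b>"
        for w in words
    }

# A small sample; the full list would be pasted in the same way.
sample_keywords = "auto bool break case char class const".split()
sketch_c_symbols = keyword_table(sample_keywords)

print(sketch_c_symbols["bool"])
```

A side effect of building the dictionary from a list is that an accidentally repeated keyword simply collapses into one entry.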

# highlight()
#
# Filter for code blocks and perform the highlighting.
#
# @param key The type of data it is as indicated by the markup.
# @param value The payload of the data (unicode).
# @param format The target output format, for example "html".
# @param meta The document metadata (not the company Facebook turned into).
# @return The new data structure, otherwise don't return anything and the
# default is used.
def highlight(key, value, format, meta) :
  if key == "CodeBlock" :
    lang = ""
    try :
      # The first class on the code block, if any, is taken as the language
      lang = value[0][1][0]
    except Exception :
      lang = ""
    syms = symbols
    if lang in ("c", "cpp", "c++", "java", "javascript") :
      # Keywords first, then the basic symbols on top
      syms = {}
      syms.update(c_symbols)
      syms.update(symbols)
    value[1] = multiple_replace(value[1], syms)
    return RawBlock("html", "<pre>" + value[1] + "</pre>")

Here we want to process only CodeBlock elements, and we try to pull out the language (it doesn’t always work). If we detect a C-family language, we add the keywords to be highlighted. We then perform the multiple_replace() operation.
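As a rough illustration of why the language lookup can come back empty, assume the Pandoc JSON layout in which a CodeBlock payload is [[identifier, classes, attributes], code]. The helper name block_language and both payloads are made up for this sketch.

```python
def block_language(value):
    """Return the first class of a CodeBlock payload, or "" if absent."""
    try:
        return value[0][1][0]
    except (IndexError, TypeError):
        return ""

# Hand-written payloads in the shape [[identifier, classes, attributes], code].
block_with_lang = [["", ["cpp"], []], "int x = 0;"]
block_without_lang = [["", [], []], "int x = 0;"]

print(repr(block_language(block_with_lang)))
print(repr(block_language(block_without_lang)))
```

Under that assumption, a block written without a language has an empty class list, so the lookup falls through to the empty string. The first class is also not guaranteed to be a language name if other classes are listed before it.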

One thing you will not find much information about online is that RawBlock is how you create a raw element. The first parameter is the format (your target output format, in this case HTML), and the second is the raw source for that element.
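For orientation, this is roughly the element I understand RawBlock to produce, written as a plain dictionary. The raw_block function is a hypothetical stand-in for illustration and not the library function itself, so treat the exact shape as an assumption.

```python
import json

def raw_block(fmt, text):
    """Hypothetical stand-in: a raw element tagged with its output format."""
    return {"t": "RawBlock", "c": [fmt, text]}

node = raw_block("html", "<pre>int x;</pre>")
print(json.dumps(node))
```

The useful part is the pairing: the format says which writer should emit the text untouched, and the text is passed through without further escaping.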

We use <pre> tags rather than <code> tags to ensure we can use <font> tags - but it does mean we are also responsible for our own parsing.

# Source: https://stackoverflow.com/a/15448887/2847743
def multiple_replace(string, rep_dict):
    pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict,key=len,reverse=True)]), flags=re.DOTALL)
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

This was a cool piece of code I found on Stack Overflow to speed up the text replacement. Beforehand I did it much like the original code, making multiple passes over the text. The benefit of this code is that it’s supposed to do a single, well-optimised pass over the code.
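A small demonstration of the behaviour, using the same function with a made-up replacement table (the bracketed markers stand in for the real HTML). It shows two things: longer keys win because they are tried first, and keywords are matched as plain substrings, which is probably part of why the keyword highlighting is not as good as hoped.

```python
import re

def multiple_replace(string, rep_dict):
    # Longest keys first, so "#if" is tried before "#" and "if".
    keys = sorted(rep_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), flags=re.DOTALL)
    return pattern.sub(lambda m: rep_dict[m.group(0)], string)

demo = {"if": "[KW:if]", "#if": "[KW:#if]", "#": "[SYM:#]", "<": "&lt;"}

first = multiple_replace("#if a<b", demo)
second = multiple_replace("diff", demo)

print(first)   # [KW:#if] a&lt;b
print(second)  # d[KW:if]f
```

Because the substitution works on the original string in one go, the replacement text is never scanned again, so the angle brackets and quotes inside the inserted tags are not themselves highlighted.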

# Main entry into the program
if __name__ == "__main__" :
  toJSONFilter(highlight)

Obviously we want to run everything through our highlight filter.

When I think this is more mature I will look towards setting up a repository for this code. Other improvements I want to make:

  1. Simplification - That massive list of keywords is awkward, I think I can do better.
  2. Comments - It would be nice to properly highlight comments in the source, such as /* this */ and // this.
  3. Line numbering - I want line numbers down the sides of the snippets and give people the ability to directly link to them. Perhaps in the future it could be useful to reference code blocks on the site.
  4. Better colours - I still think there is room for improvement on the colours used.
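For the line numbering idea, one possible shape is sketched below. It is not implemented in the filter; number_lines, the anchor naming scheme and the block identifier are all assumptions made for illustration.

```python
def number_lines(html_lines, block_id):
    """Prefix each line with a zero-padded, linkable line number."""
    out = []
    for n, line in enumerate(html_lines, start=1):
        anchor = "{}-L{}".format(block_id, n)
        out.append('<a id="{0}" href="#{0}">{1:04d}</a> {2}'.format(anchor, n, line))
    return "\n".join(out)

numbered = number_lines(["int x;", "return x;"], "snippet1")
print(numbered)
```

Each line gets its own anchor, so a link ending in #snippet1-L2 would jump straight to the second line of that snippet.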

Let’s see what happens next!