Coffee Space 
Back in the really early versions of this website, there used to be a JavaScript-based code highlighting system. One of the major benefits of that syntax highlighter was that it was language agnostic, meaning that it had zero idea of the language it was parsing. This meant that it was fast and generic.
This generally worked well, but I was not happy that it relied on the client’s browser to process the web page. Back then, the entire web page was rendered using JavaScript - meaning that on some browsers, it simply didn’t load at all.
I had some success in the past using the Pandoc syntax highlighter, where you can run something like:
pandoc input.md --highlight-style style.theme -o output.html
The problem with this is:
Whilst the built-in syntax highlighting is great, it’s just not reliable enough.
Out of all of the options for writing a Pandoc filter, Python seemed like the best of a set of options I didn’t like. I did initially try parsing the JSON manually, but there appears to be something special going on with the JSON itself and I could not be bothered to figure it out.
from pandocfilters import toJSONFilter, RawBlock
import re
Here we use the pandocfilters module, and specifically use toJSONFilter (parse each part of the JSON markup) and RawBlock (used to create the end element).
Foreshadowing: we also see the regex engine imported there; this was a later speed-up.
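For anyone curious about the JSON that a filter receives, the sketch below walks a hand-written fragment using only the standard library. The sample document and the walk helper are hypothetical and only approximate what toJSONFilter does for us; the part worth noticing is that each element is tagged with a type under "t" and carries its content under "c".

```python
import json

# A hand-written fragment shaped like Pandoc's JSON output (an illustrative
# sample, not captured from a real Pandoc run).
SAMPLE = json.loads("""
{
  "pandoc-api-version": [1, 22],
  "meta": {},
  "blocks": [
    {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]},
    {"t": "CodeBlock", "c": [["", ["cpp"], []], "int x = 0;"]}
  ]
}
""")

def walk(node, found):
    """Collect (type, content) for every tagged element, depth first."""
    if isinstance(node, dict):
        if "t" in node:
            found.append((node["t"], node.get("c")))
        for child in node.values():
            walk(child, found)
    elif isinstance(node, list):
        for child in node:
            walk(child, found)
    return found

elements = walk(SAMPLE, [])
code_blocks = [content for kind, content in elements if kind == "CodeBlock"]
print(code_blocks)
```

A CodeBlock's content is a pair: the attributes (identifier, classes, key-value pairs) followed by the code itself.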
symbolother_col = "#FF920D"
symbolmath_col = "#197BCE"
symbolnumber_col = "#E32929"
symbolpairs_col = "#AF18DB"
keyword_col = "#48BA1C"
These are the colours I chose after typing “vibrant HTML colours” into a search engine. It’s not yet the best colour scheme to match the light and dark modes of this website.
fontstart = '<b><font color="'
fontend = '</font></b>'

# Note: &, < and > are emitted as HTML entities so the generated markup
# stays valid.
symbols = {
    # symbolother
    "&": fontstart + symbolother_col + '">&amp;' + fontend,
    ".": fontstart + symbolother_col + '">.' + fontend,
    ",": fontstart + symbolother_col + '">,' + fontend,
    "?": fontstart + symbolother_col + '">?' + fontend,
    "!": fontstart + symbolother_col + '">!' + fontend,
    "£": fontstart + symbolother_col + '">£' + fontend,
    "$": fontstart + symbolother_col + '">$' + fontend,
    "%": fontstart + symbolother_col + '">%' + fontend,
    "@": fontstart + symbolother_col + '">@' + fontend,
    "~": fontstart + symbolother_col + '">~' + fontend,
    "|": fontstart + symbolother_col + '">|' + fontend,
    "#": fontstart + symbolother_col + '">#' + fontend,
    ":": fontstart + symbolother_col + '">:' + fontend,
    ";": fontstart + symbolother_col + '">;' + fontend,
    "=": fontstart + symbolother_col + '">=' + fontend,
    "_": fontstart + symbolother_col + '">_' + fontend,
    "\"": fontstart + symbolother_col + '">"' + fontend,
    "\'": fontstart + symbolother_col + '">\'' + fontend,
    # symbolpairs
    "[": fontstart + symbolpairs_col + '">[' + fontend,
    "]": fontstart + symbolpairs_col + '">]' + fontend,
    "{": fontstart + symbolpairs_col + '">{' + fontend,
    "}": fontstart + symbolpairs_col + '">}' + fontend,
    "(": fontstart + symbolpairs_col + '">(' + fontend,
    ")": fontstart + symbolpairs_col + '">)' + fontend,
    "<": fontstart + symbolpairs_col + '">&lt;' + fontend,
    ">": fontstart + symbolpairs_col + '">&gt;' + fontend,
    # symbolmath
    "+": fontstart + symbolmath_col + '">+' + fontend,
    "-": fontstart + symbolmath_col + '">-' + fontend,
    "*": fontstart + symbolmath_col + '">*' + fontend,
    "^": fontstart + symbolmath_col + '">^' + fontend,
    "/": fontstart + symbolmath_col + '">/' + fontend,
    "\\": fontstart + symbolmath_col + '">\\' + fontend,
    # symbolnumber
    "0": fontstart + symbolnumber_col + '">0' + fontend,
    "1": fontstart + symbolnumber_col + '">1' + fontend,
    "2": fontstart + symbolnumber_col + '">2' + fontend,
    "3": fontstart + symbolnumber_col + '">3' + fontend,
    "4": fontstart + symbolnumber_col + '">4' + fontend,
    "5": fontstart + symbolnumber_col + '">5' + fontend,
    "6": fontstart + symbolnumber_col + '">6' + fontend,
    "7": fontstart + symbolnumber_col + '">7' + fontend,
    "8": fontstart + symbolnumber_col + '">8' + fontend,
    "9": fontstart + symbolnumber_col + '">9' + fontend
}
I pre-build a basic set of symbols in a dictionary. Later on we will pass this on to the regex engine.
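The same kind of table could also be generated rather than typed out. What follows is only a sketch: the build_table helper, the upper-case constant names and the grouping of characters by colour are assumptions made for illustration. It also passes each character through the standard html.escape, so that &, < and > come out as entities.

```python
import html

FONT_START = '<b><font color="'
FONT_END = '</font></b>'

def build_table(groups):
    """Map each character to its coloured, HTML-escaped replacement."""
    table = {}
    for colour, chars in groups.items():
        for ch in chars:
            escaped = html.escape(ch, quote=False)
            table[ch] = FONT_START + colour + '">' + escaped + FONT_END
    return table

# Hypothetical grouping: colour -> the characters that share it.
groups = {
    "#AF18DB": "[]{}()<>",
    "#197BCE": "+-*^/\\",
    "#E32929": "0123456789",
}
table = build_table(groups)
print(table["<"])
```

The trade-off is that the explicit dictionary is easier to tweak one entry at a time, whereas the generated one is harder to get out of sync.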
# Keywords: https://en.cppreference.com/w/cpp/keyword
c_keywords = [
    "alignas", "alignof", "and", "and_eq", "asm", "atomic_cancel",
    "atomic_commit", "atomic_noexcept", "auto", "bitand", "bitor", "bool",
    "break", "case", "catch", "char", "char8_t", "char16_t", "char32_t",
    "class", "compl", "concept", "const", "consteval", "constexpr",
    "constinit", "const_cast", "continue", "co_await", "co_return",
    "co_yield", "decltype", "default", "delete", "do", "double",
    "dynamic_cast", "else", "enum", "explicit", "export", "extern", "false",
    "float", "for", "friend", "goto", "if", "inline", "int", "long",
    "mutable", "namespace", "new", "noexcept", "not", "not_eq", "nullptr",
    "operator", "or", "or_eq", "private", "protected", "public", "reflexpr",
    "register", "reinterpret_cast", "requires", "return", "short", "signed",
    "sizeof", "static", "static_assert", "static_cast", "struct", "switch",
    "synchronized", "template", "this", "thread_local", "throw", "true",
    "try", "typedef", "typeid", "typename", "union", "unsigned", "using",
    "virtual", "void", "volatile", "wchar_t", "while", "xor", "xor_eq",
    "final", "override", "transaction_safe", "transaction_safe_dynamic",
    "import", "module",
    # Preprocessor
    "#if", "#elif", "#else", "#endif", "#ifdef", "#ifndef", "#define",
    "#undef", "#include", "#line", "#error", "#pragma", "#defined",
    "#__has_include", "#__has_cpp_attribute", "#export", "#import",
    "#module",
]

# Every keyword gets the same coloured wrapper around itself.
c_symbols = {k: fontstart + keyword_col + '">' + k + fontend for k in c_keywords}
This mammoth was a manually curated list of C++ keywords. I’m still not convinced it works as well as I would have hoped either. It may get pulled out at a later date.
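One possible reason the keyword list misbehaves is that a plain alternation of keywords also matches inside longer identifiers, so int is found inside print. Below is a hedged sketch of one way around that, using lookarounds so a keyword only matches when it is not surrounded by word characters. The highlight_keywords function and the plain <b> wrapper are made up for the example.

```python
import re

def highlight_keywords(text, keywords):
    """Wrap whole-word occurrences of each keyword in <b> tags."""
    # Longest first, so "and_eq" is tried before "and".
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    # (?<!\w) and (?!\w) reject matches embedded in longer identifiers.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text)

print(highlight_keywords("int print(int x);", ["int", "for"]))
```

Here the two standalone int keywords are wrapped, while the int inside print is left alone.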
# highlight()
#
# Filter for code blocks and perform the highlighting.
#
# @param key The type of data it is as indicated by the markup.
# @param value The payload of the data (unicode).
# @param format The target output format, e.g. "html".
# @param meta The document metadata (not that dumb company Facebook turned into).
# @return The new data structure, otherwise don't return anything and the
# default is used.
def highlight(key, value, format, meta) :
    if key == "CodeBlock" :
        # value is [[identifier, classes, key-values], code]; the first
        # class, if there is one, names the language.
        lang = ""
        try :
            lang = value[0][1][0]
        except IndexError :
            # No class was given for this block.
            lang = ""
        syms = symbols
        if lang in ("c", "cpp", "c++", "java", "javascript") :
            syms = {}
            syms.update(c_symbols)
            syms.update(symbols)
        value[1] = multiple_replace(value[1], syms)
        return RawBlock("html", "<pre>" + value[1] + "</pre>")
Here we want to process only CodeBlock elements, and we try to pull out the language (it doesn’t always work). If we detect a C-family language we add the keywords to be highlighted. We then perform the multiple_replace() operation.
One thing you will not find much information about online is that RawBlock is how you create a raw element. The first parameter is the format (your target output format, in this case HTML), and the second is the raw source for that element.
We use <pre> tags rather than <code> tags to ensure we can use <font> tags - but it does mean we are also responsible for our own parsing.
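To make the shapes involved a little more concrete, here is a small sketch that needs only the standard library. The sample value and the helpers block_language and raw_html_block are hypothetical. The second one builds by hand the kind of dictionary that, as far as I can tell, RawBlock("html", ...) produces, and it escapes the code first because nothing else will do that for us inside a raw block.

```python
import html

def block_language(value):
    """Return the first class of a CodeBlock value, or "" if there is none."""
    # A CodeBlock value looks like [[identifier, classes, key_values], code].
    classes = value[0][1]
    return classes[0] if classes else ""

def raw_html_block(code):
    """Build a RawBlock-shaped dictionary wrapping escaped code in <pre>."""
    body = "<pre>" + html.escape(code, quote=False) + "</pre>"
    return {"t": "RawBlock", "c": ["html", body]}

# Hypothetical payload for a fenced block tagged as cpp.
value = [["", ["cpp"], []], "if (a < b) return;"]
print(block_language(value))
print(raw_html_block(value[1]))
```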
# Source: https://stackoverflow.com/a/15448887/2847743
def multiple_replace(string, rep_dict):
    # Longest keys first, so that e.g. "and_eq" is tried before "and".
    pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict,key=len,reverse=True)]), flags=re.DOTALL)
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)
This was a cool piece of code I found on Stack Overflow to speed up the replacing of text. Beforehand I did it much like the original code, by doing multiple passes. The benefit of this code is that it’s supposed to do a single, well-optimised pass over the code.
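To see why the keys are sorted longest first, and that already replaced text is not scanned again, here is a small self-contained check. The function is repeated so the block runs on its own, and the replacement table is made up for the example.

```python
import re

def multiple_replace(string, rep_dict):
    """Replace every key of rep_dict found in string, in a single pass."""
    keys = sorted(rep_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), flags=re.DOTALL)
    return pattern.sub(lambda m: rep_dict[m.group(0)], string)

# Made-up table: "and" is a prefix of "and_eq".
table = {"and": "[and]", "and_eq": "[and_eq]", "+": "[+]"}
result = multiple_replace("a and_eq b + c and d", table)
print(result)
```

If the keys were not sorted, and happened to come first in the alternation and would match the start of and_eq, leaving _eq behind untouched.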
# Main entry into the program
if __name__ == "__main__" :
    toJSONFilter(highlight)
Obviously we want to run everything through our highlight filter.
When I think this is more mature I will look towards setting up a repository for this code. Other improvements I want to make:
Comment handling, such as /* this */ and // this.
Let’s see what happens next!