Back in a really early version of this website, there used to be a JavaScript-based code highlighting system. One of the major benefits of that syntax highlighter was that it was language agnostic, meaning that it had zero idea of the language it was parsing. This meant that it was fast and generic.
This generally worked well, but I was not happy that it relied on the client’s browser to process the web page. Back then, the entire web page was rendered using JavaScript - meaning that on some browsers, it simply didn’t load at all.
I had some success in the past using the Pandoc syntax highlighter, where you can run something like:
    pandoc input.md --highlight-style style.theme -o output.html
The problem with this is:
Whilst the built-in syntax highlighting is great, it’s just not reliable enough.
Out of all of the options for writing a Pandoc filter, Python seemed like the best option out of all the options I didn’t like. I did initially try parsing the JSON manually, but there appears to be something special going on with the JSON itself and I cannot be bothered to figure it out.
    from pandocfilters import toJSONFilter, RawBlock
    import re
Here we use the pandocfilters module, and specifically toJSONFilter (to parse each part of the JSON markup) and RawBlock (used to create the end element).
Foreshadowing: we also see the regex engine imported there; this was a later speed-up.
    symbolother_col = "#FF920D"
    symbolmath_col = "#197BCE"
    symbolnumber_col = "#E32929"
    symbolpairs_col = "#AF18DB"
    keyword_col = "#48BA1C"
These are the colours I chose after typing “vibrant HTML colours” into a search engine. It’s not yet the best colour scheme to match the light and dark modes of this website.
    fontstart = '<b><font color="'
    fontend = '</font></b>'

    # Each colour paired with the characters that should be drawn in it.
    symbol_groups = [
        (symbolother_col, "&.,?!£$%@~|#:;=_\"'"),
        (symbolpairs_col, "[]{}()<>"),
        (symbolmath_col, "+-*^/\\"),
        (symbolnumber_col, "0123456789"),
    ]

    # Map every character to itself wrapped in bold, coloured markup.
    symbols = {
        char: fontstart + colour + '">' + char + fontend
        for colour, chars in symbol_groups
        for char in chars
    }
I pre-build a basic set of symbols in a dictionary. Later on we will pass this on to the regex engine.
    # Keywords: https://en.cppreference.com/w/cpp/keyword
    c_keywords = [
        "alignas", "alignof", "and", "and_eq", "asm", "atomic_cancel",
        "atomic_commit", "atomic_noexcept", "auto", "bitand", "bitor", "bool",
        "break", "case", "catch", "char", "char8_t", "char16_t", "char32_t",
        "class", "compl", "concept", "const", "consteval", "constexpr",
        "constinit", "const_cast", "continue", "co_await", "co_return",
        "co_yield", "decltype", "default", "delete", "do", "double",
        "dynamic_cast", "else", "enum", "explicit", "export", "extern",
        "false", "float", "for", "friend", "goto", "if", "inline", "int",
        "long", "mutable", "namespace", "new", "noexcept", "not", "not_eq",
        "nullptr", "operator", "or", "or_eq", "private", "protected",
        "public", "reflexpr", "register", "reinterpret_cast", "requires",
        "return", "short", "signed", "sizeof", "static", "static_assert",
        "static_cast", "struct", "switch", "synchronized", "template", "this",
        "thread_local", "throw", "true", "try", "typedef", "typeid",
        "typename", "union", "unsigned", "using", "virtual", "void",
        "volatile", "wchar_t", "while", "xor", "xor_eq", "final", "override",
        "transaction_safe", "transaction_safe_dynamic", "import", "module",
        # Preprocessor directives
        "#if", "#elif", "#else", "#endif", "#ifdef", "#ifndef", "#define",
        "#undef", "#include", "#line", "#error", "#pragma", "#defined",
        "#__has_include", "#__has_cpp_attribute", "#export", "#import",
        "#module",
    ]

    # Map every keyword to itself wrapped in bold, coloured markup.
    c_symbols = {
        keyword: fontstart + keyword_col + '">' + keyword + fontend
        for keyword in c_keywords
    }
This mammoth is a manually curated list of C++ keywords. I’m still not convinced it works as well as I would have hoped, either. It may get pulled out at a later date.
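One possible contributor, offered as a guess rather than a diagnosis: when keywords are matched as plain substrings, a short keyword such as int is also coloured inside a longer identifier such as printf. The sketch below shows one way to require whole-word matches using lookarounds. The replace_whole_words function and the two-entry keywords table are hypothetical, made up for illustration, and are not part of the filter.

```python
import re

# Hypothetical miniature keyword table; the markup is a stand-in for the real one.
keywords = {"int": "<b>int</b>", "#if": "<b>#if</b>"}

def replace_whole_words(text, rep_dict):
    # Longest keys first, so longer keywords win over their own prefixes.
    alternatives = "|".join(
        re.escape(k) for k in sorted(rep_dict, key=len, reverse=True)
    )
    # Lookarounds rather than \b, because \b does not help next to '#'.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])(?:" + alternatives + r")(?![A-Za-z0-9_])"
    )
    return pattern.sub(lambda m: rep_dict[m.group(0)], text)

# "int" is highlighted on its own, but left alone inside "printf".
print(replace_whole_words("int x = printf(fmt);", keywords))
```

This would only suit the keyword table. Single-character entries such as digits would stop matching inside identifiers if they went through the same pattern.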
    # highlight()
    #
    # Filter for code blocks and perform the highlighting.
    #
    # @param key The type of data it is as indicated by the markup.
    # @param value The payload of the data (unicode).
    # @param format The output format Pandoc is writing (e.g. "html").
    # @param meta The document's metadata (not that dumb company Facebook
    # turned into).
    # @return The new data structure, otherwise don't return anything and the
    # default is used.
    def highlight(key, value, format, meta):
        if key == "CodeBlock":
            # value is [[identifier, classes, attributes], code]; the first
            # class, when there is one, names the language.
            try:
                lang = value[0][1][0]
            except Exception:
                lang = ""
            syms = symbols
            if lang in ("c", "cpp", "c++", "java", "javascript"):
                # C-family language: add the keywords to the basic symbols.
                syms = {}
                syms.update(c_symbols)
                syms.update(symbols)
            value[1] = multiple_replace(value[1], syms)
            return RawBlock("html", "<pre>" + value[1] + "</pre>")
Here we want to process only CodeBlock elements, and we try to pull out the language (it doesn’t always work). If we detect a C-family language we add the keywords to be highlighted. We then perform the multiple_replace() operation.
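The language lookup can fail simply because a code block has no language attached. As a minimal sketch, assuming the CodeBlock payload has the shape [[identifier, classes, key-value pairs], code], the first class can be read without relying on an exception. The first_class helper and the sample element are hypothetical and written only for this example.

```python
import json

# A hand-written CodeBlock element in the shape assumed above:
# {"t": "CodeBlock", "c": [[identifier, classes, key_value_pairs], code]}
element = json.loads(
    '{"t": "CodeBlock", "c": [["", ["cpp"], []], "int x = 0;"]}'
)

def first_class(value):
    # value is the "c" payload; value[0][1] is the list of classes.
    classes = value[0][1]
    return classes[0] if classes else ""

print(first_class(element["c"]))            # the declared language
print(repr(first_class([["", [], []], ""])))  # no language: empty string
```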
One thing you will not find too much information about online is that RawBlock is actually how you can create a raw element. The first parameter is the format (your target output format, in this case HTML), and the second is the raw source for that element.
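To make the shape concrete, here is a small sketch of what such a raw element looks like once serialised, assuming the usual {"t": ..., "c": ...} layout of Pandoc's JSON. The raw_block function is a hypothetical stand-in written for this example, not the pandocfilters implementation.

```python
import json

def raw_block(fmt, text):
    # Hypothetical stand-in for RawBlock: a tagged element whose content
    # is the output format followed by the raw source.
    return {"t": "RawBlock", "c": [fmt, text]}

block = raw_block("html", "<pre>int x;</pre>")

# This is the kind of JSON a filter hands back to Pandoc for the element.
print(json.dumps(block))
```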
We use <pre> tags rather than <code> tags to ensure we can use <font> tags - but it does mean we are also responsible for our own parsing.
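Part of that responsibility is escaping: anything in the code that looks like HTML would otherwise be interpreted by the browser. The following is a sketch under assumptions, not the filter's actual behaviour. The highlight_escaped function and its one-entry table are hypothetical, and html.escape from the Python standard library is swapped in to escape the text between matches.

```python
import html
import re

def highlight_escaped(code, rep_dict):
    # Hypothetical single-pass replace that also escapes unmatched text.
    if not rep_dict:
        return html.escape(code)
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(rep_dict, key=len, reverse=True))
    )
    out = []
    pos = 0
    for m in pattern.finditer(code):
        out.append(html.escape(code[pos:m.start()]))  # text between matches
        out.append(rep_dict[m.group(0)])              # pre-built markup
        pos = m.end()
    out.append(html.escape(code[pos:]))               # trailing text
    return "".join(out)

# Stand-in table: the markup for "<" already contains the escaped entity.
table = {"<": "<b>&lt;</b>"}
print(highlight_escaped("a < b && c", table))
```

The design choice here is that the replacement markup is trusted and built with entities in it, while everything else is escaped on the way through.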
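    # Source: https://stackoverflow.com/a/15448887/2847743
    def multiple_replace(string, rep_dict):
        # Build one alternation of all keys, longest first, so that longer
        # keys are tried before any shorter key they contain.
        pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict, key=len, reverse=True)]), flags=re.DOTALL)
        # Replace every match in a single pass over the string.
        return pattern.sub(lambda x: rep_dict[x.group(0)], string)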
This was a cool piece of code I found on Stack Overflow to accelerate the replacing of text. Beforehand I did it much like the original code, by making multiple passes. The benefit of this code is that it’s supposed to do a single, well-optimised pass over the code.
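A toy comparison illustrates why a single pass matters beyond speed. In this sketch, naive_replace is a hypothetical reconstruction of a multi-pass approach (one str.replace() call per key), and the two-entry table is invented for the demonstration; multiple_replace is a trimmed copy of the snippet above.

```python
import re

def multiple_replace(string, rep_dict):
    # One alternation, longest keys first, applied in a single pass.
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(rep_dict, key=len, reverse=True))
    )
    return pattern.sub(lambda m: rep_dict[m.group(0)], string)

def naive_replace(string, rep_dict):
    # One pass per key: later passes also rewrite markup added earlier.
    for key, replacement in rep_dict.items():
        string = string.replace(key, replacement)
    return string

# Hypothetical table in which one replacement contains another key ("i").
table = {"x": "<i>x</i>", "i": "<b>i</b>"}

print(multiple_replace("xi", table))  # markup stays intact
print(naive_replace("xi", table))     # the <i> tags themselves get rewritten
```

With the real symbol table the same thing would happen to characters such as = and " inside the inserted font tags, which is why the single pass is the safer fit.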
    # Main entry into the program
    if __name__ == "__main__":
        toJSONFilter(highlight)
Obviously we want to run everything through our highlight filter.
When I think this is more mature I will look towards setting up a repository for this code. Other improvements I want to make:
/* this */ and // this.
Let’s see what happens next!