NOTE: Updated 01/12/2016.
I wanted to do a speed test to see how fast it is to process a large array for an idea I had for a neural network that is updated by some kernel method. I love and am very comfortable with Java, but am not completely against the idea of using another language if there is some benefit to doing so.
A language that is often pouted as being very fast is of course C - the reason why it is used in the Linux Kernel and appears in many other high end places. I decided to do a speed test in order to figure out whether the change would be worth it, so I devised a test to find out how much faster C would be that Java for processing large arrays.
There are some “gotchas” with this processing idea for the array - it has to be processed from the top to the bottom of the 2D array due to how the neurons speak to one another.
0001 ___ ___ ___ 0002 | L | C | R | 0003 |___|___|___| 0004 | | N | | 0005 |___|___|___| 0006 | | | | 0007 |___|___|___|
For the structure, N
is equal to a summation of A
, B
and C
- therefore they need to be processed first. This means that there is a limitation as to how parallelised the system can be made and care has to be taken in the threads, as will be seen in the below “Code” section.
For the first two examples, hopefully you will see that the code is very much similar in many respects and performance wise they have been given equal opportunity.
Also worth noting is that the 2 dimensional array has been unravelled into a 1 dimensional array for the purpose of speed. As the array is pretty large (~1 billion bytes) it makes a different.
0008 #include <stdio.h> 0009 #include <time.h> 0010 0011 #define size 30000 0012 #define total (size * size) 0013 #define cycles 100 0014 0015 char* grid; 0016 clock_t start; 0017 clock_t end; 0018 0019 int main(int argc, char** argv){ 0020 printf("\nBillion Neuron Speed Test\n"); 0021 printf("\nTotal neurons -> %i", total); 0022 printf("\nPer side -> %i", size); 0023 printf("\nCycles to test -> %i", cycles); 0024 printf("\nInitialising variable grid..."); 0025 grid = (char*)malloc(sizeof(char) * total); 0026 printf("\nGrid initialised!"); 0027 printf("\nExecuting speed test..."); 0028 start = clock(); 0029 int z; 0030 int x; 0031 for(z = 0; z < cycles; z++){ 0032 for(x = 0; x < total; x++){ 0033 grid[x] = x % 64; 0034 } 0035 } 0036 end = clock(); 0037 printf("\nEnd of speed test!"); 0038 printf("\nTest took %d ms", ((end - start) / 1000)); 0039 printf("\n%f cycles per sec", ((double)(cycles * 1000000) / (double)(end - start))); 0040 }
0041 public class Speed{ 0042 private static final int size = 30000; 0043 private static final int total = size * size; 0044 private static final int cycles = 100; 0045 private static byte[] grid; 0046 private static long start; 0047 private static long end; 0048 0049 public static void main(String[] args){ 0050 System.out.println("Billion Neuron Speed Test\n"); 0051 System.out.println("Total neurons -> " + total); 0052 System.out.println("Per side -> " + size); 0053 System.out.println("Cycles to test -> " + cycles); 0054 System.out.println("Initialising variable grid..."); 0055 grid = new byte[total]; 0056 System.out.println("Grid initialised!"); 0057 System.out.println("Executing speed test..."); 0058 start = System.currentTimeMillis(); 0059 int z; 0060 int x; 0061 for(z = 0; z < cycles; z++){ 0062 for(x = 0; x < total; x++){ 0063 grid[x] = (byte)(x % 64); 0064 } 0065 } 0066 end = System.currentTimeMillis(); 0067 System.out.println("End of speed test!"); 0068 System.out.println("Test took " + (end - start) + "ms"); 0069 System.out.println(((double)(cycles * 1000) / (double)(end - start)) + " cycles per sec"); 0070 } 0071 }
0072 import java.util.concurrent.atomic.AtomicInteger; 0073 0074 public class Speed2 extends Thread{ 0075 private static final int numThreads = 3; 0076 private static final int size = 30000; 0077 private static final int total = size * size; 0078 private static final int cycles = 100; 0079 private static AtomicInteger next = new AtomicInteger(0); 0080 private static int cDone = numThreads; 0081 private static AtomicInteger done = new AtomicInteger(0); 0082 private static byte[] grid; 0083 private static long start; 0084 private static long end; 0085 0086 public static void main(String[] args){ 0087 System.out.println("Billion Neuron Speed Test\n"); 0088 System.out.println("Total neurons -> " + total); 0089 System.out.println("Per side -> " + size); 0090 System.out.println("Cycles to test -> " + cycles); 0091 System.out.println("Initialising variable grid..."); 0092 grid = new byte[total]; 0093 System.out.println("Grid initialised!"); 0094 System.out.println("Executing speed test..."); 0095 start = System.currentTimeMillis(); 0096 /* NOTE: Size must be a multiple of what is being processed! */ 0097 int div = size / numThreads; 0098 Thread[] threads = new Thread[numThreads]; 0099 for(int x = 0; x < numThreads; x++){ 0100 threads[x] = new Speed2(div * x, div * (x + 1), 0, size); 0101 threads[x].start(); 0102 } 0103 while(done.get() < (numThreads * cycles)){ 0104 /* Process the threads */ 0105 if(done.get() == cDone){ 0106 /* Store future exit cycles */ 0107 cDone = done.get() + numThreads; 0108 /* As we have all threads waiting, set the count to zero */ 0109 next.set(0); 0110 /* Unfreeze all threads */ 0111 for(Thread t : threads){ 0112 t.interrupt(); 0113 } 0114 } 0115 } 0116 end = System.currentTimeMillis(); 0117 System.out.println("End of speed test!"); 0118 System.out.println("Test took " + (end - start) + "ms"); 0119 System.out.println(((double)(cycles * 1000) / (double)(end - start)) + " cycles per sec"); 0120 } 0121 0122 private int startX; 0123 private int endX; 0124 private int startY; 0125 private int endY; 0126 0127 public Speed2(int startX, int endX, int startY, int endY){ 0128 this.startX = startX; 0129 this.endX = endX; 0130 this.startY = startY; 0131 this.endY = endY; 0132 } 0133 0134 public void run(){ 0135 int c; 0136 int x; 0137 int y; 0138 int z; 0139 for(c = 0; c < cycles; c++){ 0140 for(y = startY; y < endY; y++){ 0141 z = y * size; 0142 for(x = startX; x < endX; x++){ 0143 grid[z + x] = (byte)(x % 64); 0144 } 0145 /* Update next counter */ 0146 next.incrementAndGet(); 0147 /* Wait for other threads to finish */ 0148 while((next.get() / numThreads) <= y){ 0149 /* Wait in small loop */ 0150 } 0151 } 0152 /* Let main loop we are done */ 0153 done.incrementAndGet(); 0154 /* Allow main thread to perform clean-up service */ 0155 try{ 0156 /* This thread should not be asleep that long! */ 0157 Thread.sleep(1000); 0158 }catch(InterruptedException e){ 0159 /* Do nothing */ 0160 } 0161 } 0162 } 0163 }
0164 $ gcc Speed.c 0165 $ ./a.out 0166 Billion Neuron Speed Test 0167 0168 Total neurons -> 900000000 0169 Per side -> 30000 0170 Cycles to test -> 100 0171 Initialising variable grid... 0172 Grid initialised! 0173 Executing speed test... 0174 End of speed test! 0175 Test took 281309 ms 0176 0.355481 cycles per sec
0177 $ javac Speed.java 0178 $ java Speed 0179 Billion Neuron Speed Test 0180 0181 Total neurons -> 900000000 0182 Per side -> 30000 0183 Cycles to test -> 100 0184 Initialising variable grid... 0185 Grid initialised! 0186 Executing speed test... 0187 End of speed test! 0188 Test took 135388ms 0189 0.7386178981889089 cycles per sec
0190 $ javac Speed2.java 0191 $ java Speed2 0192 Billion Neuron Speed Test 0193 0194 Total neurons -> 900000000 0195 Per side -> 30000 0196 Cycles to test -> 100 0197 Initialising variable grid... 0198 Grid initialised! 0199 Executing speed test... 0200 End of speed test! 0201 Test took 228783ms 0202 0.43709541355782555 cycles per sec
-O3
)0203 $ gcc -O3 Speed.c 0204 $ ./a.out 0205 Billion Neuron Speed Test 0206 0207 Total neurons -> 900000000 0208 Per side -> 30000 0209 Cycles to test -> 100 0210 Initialising variable grid... 0211 Grid initialised! 0212 Executing speed test... 0213 End of speed test! 0214 Test took 73882 ms 0215 1.353493 cycles per sec
Well… I knew the JIT is pretty cool, but I was not expecting a single Threaded Java application to out perform a C application (NOTE: Before optimising the source code.) of near equal code at all. This is very unexpected. I hope this is not because Java is taking some shortcuts as it knows the byte arrays are never read from - this was the purpose of setting the array to be some function as opposed to a pre-determined value such as zero (which Java may have skipped over or done something cool).
I did expect the multi-threaded Java code to be worse that the single thread, as you always pay the cost of the worst thread, meaning if any of them are paused to execute different tasks the whole program takes the performance hit. For even larger tasks this code may work much more efficiently.