tensorflow and V5
#8
I was able to compile TensorFlow Lite and run it on the V5. I used two different models: the first has been quantized to use only integers, which is the recommended mode for phones/SoCs; the second is more complex and uses floating point. On the V5 the simple model runs in about 1 s and uses all four cores, while the complex model runs in 52 s. On my desktop GPU the complex model runs in under 0.1 s.

These instructions were helpful.
https://medium.com/@haraldfernengel/comp...6b1b98e646

The input is a color image, 224 × 224 pixels. In both cases the test image was matched against 1,000 possible labels.

The executable is 1.5 MB; the quantized model is 4.3 MB and the complex model is 95 MB. These models start off as desktop-type models, which you then compile to run on the TensorFlow Lite runtime.

# time ./label_image
Loaded model ./mobilenet_quant_v1_224.tflite
resolved reporter
invoked 
average time: 1049.15 ms 
0.666667: 458 bow tie
0.290196: 653 military uniform
0.0117647: 835 suit
0.00784314: 611 jersey
0.00392157: 922 book jacket
real 0m 1.19s
user 0m 3.27s
sys 0m 0.00s

# time ./label_image
Loaded model ./inceptionv3_slim_2016.tflite
resolved reporter
invoked 
average time: 51670.8 ms 
8.80859: 653 military uniform
5.90491: 668 mortarboard
5.367: 401 academic gown
5.18728: 835 suit
4.91802: 458 bow tie
real 0m 53.09s
user 2m 59.21s
sys 0m 0.86s


The AI kernel is surprisingly small, only 12,000 lines of code. 

I don't have any info on how the EVE hardware works, but I suspect it is a fixed-point system similar to the first test. Based on my testing so far, my approach would be to first familiarize myself with TensorFlow Lite on ARM. Then, as I understand how it works, look into modifying it to call out to the EVE hardware where appropriate.

It looks like you port it to specialized hardware by writing routines that implement these operations. Each operation has a C++ implementation in the source code, so you can go one operation at a time and reimplement it with V5 hardware support. There are 63 opcodes in the source.

Porting the lite version appears to be much easier than porting the full version, since the lite version assumes fixed, precompiled models and does not have a JIT to adapt dynamically during training.

TfLiteRegistration* Register_RELU();
TfLiteRegistration* Register_RELU_N1_TO_1();
TfLiteRegistration* Register_RELU6();
TfLiteRegistration* Register_TANH();
TfLiteRegistration* Register_LOGISTIC();
TfLiteRegistration* Register_AVERAGE_POOL_2D();
TfLiteRegistration* Register_MAX_POOL_2D();
TfLiteRegistration* Register_L2_POOL_2D();
TfLiteRegistration* Register_CONV_2D();
TfLiteRegistration* Register_DEPTHWISE_CONV_2D();
TfLiteRegistration* Register_SVDF();
TfLiteRegistration* Register_RNN();
TfLiteRegistration* Register_BIDIRECTIONAL_SEQUENCE_RNN();
TfLiteRegistration* Register_UNIDIRECTIONAL_SEQUENCE_RNN();
TfLiteRegistration* Register_EMBEDDING_LOOKUP();
TfLiteRegistration* Register_EMBEDDING_LOOKUP_SPARSE();
TfLiteRegistration* Register_FULLY_CONNECTED();
TfLiteRegistration* Register_LSH_PROJECTION();
TfLiteRegistration* Register_HASHTABLE_LOOKUP();
TfLiteRegistration* Register_SOFTMAX();
TfLiteRegistration* Register_CONCATENATION();
TfLiteRegistration* Register_ADD();
TfLiteRegistration* Register_SPACE_TO_BATCH_ND();
TfLiteRegistration* Register_DIV();
TfLiteRegistration* Register_SUB();
TfLiteRegistration* Register_BATCH_TO_SPACE_ND();
TfLiteRegistration* Register_MUL();
TfLiteRegistration* Register_L2_NORMALIZATION();
.....
TfLiteRegistration* Register_SPARSE_TO_DENSE();

We have no docs, but my guess is that the EVE hardware is some kind of SIMD coprocessor designed for matrix operations, very likely controlled via microcode. If my guess is right, what we need is EVE microcode that implements the 63 core TensorFlow opcodes, and that is something Allwinner has to provide. Once we have microcode implementing the 63 ops, hooking it into TensorFlow Lite is trivial.

Example of the space_to_batch_nd operation

This operation divides "spatial" dimensions [1, ..., M] of the input into a grid of blocks of shape block_shape, and interleaves these blocks with the "batch" dimension (0) such that in the output, the spatial dimensions [1, ..., M] correspond to the position within the grid, and the batch dimension combines both the position within a spatial block and the original batch position. Prior to division into blocks, the spatial dimensions of the input are optionally zero padded according to paddings. See below for a precise description.

https://www.tensorflow.org/api_docs/pyth...o_batch_nd
RE: tensorflow lite and V5 - by jonsmirl - 12-08-2018, 01:51 AM