tensorflow and V5 - Printable Version

+- Lindeni Forum (http://forum.lindeni.org)
+-- Forum: Lindenis V5 (http://forum.lindeni.org/forumdisplay.php?fid=5)
+--- Forum: General Discussion on Lindenis V5 (http://forum.lindeni.org/forumdisplay.php?fid=6)
+--- Thread: tensorflow and V5 (/showthread.php?tid=319)
tensorflow and V5 - jonsmirl - 11-17-2018

One way to eliminate the need for specialized tools to program the EVE AI hardware would be to provide a tensorflow driver for it. Then you could use the tensorflow tools to build the classifiers and run them on the V5. In our use case we'd like to build classifiers that detect audio patterns. This Qualcomm chip is similar to the V5 and it uses tensorflow...
https://www.cnx-software.com/2018/04/12/qualcomm-qcs603-qcs605-iot-socs-are-designed-for-ai-and-computer-vision-applications/

RE: tensorflow and V5 - given - 12-04-2018

Thanks for your advice. I think this is the next step in our work on the V5 Lindeni boards.

RE: tensorflow and V5 - jonsmirl - 12-04-2018

(12-04-2018, 02:10 AM)given Wrote: Thanks for your advice. I think this is the next step in our work on the V5 Lindeni boards.

The way you do the port is to start from the ARM64 backend and then modify it to use the dedicated hardware instead of the main CPU.
https://www.tensorflow.org/xla/developing_new_backend
But you have to have detailed docs on the AI hardware to do this. There is virtually no info on the AI hardware in the existing docs.

RE: tensorflow and V5 - given - 12-04-2018

(12-04-2018, 02:15 AM)jonsmirl Wrote: (12-04-2018, 02:10 AM)given Wrote: Thanks for your advice. I think this is the next step in our work on the V5 Lindeni boards.

Yes, I think so. For this work we need to communicate with Allwinner and make it happen together.

RE: tensorflow and V5 - jonsmirl - 12-04-2018

Tensorflow will run on the V5 right now using the ARM64 backend, it just runs slowly. You can also run it on your desktop using the GPU in your graphics card. Depending on how the V5 AI hardware works, it may be possible to train the network on your desktop GPU and then run only the trained network on the V5. That will greatly speed up the training process.

The V5 hardware may actually be quite good at voice recognition once it is possible to train it on new models. You might think audio is one-dimensional, but when you use a neural network on audio you first compute the FFT. Doing that turns the recording into 2D data similar to an image, and the V5 has image processing hardware! This tensorflow article shows the technique.
https://www.tensorflow.org/tutorials/sequences/audio_recognition
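To make the FFT-into-image point concrete, here is a minimal C++ sketch of the idea: frame the audio, take DFT magnitudes per frame, and you get a 2D frames-by-bins array, i.e. a spectrogram. All names here are illustrative, not from any V5 SDK, and a real pipeline would add a window function and log-mel scaling.

// Minimal sketch: turn 1-D audio into a 2-D spectrogram (frames x bins).
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<std::vector<float>> Spectrogram(const std::vector<float>& audio,
                                            int frame_len, int hop) {
    const float kPi = 3.14159265f;
    std::vector<std::vector<float>> frames;
    for (size_t start = 0; start + frame_len <= audio.size(); start += hop) {
        std::vector<float> bins(frame_len / 2);
        for (int k = 0; k < frame_len / 2; ++k) {       // naive DFT, O(N^2)
            float re = 0.f, im = 0.f;
            for (int n = 0; n < frame_len; ++n) {
                float phase = 2.f * kPi * k * n / frame_len;
                re += audio[start + n] * std::cos(phase);
                im -= audio[start + n] * std::sin(phase);
            }
            bins[k] = std::sqrt(re * re + im * im);     // magnitude
        }
        frames.push_back(bins);                          // one "image" row
    }
    return frames;
}

int main() {
    std::vector<float> audio(16000);                     // 1 s at 16 kHz
    for (size_t i = 0; i < audio.size(); ++i)            // a pure 440 Hz tone
        audio[i] = std::sin(2.f * 3.14159265f * 440.f * i / 16000.f);
    auto spec = Spectrogram(audio, 256, 128);
    std::printf("%zu frames x %zu bins\n", spec.size(), spec[0].size());
}

The output array can be fed to the same 2D convolution hardware that processes camera frames, which is the whole point of the technique.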
RE: tensorflow and V5 - given - 12-04-2018

(12-04-2018, 02:42 AM)jonsmirl Wrote: Tensorflow will run on the V5 right now using the ARM64 backend, it just runs slowly. You can also run it on your desktop using the GPU in your graphics card. Depending on how the V5 AI hardware works, it may be possible to train the network on your desktop GPU and then run only the trained network on the V5. That will greatly speed up the training process.

Yes, training requires a great deal of GPU resources. I think it is too difficult for an embedded system to do training, not to mention that there is no GPU in the V5.

RE: tensorflow and V5 - jonsmirl - 12-06-2018

Once your company becomes skilled at writing the V5 hardware-specific backend for Tensorflow, I suspect you will have a valuable skill that can easily be marketed to other chip vendors making AI engines. It is pretty obvious that every chip vendor is going to start adding AI engines to their SOCs. That implies that every chip vendor will also need Tensorflow (or similar) drivers for their hardware.

I just noticed that Google has a new lightweight run-time for Tensorflow.
https://www.tensorflow.org/lite/
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite
This library supports being modified for custom hardware. It is also designed for lower-memory environments like an embedded device or cell phone. It is used in the Nest camera.

RE: tensorflow lite and V5 - jonsmirl - 12-08-2018

I was able to compile tensorflow lite and run it on the V5. I used two different models: the first has been quantized to use only integers, which is the recommended mode for phones/SOCs; the second model is more complex and uses floating point. On the V5 the simple model runs in about 1 s and uses all four cores; the complex model runs in 52 s. On my desktop GPU the complex model runs in under 0.1 s. These instructions were helpful.
https://medium.com/@haraldfernengel/compiling-tensorflow-lite-for-a-raspberry-pi-786b1b98e646

The input is color images, 224 x 224 pixels. In both cases the test image was matched against 1,000 possible labels. The executable is 1.5 MB; the quantized model is 4.3 MB and the complex model is 95 MB. These models started off as desktop-type models which are then compiled to run on the tensorflow lite run-time.

# time ./label_image
Loaded model ./mobilenet_quant_v1_224.tflite
resolved reporter
invoked
average time: 1049.15 ms
0.666667: 458 bow tie
0.290196: 653 military uniform
0.0117647: 835 suit
0.00784314: 611 jersey
0.00392157: 922 book jacket
real    0m 1.19s
user    0m 3.27s
sys     0m 0.00s

# time ./label_image
Loaded model ./inceptionv3_slim_2016.tflite
resolved reporter
invoked
average time: 51670.8 ms
8.80859: 653 military uniform
5.90491: 668 mortarboard
5.367: 401 academic gown
5.18728: 835 suit
4.91802: 458 bow tie
real    0m 53.09s
user    2m 59.21s
sys     0m 0.86s

The AI kernel is surprisingly small, only 12,000 lines of code. I don't have any info on how the EVE hardware works, but I suspect it is a fixed-point system similar to the first test.

Based on my testing so far, my approach would be to first familiarize myself with tensorflow lite on ARM, and then, as I understand how it works, look into modifying it to call out to the EVE hardware where appropriate.
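As a starting point for that familiarization, the C++ flow that label_image follows is roughly the sketch below. This is a sketch only: header paths follow the master tree linked above (they moved from tensorflow/contrib/lite around this time, so they may need adjusting), error handling is trimmed, and the model filename is simply the one from the test run.

// Minimal sketch of running a quantized .tflite model, roughly what
// label_image does.
#include <cstdint>
#include <cstdio>
#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // Load the flatbuffer model.
    std::unique_ptr<tflite::FlatBufferModel> model =
        tflite::FlatBufferModel::BuildFromFile("mobilenet_quant_v1_224.tflite");
    if (!model) { std::fprintf(stderr, "failed to load model\n"); return 1; }

    // The resolver maps each opcode in the model to its C++ kernel.
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    interpreter->SetNumThreads(4);            // the V5 run above used all four cores
    interpreter->AllocateTensors();

    // Quantized model: input is 1x224x224x3 uint8 RGB.
    uint8_t* input = interpreter->typed_input_tensor<uint8_t>(0);
    (void)input;                              // fill with image bytes here

    interpreter->Invoke();

    // Output is one uint8 score per label (1,000 labels in the test).
    uint8_t* scores = interpreter->typed_output_tensor<uint8_t>(0);
    std::printf("score[0] = %d\n", scores[0]);
    return 0;
}

The resolver is the interesting part for a port: it is the table that decides which kernel implements each opcode, so an EVE-backed kernel could be substituted there op by op.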
It looks like you port it to specialized hardware by writing routines that implement these operations. Each of these operations has a C++ implementation in the source code, so you can go one operation at a time and reimplement it with V5 hardware support. There are 63 opcodes in the source. Porting the lite version appears to be much easier than porting the full version, since the lite version assumes fixed, precompiled models and does not have a JIT that dynamically adapts during training.

TfLiteRegistration* Register_RELU();
TfLiteRegistration* Register_RELU_N1_TO_1();
TfLiteRegistration* Register_RELU6();
TfLiteRegistration* Register_TANH();
TfLiteRegistration* Register_LOGISTIC();
TfLiteRegistration* Register_AVERAGE_POOL_2D();
TfLiteRegistration* Register_MAX_POOL_2D();
TfLiteRegistration* Register_L2_POOL_2D();
TfLiteRegistration* Register_CONV_2D();
TfLiteRegistration* Register_DEPTHWISE_CONV_2D();
TfLiteRegistration* Register_SVDF();
TfLiteRegistration* Register_RNN();
TfLiteRegistration* Register_BIDIRECTIONAL_SEQUENCE_RNN();
TfLiteRegistration* Register_UNIDIRECTIONAL_SEQUENCE_RNN();
TfLiteRegistration* Register_EMBEDDING_LOOKUP();
TfLiteRegistration* Register_EMBEDDING_LOOKUP_SPARSE();
TfLiteRegistration* Register_FULLY_CONNECTED();
TfLiteRegistration* Register_LSH_PROJECTION();
TfLiteRegistration* Register_HASHTABLE_LOOKUP();
TfLiteRegistration* Register_SOFTMAX();
TfLiteRegistration* Register_CONCATENATION();
TfLiteRegistration* Register_ADD();
TfLiteRegistration* Register_SPACE_TO_BATCH_ND();
TfLiteRegistration* Register_DIV();
TfLiteRegistration* Register_SUB();
TfLiteRegistration* Register_BATCH_TO_SPACE_ND();
TfLiteRegistration* Register_MUL();
TfLiteRegistration* Register_L2_NORMALIZATION();
.....
TfLiteRegistration* Register_SPARSE_TO_DENSE();

We have no docs, but my guess is that the EVE hardware is some kind of SIMD coprocessor designed for matrix operations, very likely controlled via microcode. If my guess is right, what we need is EVE microcode that implements the 63 core tensorflow opcodes, and that is something Allwinner has to provide. Once we have the microcode implementing the 63 ops, it is trivial to hook it into tensorflow lite.

Example of the space_to_batch_nd operation:

This operation divides "spatial" dimensions [1, ..., M] of the input into a grid of blocks of shape block_shape, and interleaves these blocks with the "batch" dimension (0) such that in the output, the spatial dimensions [1, ..., M] correspond to the position within the grid, and the batch dimension combines both the position within a spatial block and the original batch position. Prior to division into blocks, the spatial dimensions of the input are optionally zero-padded according to paddings. See the linked page for a precise description.
https://www.tensorflow.org/api_docs/python/tf/space_to_batch_nd

RE: tensorflow and V5 - given - 12-10-2018

Dear jonsmirl, we have already briefly run Tensorflow Lite on the V5 board. The idea you mention of porting TensorFlow Lite to use the EVE hardware is great. We will discuss it with Allwinner, because none of the EVE docs have been released yet.

RE: tensorflow and V5 - jonsmirl - 12-10-2018

When reading through the EVE documents I noticed this:

1.2.1 EVE Feature Description
1. 360p detection speed > 30 FPS (operating frequency > 300 MHz)
2. Supports classic HAAR feature classifier detection, with up to 3200 features
3. Supports input resolutions up to 4K with internal scaling; supports region-of-interest detection
4. Supports 4-channel integral image computation, processing 1.3 billion features per second
5. Supports 3-channel feature computation
6. Supports user-defined target sizes and 432 different kinds of detection windows
7. Supports single-image detection down to 64x64 pixels
8. Low power consumption: full-frame 360p detection < 57 mW
9. Visual classifier design, with a complete set of training, debugging, testing and evaluation tools
10. Customizable classifiers; supports detection of arbitrary rigid targets with small deformation (faces, vehicles, license plates, pedestrians, ...

It is a HAAR feature classifier. That is an old system which predates the development of tensorflow-type systems. So it is not clear whether tensorflow can make use of the EVE hardware; it depends on how they implemented HAAR. Is it microcode or a ROM? Only an AW engineer will know the answer. It does support customizable classifiers, so it may be microcode. On the other hand, a HAAR classifier is fine for our use case, but we need to be able to retrain it.
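For readers unfamiliar with HAAR classifiers, the trick behind the "1.3 billion features per second" figure in item 4 is the integral image: after one pass over the frame, the sum of any rectangle costs four lookups, so rectangle-difference features become almost free. Below is a textbook sketch of that trick, not EVE code, just an illustration of what the hardware accelerates.

// Integral image plus one HAAR-style two-rectangle feature.
#include <cstdio>
#include <vector>

struct Integral {
    int w, h;
    std::vector<long> s;                         // (w+1) x (h+1), zero border
    Integral(const std::vector<unsigned char>& img, int w, int h)
        : w(w), h(h), s((w + 1) * (h + 1), 0) {
        for (int y = 1; y <= h; ++y)             // one pass over the frame
            for (int x = 1; x <= w; ++x)
                s[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                                   + s[(y - 1) * (w + 1) + x]
                                   + s[y * (w + 1) + x - 1]
                                   - s[(y - 1) * (w + 1) + x - 1];
    }
    // Sum of pixels in [x, x+rw) x [y, y+rh): four lookups, O(1).
    long Rect(int x, int y, int rw, int rh) const {
        return s[(y + rh) * (w + 1) + x + rw] - s[y * (w + 1) + x + rw]
             - s[(y + rh) * (w + 1) + x]      + s[y * (w + 1) + x];
    }
};

int main() {
    int w = 64, h = 64;                          // the minimum size in item 7
    std::vector<unsigned char> img(w * h, 0);
    for (int y = 0; y < h / 2; ++y)              // bright top half
        for (int x = 0; x < w; ++x) img[y * w + x] = 200;

    Integral ii(img, w, h);
    // Two-rectangle edge feature: top half minus bottom half.
    long feature = ii.Rect(0, 0, w, h / 2) - ii.Rect(0, h / 2, w, h / 2);
    std::printf("edge feature = %ld\n", feature); // large => horizontal edge
}

Training a HAAR cascade is then a matter of selecting and thresholding thousands of such rectangle features, which is why retraining support from the vendor tools matters so much here.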