51OpenLab: One-Stop ICT Innovation Service Platform

[2023 Intel Essay Contest] Building a Point-and-Read OCR Application on the AlxBoard Development Board

openvino2021_sdk, updated 7 months ago

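OpenVINO Scene Text Detection

OpenVINO is Intel's deep learning model deployment framework; the latest release at the time of writing is OpenVINO 2023. OpenVINO 2023 comes with the Model Zoo, a library of pretrained models for common vision tasks. The Model Zoo model for scene text detection is text-detection-0003, a scene text detection network based on the PixelLink architecture.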

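Figure 1: PixelLink network architecture

The input and output formats of the PixelLink scene text detection model in Figure 1 are as follows.

Input format: 1x3x768x1280, BGR color image

Output format:

name: "model/link_logits_/add", [1x16x192x320]: the PixelLink link output

name: "model/segm_logits/add", [1x2x192x320]: per-pixel text/no-text classification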
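
As a rough illustration of how the segmentation output can be read, the sketch below applies a softmax across the two channels of a [1x2x192x320] array and thresholds the text probability. It assumes channel 1 is the "text" class and uses a 0.5 threshold; both are assumptions, and the function names are hypothetical. Grouping pixels into separate text boxes with the link output is not covered here.

```python
import numpy as np

def text_mask_from_logits(segm_logits, threshold=0.5):
    """Turn [1, 2, H, W] text/no-text logits into a boolean [H, W] mask.

    Assumption: channel 1 holds the "text" class.
    """
    logits = segm_logits[0]                               # [2, H, W]
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    prob = exp / exp.sum(axis=0, keepdims=True)           # softmax over the 2 classes
    return prob[1] > threshold

def map_to_frame(x, y, frame_w, frame_h, map_w=320, map_h=192):
    """Scale a point on the 192x320 score map to camera-frame pixel coordinates."""
    return int(x * frame_w / map_w), int(y * frame_h / map_h)
```

Because the score map is a quarter of the 768x1280 network input in each dimension, any box found on the mask has to be scaled back to the camera frame before it can be compared with a gesture-selected region.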

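OpenVINO Text Recognition

The OpenVINO model for text recognition (digits and English letters) is text-recognition-0012 from the Model Zoo. It is a typical CRNN model: a VGG-like convolutional backbone followed by a bidirectional LSTM encoder-decoder head.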

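Figure 2: CRNN network architecture

The input and output formats of the text recognition model in Figure 2 are as follows:

Input format: 1x1x32x120

Output format: 30, 1, 37

The output is decoded with CTC greedy decoding. The 37 is the size of the character set, which is: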

0123456789abcdefghijklmnopqrstuvwxyz#

# denotes the blank symbol.
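
CTC greedy decoding over an output of shape [30, 1, 37] can be sketched as follows: take the best class at each of the 30 time steps, collapse consecutive repeats, then drop the blanks. This is a minimal illustration, not the code used inside TextDetectAndRecognizer; the function name is hypothetical.

```python
import numpy as np

CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz#"
BLANK = CHARSET.index("#")  # the last entry is the CTC blank

def ctc_greedy_decode(output):
    """Decode a [T, 1, 37] score array into a string (greedy CTC)."""
    best = np.argmax(output, axis=2)[:, 0]   # best class index per time step
    chars = []
    prev = BLANK
    for idx in best:
        # Emit a character only when it is not blank and not a repeat
        # of the previous time step.
        if idx != BLANK and idx != prev:
            chars.append(CHARSET[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical characters is what allows doubled letters: the per-step sequence h, h, e, l, #, l, o decodes to "hello".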

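MediaPipe Hand Gesture Recognition

MediaPipe, the development package Google released in 2020, bundles landmark detection and tracking algorithms for hands, body pose, and more. Hand tracking is implemented with two models: one detects the palm, and the other marks the landmarks of the hand.

Figure 3: Hand landmark positions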
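
MediaPipe reports each detected hand as 21 landmarks whose x and y are normalized to the frame size, with index 8 being the index fingertip. The sketch below converts that landmark to pixel coordinates. The Landmark tuple is a hypothetical stand-in for MediaPipe's landmark objects so the example runs without a camera, and clamping to the frame is an added assumption, since normalized values can fall slightly outside [0, 1].

```python
from collections import namedtuple

# Hypothetical stand-in for a MediaPipe landmark (only x and y are used here).
Landmark = namedtuple("Landmark", ["x", "y"])

INDEX_FINGER_TIP = 8  # index of the index fingertip among the 21 hand landmarks

def fingertip_pixel(landmarks, frame_w, frame_h):
    """Return the index fingertip as (x, y) pixel coordinates inside the frame."""
    tip = landmarks[INDEX_FINGER_TIP]
    x = int(tip.x * frame_w)
    y = int(tip.y * frame_h)
    # Clamp so the point can be used directly for drawing and cropping.
    x = min(max(x, 0), frame_w - 1)
    y = min(max(y, 0), frame_h - 1)
    return x, y
```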

Installing OpenVINO and MediaPipe

pip install openvino==2023.0.2

pip install mediapipe
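Install the OpenCV-Python package first, as it is a required dependency.

Building the Application

First, open a USB camera or the laptop webcam with OpenCV and read video frames. Hand landmark detection runs on every frame, and from the detected landmarks the index fingertip position of the left and right hand is taken (landmark 8 in Figure 3). The two fingertips define the ROI selected by the gesture. The current frame is also passed to the OpenVINO scene text recognition module, which performs scene text recognition. Finally, the gesture-selected region is compared with each text region in the recognition result by computing their intersection over union (IoU); for any region whose IoU is greater than 0.5, the corresponding OCR result is returned and displayed on screen. The overall flow is as follows:

Figure 4: Program execution flow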
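
The region-matching step described above can be sketched in plain Python: build an ordered rectangle from the two fingertip points, compute IoU against each text box, and keep the texts above the 0.5 threshold. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples in pixels, and all function names are hypothetical; the internals of TextDetectAndRecognizer may differ.

```python
def fingertip_roi(p1, p2):
    """Build an ordered (x_min, y_min, x_max, y_max) box from two fingertip points."""
    (x1, y1), (x2, y2) = p1, p2
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_text(roi, ocr_results, threshold=0.5):
    """Keep the texts whose box overlaps the gesture ROI with IoU above the threshold.

    ocr_results is a list of (box, text) pairs.
    """
    return [text for box, text in ocr_results if iou(roi, box) > threshold]
```

Ordering the corners with min and max means the result does not depend on which hand is on the left or which fingertip is higher in the frame.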
Code Implementation

Following the flow in Figure 4, scene text detection and recognition are wrapped in a class, TextDetectAndRecognizer. The resulting main program is:
import cv2 as cv
import numpy as np
import mediapipe as mp
from text_detector import TextDetectAndRecognizer

# Character set of the recognition model; '#' is the CTC blank.
digit_nums = ['0','1', '2','3','4','5','6','7','8','9','a','b','c','d','e','f','g',
              'h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','#']

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

x0 = 0
y0 = 0
detector = TextDetectAndRecognizer()

# For webcam input:
cap = cv.VideoCapture(0)
cap.set(cv.CAP_PROP_FRAME_HEIGHT, 1080)
cap.set(cv.CAP_PROP_FRAME_WIDTH, 1920)
height = cap.get(cv.CAP_PROP_FRAME_HEIGHT)
width = cap.get(cv.CAP_PROP_FRAME_WIDTH)
# out = cv.VideoWriter("D:/test777.mp4", cv.VideoWriter_fourcc('D', 'I', 'V', 'X'), 15, (int(width), int(height)), True)
with mp_hands.Hands(
        min_detection_confidence=0.75,
        min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, image = cap.read()

        if not success:
            break

        # MediaPipe expects RGB input; mark the frame read-only while it is processed.
        image.flags.writeable = False
        h, w, c = image.shape
        image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
        results = hands.process(image)

        # Back to BGR for OpenCV drawing and for the text detector.
        image = cv.cvtColor(image, cv.COLOR_RGB2BGR)
        x1 = -1
        y1 = -1
        x2 = -1
        y2 = -1
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(
                    image,
                    hand_landmarks,
                    mp_hands.HAND_CONNECTIONS)
                for idx, landmark in enumerate(hand_landmarks.landmark):
                    # np.int was removed in NumPy 1.24; the built-in int is used instead.
                    x0 = int(landmark.x * w)
                    y0 = int(landmark.y * h)
                    cv.circle(image, (x0, y0), 4, (0, 0, 255), 4, cv.LINE_AA)
                    # Landmark 8 is the index fingertip; the first hand sets (x1, y1).
                    if idx == 8 and x1 == -1 and y1 == -1:
                        x1 = x0
                        y1 = y0
                        cv.circle(image, (x1, y1), 4, (0, 255, 0), 4, cv.LINE_AA)
                    # With a second hand, (x2, y2) is overwritten by its fingertip.
                    if idx == 8 and x1 > 0 and y1 > 0:
                        x2 = x0
                        y2 = y0
                        cv.circle(image, (x2, y2), 4, (0, 255, 0), 4, cv.LINE_AA)

            # Run OCR only when the two fingertips span a usable rectangle.
            if abs(x1-x2) > 10 and abs(y1-y2) > 10 and x1 > 0 and x2 > 0:
                if x1 < x2:
                    cv.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 2, 8)
                    text = detector.inference_image(image, (x1, y1, x2, y2))
                    cv.putText(image, text, (x1, y1), cv.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
                else:
                    cv.rectangle(image, (x2, y2), (x1, y1), (255, 0, 0), 2, 8)
                    text = detector.inference_image(image, (x2, y2, x1, y1))
                    cv.putText(image, text, (x2, y2), cv.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)

        # Show the annotated frame.
        cv.imshow('MediaPipe Hands', image)
        # out.write(image)
        # Press Esc to quit.
        if cv.waitKey(1) & 0xFF == 27:
            break

cap.release()
# out.release()

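Porting to the AlxBoard Development Board

On the AlxBoard you only need to install MediaPipe. OpenVINO does not need to be installed, because the board ships with OpenCV and OpenVINO. Copy the Python files to the board, plug in a USB camera, and run the script from the command line. This gives a gesture-selected scene text recognition application running on the AlxBoard. The run and test results are shown below:

Figure 5: Scene text recognition inside the gesture-selected region

Figure 6: English word recognition inside the gesture-selected region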

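Next steps:

Install the text-to-speech package: pip install pyttsx (on Python 3 the package is usually installed as pyttsx3).

The AlxBoard has a 3.5 mm headphone/mic jack and supports speech output. If the text recognized in the selected region is passed directly to pyttsx, the pipeline goes from gesture recognition all the way to speech, which would allow a read-along flash card application for children starting to learn English words. We plan to implement this next, so stay tuned.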
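
A minimal sketch of the speech step, assuming the pyttsx3 package (the Python 3 fork of pyttsx) and an audio output device on the board. The function name speak_text is hypothetical, and the engine can be passed in so the same code can be exercised without audio hardware.

```python
def speak_text(text, engine=None):
    """Speak the recognized text; return False when there is nothing to say."""
    text = text.strip()
    if not text:
        return False
    if engine is None:
        # Assumption: pyttsx3 is installed; imported lazily so the rest of
        # the application still works without it.
        import pyttsx3
        engine = pyttsx3.init()
    engine.say(text)        # queue the utterance
    engine.runAndWait()     # block until speech has finished
    return True
```

In the main loop this would be called with the string returned by detector.inference_image. Because runAndWait blocks, speaking only when the recognized text changes would help keep the video display responsive.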