How does a computer recognize human actions autonomously?


By Fernando Camarena, Ricardo Cuevas, Miguel González and Leonardo Chang
Popular science article

At some point in our lives we have all had the dream, or perhaps the fear, that what we used to call fiction would become reality. Will the intersection between computer vision (CV) and artificial intelligence (AI) be the key to this path?

Very likely, without realizing it, we have already benefited from these technologies. When we pick up our cell phone, we find CV and AI applications that range from biometric facial recognition to the modification of digital images with photo filters, widely used on social networks.

However, the impact and limits of these applications are far from being reached. Through its postgraduate programs, the Tecnológico de Monterrey boosts the synergy between students and research professors to develop potentially transformative applications. Within the Machine Learning research line, we find different projects, including intelligent video surveillance, which are pursued both theoretically and practically.

Intelligent video surveillance aims to grant “capabilities” to video cameras, including the detection of anomalies, fights, weapons, or accidents. To achieve this, we need to teach a computer to see and understand its visual environment. But how can we do this? In this article we will explain the process by which a computer recognizes human actions in videos.

The first step is to understand what an action is. Imagine that we wave at a person. The mental image we create probably involves the characteristic movement of the hand: unconsciously, we associate a message with a sequence of gestures. It is precisely this codified sequence that we will call an “action”.

Now that we understand what we want to recognize, let us briefly explain how a digital image is formed. A camera is equipped with a sensor that is sensitive to visible light. This sensor contains multiple “pixels” arranged in a grid. Each pixel produces an electric charge proportional to the amount of light it receives, which is then converted into a value in the range of 0 to 255. In other words, a digital image is a matrix that contains values between 0 and 255, as shown in Figure 1.

Figure 1: Formation of a digital image
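As a quick illustration, the sketch below (in Python, assuming OpenCV and NumPy are installed; the file name is only an example of ours) loads a picture in grayscale and inspects the resulting matrix of values between 0 and 255.

```python
import cv2  # OpenCV, a common library for reading and processing images

# Load a picture in grayscale: the result is a matrix (height x width)
# whose entries are integers between 0 and 255. "photo.jpg" is a
# hypothetical file name used only for this example.
image = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

print(image.shape)   # e.g. (480, 640): number of rows and columns of pixels
print(image[0, 0])   # intensity of the top-left pixel, between 0 and 255
print(image.dtype)   # uint8, i.e., 8-bit values from 0 to 255
```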

A video is just a sequence of images, and it can be used to recognize an action. The method proposed by our research group at the Tecnológico de Monterrey consists of three steps:

  • Characterization of the video
  • Training an artificial intelligence model
  • Inference with the artificial intelligence model

Characterization of the video

Two people who do not speak the same language need a translation device that allows them to understand each other. The same thing happens with computers: it is necessary to transform the data so that a computer can understand the concept of an action in the best possible way.

We will call this step “characterization of the video” and divide it into three sub-steps: sampling, tracking, and observation. Sampling consists of deciding which parts of the image we are going to analyze. The trivial option is to examine the entire matrix of information, but the computation would be slow and mostly unnecessary. The better way is to identify the locations of interest. In the proposed approach, we suggest that estimating the human pose helps to reduce the search space, as shown in Figure 2.

Figure 2: Selection of the region of interest
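To make the sampling idea concrete, here is a minimal sketch. It assumes a hypothetical helper, estimate_person_bbox, that wraps a pose estimator and returns the bounding box of the person (the actual estimator used in the research is not shown here); the sketch then keeps only the grid points that fall inside that region of interest.

```python
import numpy as np

def dense_grid(height, width, step=10):
    """All candidate locations on a regular grid over the image."""
    ys, xs = np.mgrid[0:height:step, 0:width:step]
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

def sample_points(frame, estimate_person_bbox):
    """Keep only the grid points inside the person's bounding box.

    `estimate_person_bbox` is a hypothetical function that wraps a pose
    estimator and returns (x_min, y_min, x_max, y_max) for the person.
    """
    h, w = frame.shape[:2]
    points = dense_grid(h, w)
    x_min, y_min, x_max, y_max = estimate_person_bbox(frame)
    inside = ((points[:, 0] >= x_min) & (points[:, 0] <= x_max) &
              (points[:, 1] >= y_min) & (points[:, 1] <= y_max))
    return points[inside]  # the search space is now much smaller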

At this point, we have the locations that the algorithm will use. But a video is a set of images, so the next step is to follow those locations of interest throughout the whole video. This is usually done with optical flow algorithms, whose intuitive idea is shown in Figure 3.

Figure 3: The objective of the optical flow algorithm is to know the position of the ball in the following frames. If we assume that the ball corresponds to a value of 255 and the empty spaces to a value of 0, then it is easy to find the ball in the next frame; we only have to look for where the value 255 is located.
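The intuition in Figure 3 can be written in a couple of lines of NumPy: if the ball is the only pixel with value 255 in an otherwise empty (zero) frame, finding it in the next frame is just a matter of locating that value. This toy example is ours, for illustration only.

```python
import numpy as np

# A toy 5x5 "frame": zeros everywhere except the ball, marked with 255.
frame = np.zeros((5, 5), dtype=np.uint8)
frame[3, 1] = 255  # in this frame, the ball sits at row 3, column 1

# Locate the ball: find where the value 255 appears.
row, col = np.argwhere(frame == 255)[0]
print(row, col)  # 3 1
```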

By applying the optical flow algorithm to each of the images, we obtain a sequence of locations over time. This sequence is called a “trajectory”. There will be one trajectory for each region of interest in the analysis, as shown in Figure 4.

Figure 4: Formation of a trajectory
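Below is a minimal sketch of trajectory formation, assuming OpenCV is available: the sampled points are followed from frame to frame with the Lucas-Kanade optical flow algorithm (one of several optical flow methods; this particular choice is ours for illustration), and the positions visited by each point form its trajectory.

```python
import cv2
import numpy as np

def track_trajectories(gray_frames, initial_points):
    """Follow each sampled point across the video with Lucas-Kanade optical flow.

    gray_frames: list of grayscale frames (2-D uint8 arrays).
    initial_points: array of (x, y) locations sampled in the first frame.
    Returns one trajectory (list of positions over time) per point.
    """
    points = np.float32(initial_points).reshape(-1, 1, 2)
    trajectories = [[tuple(p.ravel())] for p in points]

    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, points, None)
        for traj, p, ok in zip(trajectories, new_points, status.ravel()):
            if ok:  # the point was found again in the next frame
                traj.append(tuple(p.ravel()))
        points = new_points

    return trajectories
```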

Observation is the last sub-step. It is a process that consists of answering questions about the space and time in which the trajectory takes place. For example: what was the movement of the trajectory like? In which region did the movement occur? These kinds of questions can be answered with histograms applied to both shape and flow.
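As a rough sketch of this “observation” sub-step (a simplified version of ours; the descriptors used in the literature, such as histograms of oriented gradients and histograms of optical flow, are more elaborate), the displacement between consecutive points of a trajectory can be summarized with a histogram of movement directions.

```python
import numpy as np

def trajectory_histogram(trajectory, bins=8):
    """Summarize how a trajectory moved with a histogram of directions.

    trajectory: list of (x, y) positions of one point over time.
    Returns a normalized histogram over `bins` orientation ranges.
    """
    points = np.asarray(trajectory, dtype=np.float32)
    displacements = np.diff(points, axis=0)                         # movement between frames
    angles = np.arctan2(displacements[:, 1], displacements[:, 0])   # direction of each movement
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)                                # normalize to compare videos
```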

Training the AI model

Figure 5: Steps to train an AI model

At this moment, we have a set of observations of the video. But for a computer to learn, it is necessary to train an artificial intelligence (AI) model. A simple analogy is building objects out of clay.

The raw clay represents an AI model, which at first has no shape at all. Depending on what we want, we will shape it into a pitcher, a vase, or any other object of our choice. The process of shaping the object is known as the training algorithm. This algorithm takes a “concept” in the form of observations and uses it to build the required knowledge. In this case, the observations are the trajectories extracted in the previous step.

The literature on the topic offers multiple training algorithms with different advantages and disadvantages, but one of the best-known combinations uses techniques inspired by text processing, such as the bag of visual words, together with support vector machines. Explaining the mathematics behind these techniques can be quite involved, but the intuitive idea is to find patterns of the type:

“The trajectories of the walking action occur in the lower part of the image. In contrast, in the waving action, the movement occurs in the upper part.”
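Below is a minimal sketch of that training idea using scikit-learn: the trajectory descriptors of all training videos are grouped into a vocabulary of “visual words” with k-means, each video is represented by a histogram of those words, and a support vector machine learns to separate the actions. Function and variable names are our own; the exact configuration of the method is described in the referenced papers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_action_model(video_descriptors, labels, vocabulary_size=100):
    """Bag of visual words + support vector machine, in its simplest form.

    video_descriptors: list with one array per video, of shape (n_trajectories, d).
    labels: the action label of each video (e.g., "walk", "wave").
    """
    # 1. Build the visual vocabulary from all trajectory descriptors.
    all_descriptors = np.vstack(video_descriptors)
    vocabulary = KMeans(n_clusters=vocabulary_size, n_init=10).fit(all_descriptors)

    # 2. Represent each video as a histogram of visual words.
    def to_histogram(descriptors):
        words = vocabulary.predict(descriptors)
        hist, _ = np.histogram(words, bins=vocabulary_size, range=(0, vocabulary_size))
        return hist / max(hist.sum(), 1)

    features = np.array([to_histogram(d) for d in video_descriptors])

    # 3. Train the support vector machine on the histograms.
    classifier = SVC(kernel="rbf").fit(features, labels)
    return vocabulary, classifier, to_histogram
```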

Inference with the AI model

Returning to our clay analogy, once the object has been modeled, the next step is to use it. This is known as model inference; in other words, using the model for what it was created. One example of inference is to use these models on the Tecnológico de Monterrey camera network to identify possible incidents inside a campus.
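Continuing the hypothetical training sketch above, inference means applying the trained model to a new, unseen video: extract its trajectories, turn them into the same bag-of-words histogram, and ask the classifier for a label.

```python
def recognize_action(new_video_descriptors, classifier, to_histogram):
    """Classify one new video using the model from the training sketch above."""
    histogram = to_histogram(new_video_descriptors).reshape(1, -1)
    return classifier.predict(histogram)[0]  # e.g., "wave" or "walk"

# Usage (assuming the variables from the training sketch exist):
# action = recognize_action(descriptors_of_new_video, classifier, to_histogram)
# print(f"Detected action: {action}")
```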

The information described above gives us an idea of how a computer can detect an action. However, there are still several challenges related to the vast diversity of ways in which a subject can perform an action, as well as the limitations of the camera.

Want to know more?

For more information on this work, we invite you to consult the following research articles:

Camarena, F., Chang, L., Gonzalez-Mendoza, M., & Cuevas-Ascencio, R. J. (2022). Action recognition by key trajectories. Pattern Analysis and Applications, 1-15.

Camarena, F., Chang, L., & Gonzalez-Mendoza, M. (2019, May). Improving the dense trajectories approach towards efficient recognition of simple human activities. In 2019 7th International Workshop on Biometrics and Forensics (IWBF) (pp. 1-6). IEEE.

Authors

Fernando Camarena

Fernando Camarena is a PhD candidate in Computer Science at the Tecnológico de Monterrey. His research focuses on self-supervised learning for video-related tasks.

Ricardo Cuevas

Ricardo Cuevas is a PhD candidate in Computer Science at the Tecnológico de Monterrey, with research in the area of computer vision. Ricardo holds a degree in mechatronics engineering and currently works as a full-stack web developer.

Miguel González Mendoza

Miguel González Mendoza holds a PhD in Artificial Intelligence from the National Institute of Applied Sciences in Toulouse, France, and is a research professor at the Department of Computing of the Tecnológico de Monterrey. He is a National Researcher Level II.

Leonardo Chang

Leonardo Chang holds a PhD in Computer Science from the Instituto Nacional de Astrofísica, Óptica y Electrónica in Mexico. He was a researcher at the Centro de Aplicaciones de Tecnologías de Avanzada in Cuba from 2007 to 2017 and a professor at the Department of Computing of the Tecnológico de Monterrey from 2017 to 2022. He is a National Researcher Level I.
