The challenge (and the advantages) of artificial intelligence recognizing us by voice and face

There is no doubt that technology has become a crucial part of our lives. Mobile phones, tablets or computers allow us to be in constant connection with other people, create content, make bank transactions, purchase all kinds of items or attend a medical consultation from home, among many other things.

A major problem of this online era is that unauthorized people can gain access to all the information on our devices. We can also run into access difficulties ourselves if we forget a password, or when we search for specific information in videos.

That is why it is so important to incorporate artificial intelligence techniques that recognize unique and non-transferable features of the user, such as their face or voice, as a “digital fingerprint”. The advantage over fingerprints, for example, is that the devices do not require specific technology: just the camera and/or microphone that almost all models already incorporate.

Machines that learn in the style of our neurons

In recent years, great advances have been made in this field thanks to deep learning techniques based on neural networks. These networks try to learn as the brain does, simulating the trial-and-error learning process carried out by our neurons. For example, when we are babies, we cannot tell who we are seeing or hearing. The brain learns to identify people with experience.

The key to the process is, therefore, training. The system is given a set of input data together with an indication of what is to be learned from them. Once it has assimilated this information, it will know what to do when it receives new data: in the case at hand, voices and images of faces.
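The training idea described above can be sketched with a single artificial neuron (a perceptron), which is far simpler than the deep networks discussed in this article. The two-number "voice features", the speaker labels and the function names below are hypothetical, chosen only to illustrate learning by trial and error.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train(samples, labels, epochs=20, learning_rate=0.1):
    """Learn by trial and error: guess, compare with the label, correct."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            # error is 0 when the guess is right, +1 or -1 when it is wrong
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical toy data: two features per recording, two speakers (0 and 1).
samples = [(1.0, 1.2), (0.8, 0.9), (1.1, 0.7),
           (3.0, 3.1), (2.8, 3.3), (3.2, 2.9)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(samples, labels)

# New data the neuron has never seen, close to speaker 1's recordings.
new_guess = predict(weights, bias, (3.1, 3.0))
```

Every wrong guess nudges the weights toward the correct answer, and a right guess leaves them unchanged. Real systems apply the same principle to millions of parameters and far richer inputs.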

These techniques already work quite well when the system is “fed” with a lot of information. But what happens if we want to create a system for recognizing people by their voice with little specific data for the application where it is going to be used?

Identify the exact phrase

Today, it is easy to obtain recordings of people talking about any subject, but much harder to obtain recordings of them saying a specific phrase, which is what is needed to improve security or customize recognition systems.

One example is the virtual assistants that are only activated when the owner says ‘Hey, Siri’ or ‘Ok, Google’. These devices already work quite well today, but developers do not always have the immense resources of Apple or Google.

In such cases, with little suitable data to teach the system, using large generically trained neural networks is not the best solution. The system will not be able to correctly tell apart multiple individuals saying a specific phrase.

To address this challenge, at the Engineering Research Institute of Aragon, University of Zaragoza, we have used modified neural networks. In developing them, we took into account that it matters not only who is speaking but also that the speaker pronounces the required phrase, since treating all parts of a recording equally, as large neural networks do, is not ideal in these cases.

To this end, we introduced modifications that allow the system to focus its attention on the different segments of the spoken phrase, in addition to recognizing the identity of the speaker. The resulting networks have proven to be robust and capable of differentiating quite well between different people saying specific phrases.
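The idea of weighting segments of a phrase unequally can be illustrated with a generic attention-pooling sketch. This is not the network described in this article: the segment vectors, the `query` vector (which a real network would learn during training) and the function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(segments, query):
    """Summarize a phrase as a weighted average of its segment vectors.

    Segments that align better with `query` receive larger weights,
    instead of every segment counting equally.
    """
    scores = [sum(q * v for q, v in zip(query, seg)) for seg in segments]
    weights = softmax(scores)
    dim = len(segments[0])
    pooled = [sum(w * seg[i] for w, seg in zip(weights, segments))
              for i in range(dim)]
    return pooled, weights


# Hypothetical features for three segments of one spoken phrase.
segments = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]  # fixed by hand here; learned in a real system

pooled, weights = attention_pool(segments, query)
```

With plain averaging, each segment would contribute one third. Here the second segment dominates the summary because it matches the query most closely.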

Beyond these advances, the scarcity of specific data in certain situations remains an obstacle to improving the security and personalization of recognition systems. For example, we still run into problems when the speaker’s voice changes a lot due to illness.

On the other hand, we can also face the opposite problem: what happens when we have too much information and two physical features to recognize?

Simultaneous voice and face recognition

The spread of devices with cameras and microphones has exponentially increased the volume of videos available on the devices themselves or on the Internet in general. Those recordings are very valuable for developing artificial intelligence techniques: voices and faces can be used to create more secure systems that identify both features at the same time.

However, we need to know what exact information appears in the files. Until now that process has been done manually and is very expensive.

In the work cited above, we also developed new joint voice and face recognition systems that can help analyze and catalog audiovisual content more efficiently and automatically. For example, they would make it possible to search a news program for the moments when someone has spoken about a topic or appeared on screen, even if they are silent.
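As a rough illustration of the kind of search described above, the sketch below scans the segments of a recording and reports where a given person speaks or appears. It assumes each segment already comes with a voice vector and a face vector produced by some recognition model. The data layout, the threshold of 0.8 and the function names are hypothetical, and this is not the system developed in the work mentioned.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_person(segments, known_voice, known_face, threshold=0.8):
    """Return (start_time, speaks, appears) for each segment that matches.

    A segment matches if the voice OR the face is similar enough, so a
    person who is on screen but silent is still found.
    """
    hits = []
    for seg in segments:
        speaks = (seg["voice"] is not None
                  and cosine(seg["voice"], known_voice) >= threshold)
        appears = (seg["face"] is not None
                   and cosine(seg["face"], known_face) >= threshold)
        if speaks or appears:
            hits.append((seg["start"], speaks, appears))
    return hits


# Hypothetical segments of a news program (start time in seconds).
program = [
    {"start": 0, "voice": [1.0, 0.0], "face": [0.9, 0.1]},   # speaks on screen
    {"start": 10, "voice": None, "face": [1.0, 0.05]},       # on screen, silent
    {"start": 20, "voice": [0.0, 1.0], "face": [0.0, 1.0]},  # someone else
]

hits = find_person(program, known_voice=[1.0, 0.0], known_face=[1.0, 0.0])
```

Checking the two features separately is the simplest possible design. A joint system could instead combine both scores into a single decision, which tends to help when one of the two signals is noisy.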

In short, voice and face recognition technology has come a long way in recent years and is already part of our daily lives, but there are still challenges ahead. It is important to address them to improve access to our devices and their security, and to bring technology closer to everyone.

Victoria Mingote Bueno, Postdoctoral Researcher in the Department of Electronic Engineering and Communications and at the Engineering Research Institute of Aragon (I3A), University of Zaragoza

This article was originally published on The Conversation.