Automatic speaker identification has gained popularity in security applications due to its proficiency in detecting and countering fraudulent attempts to mimic a user’s voice through digital manipulation.
However, digital security engineers from the University of Wisconsin-Madison discovered that these systems have a weakness when facing a novel analog attack. According to the team, speaking through customized PVC pipes can fool machine learning algorithms supporting automatic speaker identification systems.
Shimaa Ahmed, the team lead and Ph.D. student, emphasized that the finding challenges the notion that automatic speaker identification is “as secure as a fingerprint.” Furthermore, Ahmed highlighted that the attack is very cheap and can be accomplished with a tube from a hardware store.
Ahmed’s team first spotted a vulnerability when they noticed that the models did not respond the same way to someone speaking through cupped hands or into a box as they did to someone speaking unobstructed.
As a result, the researchers set out to learn whether they could alter the resonant frequencies of a voice to beat the security system. They did so by adjusting the diameter and length of PVC pipes until the pipes produced the resonance of the voice they were attempting to replicate.
Eventually, the team created an algorithm that could determine the PVC pipe dimensions required to imitate a given voice, and according to the researchers, their attack defeated the security systems 60% of the time.
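The researchers have not published their dimension-solving algorithm here, but the underlying physics is standard tube acoustics: a pipe open at both ends resonates at frequencies set by its effective length, so one can solve for the length that places a resonance at a target frequency. The sketch below is purely illustrative and assumes textbook constants (speed of sound, end-correction factor), not values from the team's work.

```python
# Illustrative sketch only -- NOT the team's actual algorithm.
# Models a PVC pipe as an open-open acoustic tube and solves for the
# length whose fundamental resonance lands on a target frequency.

SPEED_OF_SOUND = 343.0   # m/s in air at ~20 °C (assumed)
END_CORRECTION = 0.61    # textbook per-end correction, ~0.61 x radius

def resonant_frequencies(length_m: float, diameter_m: float, n_modes: int = 3):
    """Resonant frequencies (Hz) of an open-open tube with end correction."""
    effective = length_m + 2 * END_CORRECTION * (diameter_m / 2)
    return [n * SPEED_OF_SOUND / (2 * effective) for n in range(1, n_modes + 1)]

def length_for_target(frequency_hz: float, diameter_m: float) -> float:
    """Tube length (m) whose fundamental matches a target resonance."""
    effective = SPEED_OF_SOUND / (2 * frequency_hz)
    return effective - 2 * END_CORRECTION * (diameter_m / 2)
```

For a fixed pipe diameter, `length_for_target` inverts the resonance formula directly; a real attack would need to match several vocal-tract resonances (formants) at once, which is where a search over both diameter and length, as the article describes, comes in.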
The team explained that the attack works because its analog nature allows it to evade the digital attack filters of the voice authentication system. Additionally, the PVC pipe imitates the resonance of the target voice instead of replicating it, which can confuse the machine learning algorithm and lead to a misclassification of the attacking voice.
Kassem Fawaz, professor of electrical and computer engineering and the team’s faculty lead, stressed that machine learning applications that analyze speech signals assume the voice originates from a speaker and travels through the air to a microphone. However, one should not presume the voice will always conform to that expectation: alterations to the speech along the way can break the system’s underlying assumptions and lead to errors.