You're sitting at work when you receive an urgent call from a good friend. She quickly launches into a story about how her daughter fell out of a tree and she desperately needs to borrow money to cover the upfront medical costs because their insurance has lapsed. Eager to help, you immediately take down your friend's bank account information and transfer her the amount she needs.

The following week you're at work and you receive a call from the CEO requesting that you complete an urgent fund transfer to a third-party supplier. You recognize the CEO's distinctive voice and rush to complete the transfer before banking hours are over.

After both calls, you pat yourself on the back for being a good friend and a good employee. Except, in both cases, the caller was not who you thought. You just fell for voice phishing (vishing) attacks and transferred money to attacker-controlled accounts.

Like traditional phishing attacks, vishing relies on a sense of urgency. Unfortunately, vishing schemes have the added advantage of voice calls still largely being seen as trustworthy. This is especially true when the attacker modifies their voice to sound like an authority figure – such as a boss, a bank employee, or a police officer – or someone with whom the call recipient is friendly, or at least familiar.

During a vishing attack, a caller may attempt to impersonate a target through several different means: simply pretending to be an authority figure, imitating a person's voice, or conducting a "replay" attack, in which the attacker uses a recording of the target taken from sources like YouTube or podcasts. Newer twists on such attacks include using a voice modulator to alter the sound of the caller's voice or using "deepfake" technology to generate the voice of a specific individual.

Voice modulation may become a larger area of concern as the technology improves. Vishing fraudsters can already use simple voice modulators to change basic characteristics of their speech, such as pitch and tone – think "What's your favorite scary movie?" from the movie Scream. Simple modulators come in the form of cheap physical devices or readily available apps, many of which promise to impersonate celebrities and politicians for entertainment value. More expensive or professional hardware and software modulators offer greater precision in changing the tone of one's voice and the phonetic quality of one's vowel sounds. While modulators are currently unlikely to fool anyone into believing the user is a famous individual, they can be useful to attackers for changing the general shape (or even the apparent gender) of their voice.
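To illustrate how low the barrier already is, here is a minimal sketch of basic pitch shifting – essentially the offline version of what a cheap modulator does in real time. It assumes the librosa and soundfile Python packages and a hypothetical input file named original.wav; it is an illustration, not any particular product's implementation.

```python
# Minimal illustration of pitch shifting, the same basic effect a cheap
# voice modulator applies in real time. Assumes librosa and soundfile are
# installed; "original.wav" is a hypothetical sample recording.
import librosa
import soundfile as sf

# Load the recording at its native sample rate
audio, sample_rate = librosa.load("original.wav", sr=None)

# Shift the pitch down four semitones to make the voice sound deeper;
# a positive value would raise the pitch instead
deeper_voice = librosa.effects.pitch_shift(audio, sr=sample_rate, n_steps=-4)

# Write the modulated audio to a new file
sf.write("modulated.wav", deeper_voice, sample_rate)
```

A few lines like these change the apparent character of a voice; the professional tools mentioned above simply do the same kind of transformation with more precision and in real time during a call.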

The larger threat comes in the form of deepfakes – audio fakes produced using Artificial Intelligence (AI) systems. Whereas voice modulation alters a caller's voice, deepfakes can generate an imitation of a specific person that can be "trained" to say anything through text-to-speech inputs. Deepfake technology is already in development at several technology companies using generative adversarial networks. Such networks take provided audio inputs, like recordings of a specific person's voice, and use AI and machine learning to produce human-sounding speech in the target's voice. These systems allow the creation of an imitation voice indistinguishable to the human ear from the voice being mimicked. While such systems are being designed for the common good – such as better text-to-speech programs for people with speech impairments – they could ultimately allow attackers to impersonate a target with a believable generated voice.
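For readers curious about the mechanics, the sketch below is a heavily simplified toy (assuming PyTorch) of the adversarial idea behind these systems: a generator learns to produce audio-like data while a discriminator learns to tell real samples from fakes, and each improves by competing against the other. Real voice-cloning systems are far more complex – they condition on text and speaker recordings – but the training loop follows the same push and pull.

```python
# Toy sketch of the generator-vs-discriminator loop behind GAN-based audio
# synthesis. The data here is random noise standing in for spectrogram
# frames; a real voice deepfake would train on recordings of the target and
# condition on text. Assumes PyTorch is installed.
import torch
import torch.nn as nn

FRAME_SIZE = 80      # stand-in for, e.g., 80 mel-spectrogram bins
LATENT_SIZE = 16     # random "seed" vector fed to the generator

# Generator: maps random noise to a fake spectrogram frame
generator = nn.Sequential(
    nn.Linear(LATENT_SIZE, 128), nn.ReLU(),
    nn.Linear(128, FRAME_SIZE), nn.Tanh(),
)

# Discriminator: scores how "real" a frame looks (1 = real, 0 = fake)
discriminator = nn.Sequential(
    nn.Linear(FRAME_SIZE, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(200):
    # "Real" frames: placeholder data standing in for the target's voice
    real = torch.randn(32, FRAME_SIZE)
    fake = generator(torch.randn(32, LATENT_SIZE))

    # Train the discriminator to separate real frames from generated ones
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(32, 1))
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_loss.backward()
    g_opt.step()
```

The point of the toy is the feedback loop: every time the discriminator gets better at spotting fakes, the generator is forced to produce more convincing ones – which is exactly why the end result can be hard for a human listener to distinguish from the genuine voice.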

An attacker who is good at impersonations can already have great success during social engineering calls. I have a coworker who was able to obtain domain access within a client’s environment in less than 10 minutes by effectively impersonating one of their IT staff. Not everyone has those skills. Imagine what kind of success attackers could have if common technology enabled perfect impersonations in every attack from any caller.

As deepfake technologies become more common, the barrier to creating realistic voice fakes of high-profile figures (or of anyone for whom an attacker can gather large amounts of audio, such as through social media) drops. Additionally, as these technologies mature and users create imitations of their own voices for entertainment or for integration into other applications, the risk of data breaches exposing those privately generated voices grows. In the future, criminals could hypothetically skip creating a deepfake of a high-profile individual and instead purchase a leaked voice model that the target had created for use with other services.

Cybercriminal focus on deepfakes is likely to increase as the technology proliferates. According to research from NTT, overall mentions of deepfakes increased between September 2018 and September 2019, with spikes in volume from mid-June to September 2019. Over the same period, the research observed spikes in mentions pairing deepfakes with vishing and pairing deepfakes with cybercrime. It seems likely that as knowledge of deepfakes and their potential uses in vishing attacks spreads, malicious interest and attempts to exploit them will also increase.

While deepfakes and convincingly modulated audio are likely to see increasing use in cybercrime, general user awareness training, along with tried-and-true policies, can help mitigate vishing's potential financial impact on the business.

First and foremost, organizations should implement and enforce a strict separation of duties under which no one person can both request and fulfill a wire transfer. Additionally, the requestor and the transfer agent should either verify transfers in person or use two-factor authentication to confirm their identities. By prohibiting transfers from being authorized on the strength of a single phone call (or email, for that matter), organizations reduce the likelihood of a successful vishing scheme.
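As a sketch of what that policy can look like when enforced in software – a hypothetical internal approval check, not any specific product – the snippet below refuses to release a wire transfer unless the requestor and the approver are two different, separately verified people.

```python
# Hypothetical sketch of a separation-of-duties check for wire transfers.
# The names and verification flags are illustrative; a real system would
# pull them from an identity provider and an out-of-band (e.g. 2FA) check.
from dataclasses import dataclass

@dataclass
class Party:
    username: str
    identity_verified: bool  # confirmed in person or via two-factor auth

def release_transfer(requestor: Party, approver: Party, amount: float) -> bool:
    """Allow a transfer only if two distinct, verified people sign off."""
    if requestor.username == approver.username:
        raise PermissionError("Separation of duties: requestor cannot approve "
                              "their own transfer")
    if not (requestor.identity_verified and approver.identity_verified):
        raise PermissionError("Both parties must verify identity out of band "
                              "(in person or via two-factor authentication)")
    print(f"Releasing transfer of ${amount:,.2f} "
          f"(requested by {requestor.username}, approved by {approver.username})")
    return True

# Example: a single caller claiming to be the CEO cannot push the transfer through
try:
    caller = Party("ceo_caller", identity_verified=False)
    release_transfer(caller, caller, 25_000.00)
except PermissionError as err:
    print(f"Transfer blocked: {err}")
```

The exact mechanism matters less than the principle: no single phone call, however convincing the voice on the other end, should be enough to move money.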

Additionally, while technological innovations such as voice biometric scans could well emerge as barriers to deepfakes, it is best to remember that, in the end, people are always the best and the worst security control. It all comes down to the training and processes in place. So, while audio modulation and the vishing schemes built on it may emerge as a popular new attack vector, they can be mitigated if we account for them in our security programs.

And, in the case of a friend requesting an urgent transfer of funds, it's probably a good idea to subject the request to a default level of skepticism. Offer to call them back at a number you know belongs to them, verify the request through another channel such as email, or – safest of all – offer to write a check and meet up in person.