Adobe Voco

Last updated March 11, 2024

Adobe Voco is an unreleased prototype audio editing and generation software from Adobe. Dubbed "Photoshop-for-voice",^[1] it was first previewed at the Adobe MAX event in November 2016. The technology shown at Adobe MAX was presented as a preview that could potentially be incorporated into Adobe Creative Cloud. It was later revealed that Voco was a research prototype and had never been intended for release.^[2]^[3]

Technical details

In the demo, the software took approximately 20 minutes of a target speaker's recorded speech and generated a sound-alike voice, including phonemes that were not present in the source material. Adobe stated that Voco would lower the cost of audio production.^[1]^[3]

Concerns

Ethical and security concerns were raised over the ability to alter an audio recording to include words and phrases the original speaker never spoke, and the potential risk to voiceprint biometrics.^[1]

Concerns were also raised that it might be used in conjunction with:

Human image synthesis, which has reached such levels of likeness since the early 2000s that distinguishing between a human recorded with a camera and a simulation of a human is very difficult.^[5]
Video manipulation of a person's facial expressions in near real-time using an existing 2D RGB video of them.^[6]

Alternatives

Adobe's lack of publicized progress left room for other projects to build alternatives to Voco, such as Resemble AI and 15.ai, a real-time text-to-speech tool using artificial intelligence.

WaveNet is a similar but open-source research project at London-based artificial intelligence firm DeepMind, developed independently around the same time as Adobe Voco.

References

[1] "sapic". BBC.com. BBC. 2016-11-07. Retrieved 2016-07-05.
[2] "Beta Testing #VoCo". 8 November 2016.
[3] "Is Adobe VoCo dead?". Adobe Blog. 2018-01-27. Retrieved 2020-06-17.
[4] "Now in Beta: Introducing Text-Based Editing in Premiere Pro". community.adobe.com. 2023-02-03. Retrieved 2023-04-16.
[5] Rodgers, Julian. "Adobe Voco - Should We Be Afraid?". Production Expert. Pro Tools. Retrieved 14 December 2018.
[6] Thies, Justus (2016). "Face2Face: Real-time Face Capture and Reenactment of RGB Videos". Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Retrieved 2016-06-18.
