“Wow!” - that was my first impression when I saw OpenAI’s update announcing the release of the Realtime API. Its introduction marks a significant step forward for teams and AI developers seeking to create dynamic experiences. Whether you’re building language apps or educational and training tools, the Realtime API simplifies the process by offering a single solution for real-time conversational experiences without the need to juggle multiple models. I’m excited to guide you through how it works and explore its applications in various business scenarios. We’ll dive into some fascinating examples, uncover how you can make the most of this tool, and discover new possibilities together.
To keep things simple, we can think of OpenAI’s Realtime API as a tool that allows developers to stream audio inputs and outputs directly for more natural conversational experiences. It enables continuous communication with models like GPT-4o through a persistent WebSocket connection. The key feature of the Realtime API is its ability to integrate multimodal capabilities seamlessly, allowing natural speech-to-speech conversations with six preset voices, similar to ChatGPT’s Advanced Voice Mode. Let’s look at what the API changes in practice.
To better understand how the Realtime API fits into an application, it helps to recall how voice experiences were built before: developers chained several models together, transcribing speech to text, running the text through a language model, and then converting the reply back to speech. While the Chat Completions API significantly simplified this workflow by combining the steps into a single API call, it still lagged behind the natural speed of human conversation. OpenAI’s Realtime API addresses these limitations by directly streaming audio inputs and outputs, providing smoother conversational interactions and automatically managing interruptions, similar to the functionality of ChatGPT’s Advanced Voice Mode.
AI developers can set up a persistent WebSocket connection to communicate with GPT-4o, making it possible to exchange information continuously. The API also supports function calling, which allows voice assistants to perform tasks or retrieve additional details based on the user’s needs.
According to the official OpenAI Realtime API documentation, the Realtime API enables you to build low-latency, multimodal conversational experiences. It currently supports text and audio as both input and output. The WebSocket connection requires the following parameters:

URL: wss://api.openai.com/v1/realtime
Query parameters: ?model=gpt-4o-realtime-preview-2024-10-01
Headers: Authorization: Bearer YOUR_API_KEY and OpenAI-Beta: realtime=v1
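As a minimal sketch in Node.js (using the ws package, and assuming your key is in the OPENAI_API_KEY environment variable), connecting and requesting a first response looks roughly like this:

```javascript
// Sketch: open a Realtime API session over WebSocket.
const WebSocket = require('ws');

const url = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01';
const ws = new WebSocket(url, {
  headers: {
    Authorization: 'Bearer ' + process.env.OPENAI_API_KEY,
    'OpenAI-Beta': 'realtime=v1',
  },
});

ws.on('open', () => {
  // Client and server exchange JSON events over the socket.
  ws.send(JSON.stringify({
    type: 'response.create',
    response: { modalities: ['text'], instructions: 'Greet the user.' },
  }));
});

ws.on('message', (data) => {
  const event = JSON.parse(data.toString());
  console.log(event.type); // e.g. 'response.text.delta', 'response.done'
});
```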
OpenAI’s documentation shows examples of using the API for different purposes (assuming you have already instantiated a WebSocket): sending user audio, streaming user audio, and calling functions.
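For the first two, the client appends base64-encoded audio chunks (16-bit PCM at 24 kHz by default) to the session’s input audio buffer. A rough sketch, reusing the ws connection from above:

```javascript
// Sketch: append one chunk of user audio to the input buffer.
// pcm16Chunk is assumed to be a Buffer of raw 16-bit PCM audio.
function sendAudioChunk(ws, pcm16Chunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16Chunk.toString('base64'),
  }));
}

// With server-side voice activity detection (the default), the server
// notices when the user stops speaking and responds on its own; without
// it, commit the buffer manually to trigger processing:
// ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
```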
As for calling functions: clients can choose standard functions for the server to use during a session, or set specific functions for each response as needed. The server will use these functions to handle requests if it finds them suitable.
These functions are provided in a simple format, and no extra details are needed, since only one type of tool is currently supported. When the model decides to call a function, it can also respond verbally, for example, “Sure, let me submit that order for you.” The function description helps guide the server on what to do, with hints such as “don’t confirm the order is completed yet” or “respond to the user before using the function.”
The client then needs to reply to the function call by sending a message labeled "function_call_output". Adding this doesn’t automatically trigger another model response, so the client may need to request one manually.
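Putting that together, here is a sketch of registering a tool for the session and answering a function call; the submit_order function and its schema are hypothetical examples, not part of the API:

```javascript
// Sketch: declare a hypothetical function the model may call.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    tools: [{
      type: 'function',
      name: 'submit_order', // hypothetical example function
      description: "Submit the user's order. Don't confirm the order is completed yet.",
      parameters: {
        type: 'object',
        properties: { item: { type: 'string' } },
        required: ['item'],
      },
    }],
  },
}));

// When the server emits a function call, send back its output, then
// explicitly ask for a follow-up response (one is not started automatically).
function replyToFunctionCall(callId, result) {
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,
      output: JSON.stringify(result),
    },
  }));
  ws.send(JSON.stringify({ type: 'response.create' }));
}
```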
OpenAI Realtime API examples across different industries demonstrate how these features can simplify voice-based applications. Let’s focus on the education sector as a great example of API usage: it’s time to talk about how all this magic works in real businesses and look at Realtime API examples in the edtech industry. OpenAI has already released ChatGPT Edu, a version of ChatGPT built specifically for educational institutions, and the Realtime API opens up even more possibilities.
OpenAI’s Realtime API can be effectively integrated into AI agents for educational and training purposes, enabling more interactive and engaging experiences for students. AI teaching assistants can facilitate spoken discussions and provide immediate feedback during practice sessions, helping students understand concepts more thoroughly. Realtime API examples in educational settings include AI tutors that conduct voice-based quizzes, making the learning process feel more natural. For example, our AI Agents Platform IONI allows you to create an agent that communicates with an AI avatar of the teacher or tutor and provides all the needed information about a course. Integrating the Realtime API into AI learning companions also allows them to support students with disabilities by offering spoken instructions and responses, and using the API for AI study buddies can aid language learning through conversational practice and pronunciation assessment.
Furthermore, the Realtime API can simulate natural conversations, allowing learners to engage in realistic dialogue scenarios. Because AI agents can instantly process spoken input and produce clear voice responses, they can help learners practice new languages or even master specific professional jargon. This technology also opens new possibilities for AI in eLearning content delivery: AI instructors can use OpenAI’s Realtime API to present audio lectures. This capability makes the learning experience more engaging for remote and in-person students alike, as they can communicate with the AI conversationally, almost like having a real tutor available at any time.
Finally, the Realtime API lets educational chatbots assist educators with administrative tasks, such as automating attendance tracking through voice recognition. This allows teachers to focus more on instructional activities than on routine tasks, creating a more efficient and enjoyable educational environment.
OpenAI’s Realtime API comes with multiple safety measures, building on experience from ChatGPT’s Advanced Voice Mode and extensive testing with external experts. These steps aim to minimize risks while providing reliable voice interaction.
Let’s have a look at some important safety considerations and challenges of the OpenAI Realtime API. The system employs both automated tools and human review to identify flagged content that may violate policies, which helps detect and address harmful or inappropriate uses. Developers are prohibited from using the API for malicious activities such as spamming or spreading false information, and violations can lead to service suspension.
API developers are required to clearly inform users when they are interacting with AI. This ensures that people know they are not speaking with a human, which reduces confusion and unintended trust. The Realtime API also follows strict privacy commitments: data is not used to train models without the user’s explicit consent, thereby protecting sensitive information and complying with privacy standards. OpenAI publishes official Usage Policies that must be followed by all teams using any of OpenAI’s services or APIs.
OpenAI continues to improve the Realtime API, and feedback is being gathered to guide improvements before it reaches broader availability, with plans to expand into other modalities like vision and video in the future. Current limits allow around 100 simultaneous sessions for top-tier developers, but this will gradually increase to accommodate larger-scale use cases. The Realtime API is also planned to be incorporated into the official OpenAI Python and Node.js SDKs, making it easier for developers to adopt. New features like prompt caching are on the roadmap, allowing past conversation history to be reprocessed more cost-effectively. Support for additional models such as GPT-4o mini, and perhaps education-focused variants, is expected in upcoming releases, broadening the range of potential applications. These updates aim to empower developers building new experiences for industries like education.
At the same time, we should not forget about alternatives to OpenAI. Giants such as Google, with Gemini, and Meta, with Llama, won’t stand aside and will ship their own updates. This kind of realtime, multimodal technology will play a crucial role in creating interactive experiences that go beyond text.
Now it is time to draw some conclusions from what we’ve covered. The release of the Realtime API by OpenAI starts a new era of AI agents, especially in industries such as education. Its WebSocket-based connection provides smooth, real-time interaction between the user and the LLM. We can forget about chaining three or four layers of integration just so a regular user can talk to an AI avatar: the Realtime API closes these gaps and makes your software user-friendly. OpenAI’s Realtime API opens new possibilities for engineers by simplifying the creation of voice, and eventually video, experiences. Its ability to connect users directly to AI avatars without complicated integrations makes the technology more accessible. This ease of implementation is particularly valuable for businesses looking to enhance user interaction and provide seamless customer experiences, and it paves the way for more intuitive AI/ML development tools tailored to industry-specific needs.
Ilya Gelfenbeyn was the founding CEO of API.ai. He and his team were true pioneers in the voice AI industry and were rewarded for those efforts through an acquisition by Google in 2016. API.ai was the development environment that most of the first Google Actions were built upon. It is better known today as Dialogflow, after its 2017 rebranding, and is one of the most widely used solutions for building conversational AI experiences.
What you may not know is that API.ai was preceded by Speaktoit, which was known as the Siri of Android. Speaktoit amassed over 40 million users for its app-based virtual assistant, and the experience taught the team a lot about the tooling required to deploy a successful conversational assistant. That ultimately led to the creation of developer tools and the pivot into API.ai. Gelfenbeyn later was a founding member of Google Assistant Investments, where he was involved in direct funding of several prominent voice AI startups. Today he leads an angel syndicate called The AI, where he invests in AI-related companies. You can listen to the podcast interview above, on Google or Apple Podcasts, or in most of the leading podcast players.
The host, Bret Kinsella, was named analyst and journalist of the year in 2019 and 2021 and is widely cited in media and academic research as an authority on voice assistants and AI. He is also the editor of the Voice Insider newsletter.
Let the AI Gold Rush Begin: businesses can now get paid for services built on the large language model, meaning chatbots are going to start appearing everywhere. When OpenAI, the San Francisco company developing artificial intelligence tools, announced the release of ChatGPT in November 2022, former Facebook and Oculus employee Daniel Habib moved quickly.
Habib used the chatbot to build QuickVid AI, which automates much of the creative process involved in generating ideas for YouTube videos. Creators input details about the topic of their video and what kind of category they’d like it to sit in; QuickVid then interrogates ChatGPT to create a script, and other generative AI tools voice the script and create visuals.
Alongside the ChatGPT API, OpenAI also released an API for Whisper, a speech recognition AI the company has developed. Habib hooked QuickVid up to the official ChatGPT API. “All of these unofficial tools that were just toys, that would live in your own personal sandbox and were cool, can now actually go out to tons of users,” he says.
OpenAI’s announcement could be the start of a new AI gold rush. What was previously a cottage industry of hobbyists operating in a licensing gray area can now turn its tinkering into fully fledged businesses. “What this release means for companies is that adding AI capabilities to applications is much more accessible and affordable,” says Hassan El Mghari, who runs a tool that uses ChatGPT’s computational power to generate Twitter profile text for users.
OpenAI’s pledge not to use data submitted through the API to train its models, says David Foster, a partner at a data science and AI consultancy based in London, will be “critical” for getting companies to use the API. Foster thinks the fear that clients’ personal information or business-critical data could be swallowed up by ChatGPT’s training models was preventing companies from adopting the tool to date. “It shows a lot of commitment from OpenAI to basically state, ‘You’re not going to find your company’s data turning up in that general model,’” he says. This policy change means that companies can feel in control of their data, rather than having to trust a third party, OpenAI, to manage where it goes and how it’s used.
“You were building this stuff effectively on somebody else’s architecture, according to somebody else’s data usage policy,” he says. That shift, combined with the falling price of access to large language models, means there will likely be a proliferation of AI chatbots in the near future. API access to ChatGPT (or, more officially, what OpenAI is calling GPT-3.5) is 10 times cheaper than access to OpenAI’s lower-powered GPT-3 API, which launched in June 2020 and could generate convincing language when prompted but did not have the same conversational strength as ChatGPT.
“It’s much cheaper and much faster,” says Alex Volkov, founder of the Targum language translator for videos, which was built unofficially off the back of ChatGPT at a December 2022 hackathon. That could change the economics of AI for many businesses. “It’s an amazing time to be a founder,” QuickVid’s Habib says. “Because of how cheap it is and how easy it is to integrate, every app out there is going to have some type of chat interface or LLM [large language model] integration … People are going to have to get very used to talking to AI.”
ChatGPT has quickly evolved from a fun, and occasionally creepy, distraction into an enterprise solution attracting substantial interest. OpenAI making ChatGPT available via API means businesses can more easily layer the software into their own apps and websites and support native experiences on those channels. The API is also about 10 times cheaper than the existing GPT-3.5 models, along with being the best version currently available for non-text-based applications. The move primes ChatGPT for wider adoption among brands and platforms, capitalizing on the massive amount of hype that’s surrounded the product since its November launch. Snap Inc. earlier this week unveiled a My AI chatbot for Snapchat+ that relies on the ChatGPT API.
OpenAI in the announcement detailed several other partnerships built on the tech. Quizlet, a learning platform that’s worked with OpenAI for several years, is leveraging the ChatGPT API for a tutoring function dubbed Q-Chat that can answer students’ questions. Instacart is pairing the third-party AI with its own while using product data from retailers to provide shoppable answers to questions such as “How do I make great fish tacos?” (a planned feature). Shop, the consumer-facing app for e-commerce firm Shopify, is similarly supporting a new shopping assistant with the ChatGPT API.
These early use cases show how ChatGPT is sparking renewed interest in areas like chatbots that have stoked excitement over the years but often failed to live up to their promises and frequently frustrated end users
ChatGPT itself remains prone to errors and off-putting responses
while generative AI as a whole is a thorny field that’s raised serious questions around ethics and ownership
Still, the business world seems all in on AI at the moment. Microsoft invested $10 billion in OpenAI in January and is using its software to upgrade Bing and Edge. Bain & Company last month forged an alliance with OpenAI that will help the consultant design bespoke AI services for blue-chip clients including Coca-Cola
Google and Meta Platforms are ramping up their own AI-based initiatives to keep pace
Have you ever wondered whether you could create a new friend of your own? Well, it might not be that intelligent, but it’s not worthless to try creating something new. On the web, we normally rely on visual UI elements to interact with users; with the Web Speech API, we can develop rich web applications with natural voice interactions and a minimal visual interface. This enables countless use cases for richer web applications. Moreover, the API can make web apps accessible, helping people with physical or cognitive disabilities or injuries. The future web will be more conversational and accessible.
Here, we will use the API to create an artificial intelligence (AI) voice chat interface in the browser. The app will listen to the user’s voice and reply with a synthetic voice. Because the Web Speech API is still experimental, the app works only in supported browsers: both interfaces used here, speech recognition and speech synthesis, are currently available only in Chromium-based browsers, while Edge and Safari support only speech synthesis at the moment.
To build this app, you will need Node.js and the following npm packages:

API.AI: npm i apiai
Socket.IO: npm install socket.io
dotenv: npm i dotenv-extended
Express: npm install express --save

Setting Up Your Application

Set up a web app framework with Node.js and lay out your app’s structure like this.
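One possible layout (this exact tree is an assumption; adjust to taste):

```
.
├── index.js
├── .env
└── public
    ├── index.html
    ├── css
    │   └── style.css
    └── js
        └── script.js
```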
Then, run this command to initialize your Node.js app.
This will generate a package.json file that contains the basic info for your app. Then, install all of the dependencies needed to build this app using the npm commands listed at the start of this section.
Socket.IO is a library that enables us to use WebSocket easily with Node.js. By establishing a socket connection between the client and server, our chat messages will be passed back and forth between the browser and our server as soon as text data is returned by the Web Speech API (the voice message) or by the API.AI API (the “AI” message).
Now, let’s create an index.js file, instantiate Express, and listen to the server.
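A minimal sketch (port 3000 is an assumption):

```javascript
// index.js: serve the static front end from /public.
const express = require('express');
const app = express();

app.use(express.static(__dirname + '/public'));

const server = app.listen(process.env.PORT || 3000, () => {
  console.log('Server listening on port %d', server.address().port);
});
```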
Now we will integrate the front-end code with the Web Speech API. The UI of this app is simple: just a button to trigger voice recognition. Let’s set up our index.html file and include our front-end JavaScript file (script.js) and Socket.IO, which we will use later to enable the real-time communication.
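A minimal sketch of the markup (class names and paths are assumptions consistent with the other snippets here):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Voice Chat</title>
  <link rel="stylesheet" href="css/style.css">
</head>
<body>
  <button class="btn">Talk</button>
  <!-- Socket.IO client script, served automatically by the Socket.IO server -->
  <script src="/socket.io/socket.io.js"></script>
  <script src="js/script.js"></script>
</body>
</html>
```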
To style the button, refer to the style.css file in the source code.
In script.js, invoke an instance of SpeechRecognition, the controller interface of the Web Speech API for voice recognition.
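A minimal version (the prefixed fallback matters in Chrome):

```javascript
// script.js: Chrome still exposes the API behind a webkit prefix.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
```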
We’re including both prefixed and non-prefixed objects, because Chrome currently supports the API with prefixed properties. Also, we are using some ECMAScript 6 syntax in this tutorial, because ES6 features such as the const keyword and arrow functions are available in browsers that support both Speech API interfaces, SpeechRecognition and SpeechSynthesis.
Optionally, you can set a variety of properties to customize speech recognition.
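For example (the values here are assumptions):

```javascript
recognition.lang = 'en-US';          // language to recognize
recognition.interimResults = false;  // only report final results
```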
Next, capture the DOM reference for the button UI and listen for the click event to initiate speech recognition. Once recognition has started, use the result event to retrieve what was said as text. The event delivers a SpeechRecognitionResultList object, from which you can retrieve the transcript text as well as a confidence value for the transcription.
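Putting those steps together (the .btn class matches the markup above):

```javascript
const btn = document.querySelector('.btn');

// Start listening when the user clicks the button.
btn.addEventListener('click', () => {
  recognition.start();
});

// Fired when the recognizer has a result for us.
recognition.addEventListener('result', (e) => {
  // e.results is a SpeechRecognitionResultList; take the latest result's
  // first alternative and read its transcript and confidence.
  const last = e.results.length - 1;
  const text = e.results[last][0].transcript;
  console.log('Transcript:', text);
  console.log('Confidence:', e.results[last][0].confidence);
});
```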
Socket.IO is a library for real-time web applications. It enables real-time bidirectional communication between web clients and servers. We are going to use it to pass the result from the browser to the Node.js code, and then pass the response back to the browser. You may be wondering why we are not using simple HTTP or AJAX instead. You could send data to the server via POST; however, we are using WebSocket via Socket.IO because sockets are the best solution for bidirectional communication, especially when pushing an event from the server to the browser. With a continuous socket connection, we won’t need to reload the browser or keep sending AJAX requests at frequent intervals.
Now, instantiate Socket.IO in script.js.
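One line is enough:

```javascript
// script.js: open a Socket.IO connection back to our Express server.
const socket = io();
```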
Then send the transcript to the server wherever you handle the result event from SpeechRecognition.
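For instance, inside the result handler shown earlier (the 'chat message' event name is an assumption reused below):

```javascript
// Send the recognized text to the server over the socket.
socket.emit('chat message', text);
```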
Now, let’s go back to the Node.js code to receive this text and use AI to reply to the user. To build a quick conversational interface, we will use API.AI, because it provides a free developer account and lets us set up a small-talk system quickly using its web interface and Node.js library. You can reuse an existing API.AI access token, or get your own by visiting the official site (Getting Started) and signing up.
Next, use the server-side Socket.IO to receive the result from the browser. Once the connection is established and a message is received, use the API.AI APIs to retrieve a reply to the user’s message. When API.AI returns the result, use Socket.IO’s socket.emit() to send it back to the browser.
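A sketch of the server side (the 'chat message' and 'bot reply' event names and the APIAI_TOKEN variable are assumptions carried over from the snippets above):

```javascript
// index.js: wire Socket.IO to API.AI.
require('dotenv-extended').load();        // loads APIAI_TOKEN from .env
const io = require('socket.io')(server);  // `server` from the Express setup
const apiai = require('apiai')(process.env.APIAI_TOKEN);

io.on('connection', (socket) => {
  socket.on('chat message', (text) => {
    const request = apiai.textRequest(text, { sessionId: 'voice-chat' });
    request.on('response', (response) => {
      const reply = response.result.fulfillment.speech;
      socket.emit('bot reply', reply);    // push the AI reply to the browser
    });
    request.on('error', (error) => console.error(error));
    request.end();
  });
});
```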
Finally, create a function to generate a synthetic voice. This time, we are using the SpeechSynthesis controller interface of the Web Speech API. The function takes a string as an argument and enables the browser to speak the text.
You might notice that there is no prefixed property this time: this API is more widely supported than SpeechRecognition, and all browsers that support it have already dropped the prefix for SpeechSynthesis. Inside the function, create a new SpeechSynthesisUtterance() instance using its constructor and set the text that will be synthesized when the utterance is spoken. You can set other properties, such as voice, to choose the type of voice that the browser and operating system should support. Finally, use SpeechSynthesis.speak() to let it speak.
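A sketch of such a function (the name synthVoice is just a convention):

```javascript
// script.js: speak a string out loud with the SpeechSynthesis interface.
function synthVoice(text) {
  const synth = window.speechSynthesis;
  const utterance = new SpeechSynthesisUtterance();
  utterance.text = text;
  synth.speak(utterance);
}
```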
Lastly, get the response from the server using Socket.IO again, and speak it with the function we just wrote.
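For example:

```javascript
// When the server pushes the AI reply, voice it in the browser.
socket.on('bot reply', (replyText) => {
  synthVoice(replyText);
});
```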
And it’s done! Start the app from your terminal, then open localhost:3000 in any supported browser.
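Assuming index.js is the entry point from earlier:

```
node index.js
```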
You can refer to my repository for further help
Pavel Averin
Remember how almost every sci-fi TV series and film had a computer that could be operated by voice?
Back then, it seemed like something nobody would actually implement - there are so many more useful things you can do with your developers’ time (and your app budget) than building features that don’t benefit the final user. Today, however, voice control really matters. Sounds like something you wish your users had? There’s more than one way to add voice control, so before choosing an approach you need to understand the difference between speech recognition and natural language processing.
Speech recognition converts the spoken word to written text: with it, you can dictate messages or emails to your device and then send them. You can also use text-to-speech (TTS) techniques to imitate the human voice, so you can check how a word is supposed to sound. Natural language processing is a much more advanced field of computer science, concerned with understanding the meaning of the user’s phrase. It uses artificial intelligence and machine learning to catch what you actually meant when you spoke to the device.
If you want your app to let the user order a pizza or book tickets on the next flight to Hawaii, it needs a natural language processing engine. The app can only help you when it understands the true meaning of your request: it should not only HEAR you but also UNDERSTAND you.
Let’s assume that our goal is to create a personal assistant mobile application called Lucy (“à la Siri”). We’ve defined the requirements we need to achieve the best user experience and compared several technologies: PocketSphinx, Alexa Voice Service, wit.ai, IBM Watson, Nuance, and Sensory. We are using PocketSphinx as an offline solution only for keyword recognition; for everything beyond the wake word, we’ll require some cloud service to handle the request.
Alexa Voice Service (AVS) is a cloud speech-recognition service from Amazon. “Alexa” is the wake-up word that starts the conversation, and our service gets called when customers use our invocation name, for example: “Alexa, ask Lucy to say hello world.” This example is a simple command-oriented one, but the Alexa Skills Kit (ASK) also supports more sophisticated multi-command dialogues and parameter passing.
How wit.ai works: you send it text or an audio stream over its HTTP API, and it returns structured data, the intents and entities it has recognized, which your app can then act on.
IBM Watson is a powerful tool for machine learning and analytics. Basically, it focuses on analysing and structuring data, and it has speech-to-text and text-to-speech solutions, but it doesn’t fit the purpose of our application. Nuance provides many voice recognition and natural language processing services, and it has a ready solution for mobile speech recognition, VoCon Hybrid, which could solve our most difficult issue: custom keyword recognition.
Sensory is another expert in the speech-recognition field. TrulyHandsfree is one of the solutions it offers, and it looks promising; we recommend it if you want a high-quality application. Keep in mind, though, that Nuance technology is not available for free usage.
We hope this article gives you a comprehensive introduction to the speech-recognition and natural language processing solutions available at the moment. If you have any questions, doubts, or suggestions, don’t hesitate to leave a comment.
New developer tools enable developers to more easily convert bots for use on chat platforms like Facebook Messenger, and bots on different platforms can “share knowledge” with each other via machine learning, thereby cutting down on developer costs and time spent on building and maintaining these bots. Today, platforms such as Messenger and Slack have different natural language processing (NLP) platforms, which means developers need to build different bots for each channel. By allowing bots to “share knowledge” between platforms, developers need to train their bots only once, making it easier to build complex apps with growing capabilities. This could result in a massive ramp-up in the number of high-functioning intelligent apps and bots hitting the market.
Apple, meanwhile, has been acquiring machine learning startups, likely with the intention of building machine learning into Swift. This would make it easier for developers to build apps with artificial intelligence (AI) capabilities. Because machine learning enables software to improve itself, publishers will need fewer developers and data scientists. Building the enabling technology into a software framework will also improve the overall user experience: so far, the enabling technology available to developers has been relatively unsophisticated, at times leading to a disappointing user experience.
Some platforms are starting to change this. Facebook’s Messenger Platform, for example, provides access to tools for building basic chatbots, and its bot engine, built with the same highly sophisticated AI software that Facebook used to create M, its virtual assistant, enables developers to create far more intelligent virtual agents, capable of utilizing machine learning to vastly improve interactions with consumers.