One of the wilder features in AgenticMail Enterprise is meeting intelligence. An AI agent can join a Google Meet call, listen to the conversation, speak when spoken to, transcribe everything in real time, and generate structured meeting notes afterward. Getting this working required solving several problems that I hadn’t seen anyone tackle together.
Joining the Call
The agent joins Google Meet through Playwright browser automation. It launches a Chromium instance, navigates to the meeting URL, handles the “join now” flow, and enters the call. This sounds straightforward, but Google Meet’s UI is a moving target. The join flow involves dismissing permission prompts, handling pre-join screens where you configure your mic and camera, and dealing with lobby/waiting-room states.
The Playwright automation is resilient to UI changes because it uses a prioritized chain of selectors: ARIA labels first, then data attributes, then text-content matching. When Google ships a UI update, the fallback order means most changes don’t break the automation. For the ones that do, updating a selector map is a five-minute fix.
The agent joins with camera off and microphone muted by default. It identifies itself with a configurable display name, typically something like “AI Assistant (AgenticMail)” so participants know there’s an agent on the call.
Speaking with ElevenLabs TTS
Giving the agent a voice was the hardest part. The audio pipeline works like this:
- The agent generates a text response based on the conversation context.
- The text goes to ElevenLabs for text-to-speech synthesis. ElevenLabs produces natural-sounding speech that doesn’t trigger the uncanny valley reaction you get from older TTS systems.
- The generated audio gets routed through a virtual audio device created with sox. This virtual device appears as a microphone input to the Chromium instance running Google Meet.
- The agent unmutes, the audio plays through the virtual mic, and participants hear the agent speak.
The sox virtual audio device setup is platform-specific. On macOS, it uses sox and a virtual audio driver to create a loopback device. The audio quality is clean because there’s no physical speaker/microphone path introducing noise or echo.
Latency is the critical metric here. From the moment the agent decides to speak to the moment participants hear it, the total pipeline latency is roughly 1.5 to 3 seconds depending on the length of the utterance. ElevenLabs streaming helps because the audio starts playing before the full synthesis is complete. For short responses (“I’ll add that to the action items”), the latency feels natural, like a person collecting their thoughts before speaking.
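A back-of-the-envelope model shows why streaming matters. The numbers below are illustrative, not measured values from the real pipeline:

```python
def time_to_first_audio(network_rtt: float, synthesis_rate: float,
                        chunk_seconds: float, streaming: bool,
                        utterance_seconds: float) -> float:
    """Seconds from sending text until the first audio sample plays.

    synthesis_rate: seconds of audio produced per wall-clock second
    (e.g. 4.0 means the TTS runs 4x faster than real time). With
    streaming, we only wait for the first chunk to arrive; without
    it, we wait for the whole utterance to be synthesized.
    """
    audio_needed = chunk_seconds if streaming else utterance_seconds
    return network_rtt + audio_needed / synthesis_rate

# A 10-second answer, synthesized at 4x real time, 0.3 s round trip:
blocking = time_to_first_audio(0.3, 4.0, 0.5, False, 10.0)  # 2.8 s
streamed = time_to_first_audio(0.3, 4.0, 0.5, True, 10.0)   # 0.425 s
```

The gap widens with utterance length: blocking synthesis scales with the whole response, while streaming keeps time-to-first-audio roughly constant, which is what makes long answers feel acceptable on a live call.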
Real Time Transcription
While the meeting is happening, the agent is transcribing everything. It captures the audio stream from the Google Meet tab and processes it through a speech-to-text pipeline. Speaker diarization (identifying who said what) works by correlating the audio with the participant list visible in the Meet UI.
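The correlation step reduces to a timestamp lookup. A minimal sketch, assuming the automation samples (timestamp, speaker name) pairs from Meet’s active-speaker indicator:

```python
from bisect import bisect_right

def label_segments(segments, speaker_events):
    """Attribute transcript segments to speakers.

    segments: [(start_time, text)] from the STT pipeline.
    speaker_events: [(time, name)] sampled from the Meet UI's
    active-speaker highlight, sorted by time.
    Each segment gets the speaker who was active when it started.
    """
    times = [t for t, _ in speaker_events]
    labeled = []
    for start, text in segments:
        i = bisect_right(times, start) - 1
        name = speaker_events[i][1] if i >= 0 else "unknown"
        labeled.append((name, text))
    return labeled
```

This UI-correlated approach sidesteps acoustic diarization entirely: Meet has already done the hard work of deciding who is speaking, so the agent only has to read it.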
The transcription runs continuously, building up a full record of the conversation as it happens. The agent uses this live transcript as its context for deciding when and how to contribute. If someone asks “can someone summarize what we’ve agreed on so far?” the agent has the full conversation available to generate that summary.
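The live transcript store can be sketched as an append-only log with a budgeted context renderer. The word-count budget is a stand-in for a real token budget, and the class name is hypothetical:

```python
class LiveTranscript:
    """Append-only meeting transcript.

    render_context() returns the most recent segments that fit a
    rough word budget (a stand-in for LLM token limits), newest
    segments always included, oldest trimmed first.
    """
    def __init__(self, max_words: int = 2000):
        self.segments = []
        self.max_words = max_words

    def add(self, speaker: str, text: str) -> None:
        self.segments.append(f"{speaker}: {text}")

    def render_context(self) -> str:
        lines, words = [], 0
        for seg in reversed(self.segments):   # walk newest-first
            words += len(seg.split())
            if words > self.max_words:
                break                         # budget exhausted
            lines.append(seg)
        return "\n".join(reversed(lines))     # restore chronology
```

When someone asks for a summary mid-call, the agent renders this context into its prompt; for short meetings the budget never trims anything, so the “full conversation” claim holds.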
Meeting Notes Generation
After the call ends (or on demand during the call), the agent generates structured meeting notes. These include:
Summary. A concise overview of what was discussed and decided.
Action items. Extracted from the conversation with assignees where mentioned. “John, can you handle the deployment by Friday?” becomes an action item assigned to John with a Friday deadline.
Key decisions. Statements that indicate a decision was made, extracted and listed separately from general discussion.
Open questions. Topics that were raised but not resolved, flagged for follow-up.
Attendees and duration. Who was on the call and how long it lasted.
The notes can be automatically sent to participants via email, posted to a Slack channel, or saved to the knowledge base for future reference.
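In practice the extraction is LLM-driven, but a toy regex makes the shape of the action-item pattern concrete. Everything here is illustrative, not the real pipeline:

```python
import re

# Toy pattern for "<Name>, can you <task> by <deadline>?" requests.
# The real extraction uses LLM inference over the transcript; this
# regex only illustrates the structure being pulled out.
ACTION_RE = re.compile(
    r"(?P<assignee>[A-Z][a-z]+), can you (?P<task>.+?)"
    r"(?: by (?P<deadline>[A-Z][a-z]+))?\?"
)

def extract_action_items(transcript: str):
    """Return action items as assignee/task/deadline dicts."""
    return [
        {"assignee": m["assignee"], "task": m["task"],
         "deadline": m["deadline"]}
        for m in ACTION_RE.finditer(transcript)
    ]
```

Running it on the example from above, “John, can you handle the deployment by Friday?” yields an item assigned to John with a Friday deadline; requests without a deadline still produce an item, with `deadline` left empty.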
Technical Challenges
The biggest challenge isn’t any single component. It’s the coordination. The browser automation, audio pipeline, transcription engine, and LLM inference all need to operate simultaneously with tight timing. A hiccup in the audio pipeline while the agent is speaking creates an awkward silence. A delay in transcription means the agent might miss context.
I run the audio pipeline and browser automation as separate processes communicating through shared buffers. The transcription runs in its own thread with a sliding window over the audio stream. LLM inference is async, so the agent can process a response while still transcribing incoming speech.
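The coordination shape can be sketched with a queue between a producer and a transcription worker. This is a minimal single-process illustration of the pattern, not the real multi-process setup with shared buffers:

```python
import queue
import threading

audio_chunks = queue.Queue()   # capture side pushes audio here
transcript = []                # transcription side appends results

def transcription_worker(stop: threading.Event) -> None:
    """Drain the audio queue with a sliding window over recent chunks."""
    window = []
    while not stop.is_set() or not audio_chunks.empty():
        try:
            chunk = audio_chunks.get(timeout=0.1)
        except queue.Empty:
            continue
        window.append(chunk)
        window = window[-5:]                  # keep only the last N chunks
        transcript.append(f"stt({chunk})")    # stand-in for real STT

stop = threading.Event()
worker = threading.Thread(target=transcription_worker, args=(stop,))
worker.start()
for i in range(3):                            # stand-in for live capture
    audio_chunks.put(f"chunk{i}")
stop.set()
worker.join()
```

The key property is that capture never blocks on transcription: the queue absorbs bursts, and the worker drains whatever remains before shutting down, so no audio is dropped at the end of a call.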
It’s not simple. But watching an AI agent join a meeting, listen to the discussion, answer a question verbally, and then email everyone the notes afterward is one of those moments where the future feels tangible.