A Technical Guide to Agents, Tools, and Architectures That Combine Vision, Speech, and Language
Unlock the future of artificial intelligence with the first deeply practical guide to building and understanding multimodal AI systems - agents and architectures that can see, hear, speak, and reason.
Whether you're a machine learning engineer, AI researcher, or technical product leader, Multimodal Systems in Practice equips you with the tools, frameworks, and know-how to build powerful multimodal agents using vision, speech, and language models - all in real-world settings.
This definitive guide covers cutting-edge systems like GPT-4o, Gemini 1.5, Claude 3.5, ImageBind, Whisper, Sora, Runway, and more - showing how they work under the hood and how to integrate them into practical pipelines.
How Multimodal AI Models Work: Understand the architectures behind vision-language models (VLMs), audio-text agents, and real-time multimodal perception.
How to Build with LangChain, Hugging Face & OpenAI Tools: Construct multimodal pipelines using popular frameworks.
How to Combine Images, Audio & Text in One System: Step-by-step examples for building agents that speak, see, listen, and act in real time (a minimal code sketch follows this list).
How to Evaluate & Deploy Multimodal Systems: Master benchmarking, memory management, and safety protocols for production-ready systems.
How to Navigate Ethics and Risks: Address hallucinations, deepfake risks, prompt injection attacks, and visual bias.
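For a concrete feel of what "combining images, audio, and text in one system" can look like, here is a minimal sketch using Hugging Face pipelines. It is illustrative only, not code from the book; the checkpoints are public defaults, and the audio and image file names are hypothetical.

```python
# Illustrative sketch only (not code from the book): combine speech, vision, and text
# with off-the-shelf Hugging Face pipelines. File names and model choices are assumptions.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
responder = pipeline("text2text-generation", model="google/flan-t5-base")

# 1. Hear: transcribe the user's spoken request (hypothetical local file).
speech = asr("user_request.wav")["text"]

# 2. See: caption the image the user is asking about (hypothetical local file).
caption = captioner("photo.jpg")[0]["generated_text"]

# 3. Reason in text: hand both modalities to a language model and print its reply.
prompt = f"A user said: '{speech}'. The attached image shows: {caption}. Reply helpfully."
print(responder(prompt, max_new_tokens=80)[0]["generated_text"])
```

In the tool-augmented agent frameworks listed below (LangChain, CrewAI), each of these steps typically becomes a tool the agent can invoke on demand rather than a fixed linear script.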
Multimodal representation learning (CLIP, FLAVA, Flamingo)
Real-time speech + text agents with Whisper and SeamlessM4T
Tool-augmented agents using LangChain and CrewAI
Multimodal retrieval and long-horizon context with memory buffers (see the retrieval sketch after this list)
Video generation models like Sora and Make-A-Video
Autonomous agent design, multimodal reinforcement learning, and hybrid AI systems
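As a taste of the multimodal retrieval topic, here is a minimal CLIP-based sketch: embed one text query and a handful of images in a shared space, then rank the images by similarity. Again, this is an illustrative assumption rather than book code; the image file names are placeholders and the checkpoint is the public openai/clip-vit-base-patch32.

```python
# Illustrative sketch only: CLIP-style text-to-image retrieval via a shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["beach.jpg", "whiteboard.jpg", "forest.jpg"]  # hypothetical local files
images = [Image.open(p) for p in image_paths]
query = "someone sketching an architecture diagram"

# Encode the text query and all candidate images in one batch.
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); a higher score means a closer match.
scores = outputs.logits_per_text[0]
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:6.2f}  {path}")
```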
AI engineers & ML developers building production-grade AI systems
Technical product teams deploying intelligent assistants and agents
Researchers exploring cross-modal representation and fusion
Advanced practitioners working with vision-language or speech-language models
Builders experimenting with multimodal LLMs like GPT-4o, Gemini, and Claude