
Multimodal Systems in Practice: A Technical Guide to Agents, Tools, and Architectures That Combine Vision, Speech, and Language

by Laurenson, Ben

$9.75

List Price: $11.99
Save: $2.24 (18%)
  • In Stock - Ships within 24 hours with free online tracking.
  • FREE DELIVERY by Friday, July 18, 2025

Description

A Technical Guide to Agents, Tools, and Architectures That Combine Vision, Speech, and Language

Unlock the future of artificial intelligence with the first deeply practical guide to building and understanding multimodal AI systems - agents and architectures that can see, hear, speak, and reason.

Whether you're a machine learning engineer, AI researcher, or technical product leader, Multimodal Systems in Practice equips you with the tools, frameworks, and know-how to build powerful multimodal agents using vision, speech, and language models - all in real-world settings.

This definitive guide covers cutting-edge systems like GPT-4o, Gemini 1.5, Claude 3.5, ImageBind, Whisper, Sora, Runway, and more - showing how they work under the hood and how to integrate them in practical pipelines.


What You'll Learn:
  • How Multimodal AI Models Work: Understand the architectures behind vision-language models (VLMs), audio-text agents, and real-time multimodal perception.

  • How to Build with LangChain, Hugging Face & OpenAI Tools: Construct multimodal pipelines using popular frameworks (see the sketch after this list).

  • How to Combine Images, Audio & Text in One System: Step-by-step examples for building agents that speak, see, listen, and act in real time.

  • How to Evaluate & Deploy Multimodal Systems: Master benchmarking, memory management, and safety protocols for production-ready systems.

  • How to Navigate Ethics and Risks: Address hallucinations, deepfake risks, prompt injection attacks, and visual bias.
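
To give a flavor of the kind of pipeline discussed above, here is a minimal sketch (illustrative only, not excerpted from the book) that pairs a Hugging Face vision-language captioner with a Whisper speech-recognition checkpoint and fuses both into a single text prompt. The model names and file paths are placeholders chosen for the example.

# Minimal sketch (illustrative, not from the book): fuse vision and speech
# into one text prompt using Hugging Face pipelines. Assumes the
# `transformers` and `torch` packages are installed; model names and file
# paths below are placeholders.
from transformers import pipeline

# Image -> text: caption a still image with a vision-language model.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("scene.jpg")[0]["generated_text"]

# Audio -> text: transcribe speech with a Whisper checkpoint.
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")
transcript = transcriber("speech.wav")["text"]

# Text fusion: hand both modalities to a language model as a single prompt.
prompt = (
    f"Image caption: {caption}\n"
    f"Speech transcript: {transcript}\n"
    "Describe what is happening and suggest a next action."
)
print(prompt)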


Key Topics Include:
  • Multimodal representation learning (CLIP, FLAVA, Flamingo)

  • Real-time speech + text agents with Whisper and SeamlessM4T

  • Tool-augmented agents using LangChain and CrewAI

  • Multimodal retrieval and long-horizon context with memory buffers

  • Video understanding models like Sora and Make-A-Video

  • Autonomous agent design, multimodal reinforcement learning, hybrid AI systems


Who This Book Is For:
  • AI engineers & ML developers building production-grade AI systems

  • Technical product teams deploying intelligent assistants and agents

  • Researchers exploring cross-modal representation and fusion

  • Advanced practitioners working with vision-language or speech-language models

  • Builders experimenting with multimodal LLMs like GPT-4o, Gemini, Claude


Product Details

  • Pub Date: Jul 12, 2025
  • ISBN-13: 9798292171553
  • Language: English