Conducting the Space
Language as intention · Sensors as nuance · A live performance and installation
This work is conceived as a 45-minute live performance in which I "conduct" the system through spoken natural language. That linguistic intent is integrated with multimodal sensor data capturing the performer's body and the space — breathing, heart rate, EEG, IMU, and camera-based sensing — to generate and control, in real time, a unified environment of sound, spatial audio, visuals, and lighting.
After the performance, the same system transitions into a public, audience-accessible exhibition mode, allowing visitors to explore and interact with how language and bodily signals translate into spatial expression.
In this work, natural language carries "intention" — discrete, symbolic direction — while sensors carry "continuous nuance" — intensity, fluctuation, and embodied state. Using both simultaneously means that even the same verbal instruction can produce different outcomes depending on physiological state and the dynamics of the room.
Conduction becomes visible not as a fixed command, but as an ongoing negotiation with the space.
Internally, the work is not a single monolithic model. It is designed as multiple specialized agents orchestrated for real-time operation, sketched in code after the list:
◦ Conductor — structures spoken input into actionable directives
◦ Sensor Fusion — normalization and reliability estimation across sensors
◦ Domain Agents — specialized generation and control per domain (audio / spatial audio / visuals / lighting)
◦ Safety Governor — hard limits and prohibited behaviors, including a manual intervention path
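The sketch below shows one way these roles could be expressed in the Python control layer; every class name, limit value, and method signature is illustrative rather than the production API.

# Minimal sketch of the agent roles, assuming a Python control server.
# All names and limit values below are illustrative, not the production API.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Directive:
    """A structured directive produced by the Conductor (subset of the JSON control schema)."""
    domain: str                       # "cv", "dmx", "video", "posinet", "spatial_audio"
    params: dict = field(default_factory=dict)


class DomainAgent(ABC):
    """Specialized generation and control for one output domain."""
    @abstractmethod
    def apply(self, directive: Directive, sensor_state: dict) -> None: ...


class SafetyGovernor:
    """Hard limits and prohibited behaviors; can clamp or veto any directive."""
    LIMITS = {"dmx.strobe_rate_hz": (0.0, 10.0), "cv.filter_cutoff": (20.0, 16000.0)}

    def vet(self, directive: Directive) -> Directive:
        for key, (lo, hi) in self.LIMITS.items():
            domain, param = key.split(".")
            if directive.domain == domain and param in directive.params:
                directive.params[param] = max(lo, min(hi, directive.params[param]))
        return directive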
Overview — The performer's spoken language is captured, interpreted by Claude API in real time, and converted into a structured JSON control object. That JSON simultaneously drives an original modular synthesizer and software synths (CV), 32 spatial audio speakers, stage lighting (32 ceiling spotlights + 32 floor/space bar LEDs via DMX512), real-time generative graphics, and spatial tracking (PosiStageNet). On top of this discrete intent layer, multimodal sensor data — especially gesture — continuously modulates all output domains, adding physiological nuance that no verbal instruction alone could encode.
The performer's voice is captured via a low-latency, noise-cancelling wireless microphone and streamed to a speech-to-text engine running locally to minimize round-trip time. The transcribed text is sent to the Claude API with a carefully engineered system prompt that defines the JSON schema, safety constraints, and the mapping vocabulary between artistic intent and technical parameters.
Claude does not simply classify commands — it interprets compositional intent. A phrase such as "let the low frequencies drift stage-left while the upper partials scatter" is parsed into a structured JSON object containing per-domain directives with target values, transition curves, and timing envelopes. The system prompt includes the full parameter space for each output domain, ensuring that every response from the API maps directly to actionable control data without intermediate parsing.
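A minimal sketch of this request path using the Anthropic Python SDK follows; the system prompt is heavily abbreviated and the model identifier is a placeholder to be fixed during the version freeze.

# Minimal sketch of the transcription-to-directive request, using the Anthropic
# Python SDK. The system prompt is abbreviated; the model name is a placeholder.
import json
import anthropic

SYSTEM_PROMPT = """You are the Conductor agent for a live performance system.
Respond ONLY with a JSON object following the control schema (namespaces:
cv, dmx, video, posinet, spatial_audio, transition, meta). Emit only the
parameters that should change. Respect all range constraints in the schema."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def interpret(transcript: str) -> dict:
    """Send one transcribed utterance and parse the returned JSON delta."""
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder; chosen per latency/quality tests
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": transcript}],
    )
    return json.loads(response.content[0].text)

delta = interpret("let the low frequencies drift stage-left while the upper partials scatter")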
To maintain real-time responsiveness during a 45-minute performance, the system uses Claude API's streaming mode, allowing partial JSON to begin driving output parameters before the full response is complete. A local validation layer checks structural integrity and enforces range constraints before any value reaches the hardware layer.
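The validation layer could look roughly like the following; the range table is illustrative and, for brevity, only flat parameters are checked (nested blocks such as dmx.spots would be handled recursively in the full schema).

# Sketch of the local validation layer: structural checks plus range clamping
# before any value reaches the hardware layer. The range table is illustrative.
RANGES = {
    ("cv", "osc1_freq"): (20.0, 12000.0),
    ("cv", "filter_cutoff"): (20.0, 16000.0),
    ("spatial_audio", "azimuth"): (-180.0, 180.0),
}

def validate(delta: dict) -> dict:
    """Drop unknown namespaces and clamp numeric values into their allowed ranges."""
    allowed = {"cv", "dmx", "video", "posinet", "spatial_audio", "transition", "meta"}
    clean = {}
    for namespace, params in delta.items():
        if namespace not in allowed or not isinstance(params, dict):
            continue  # structural failure: ignore rather than risk driving hardware
        clean[namespace] = {}
        for key, value in params.items():
            lo_hi = RANGES.get((namespace, key))
            if lo_hi is not None and isinstance(value, (int, float)):
                value = max(lo_hi[0], min(lo_hi[1], value))
            clean[namespace][key] = value
    return clean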
The central JSON object acts as the system's nervous system — a single structured document that carries directives for every output domain simultaneously. The schema is versioned and strictly typed, with each domain receiving its own namespace:
{ "cv": { "osc1_freq": 220, "osc1_shape": 0.7, "filter_cutoff": 3200, "envelope": "slow_rise", ... },
"dmx": { "universe": 1, "spots": { "intensity": 0.6, "color_temp": 3200, ... }, "bars": { "r": 180, "g": 40, "b": 90, ... } },
"video": { "scene": "particles", "palette": "warm_drift", "density": 0.7, "gesture_reactivity": 0.9, ... },
"posinet": { "sources": [{ "id": 1, "x": -2.3, "y": 0.8, "z": 1.5 }], "interpolation": "cubic" },
"spatial_audio": { "engine": "dbs", "objects": [{ "id": 1, "azimuth": -30, "elevation": 15, "spread": 0.4 }] },
"transition": { "duration_ms": 2000, "curve": "ease_in_out" },
"meta": { "scene": "drift", "intensity": 0.65, "timestamp": 1719483920 } }
Each API response produces a delta — only the parameters that change are included, merged into the running state on the control server. This minimizes both API token usage and the risk of unintended parameter resets. The "transition" block allows Claude to specify how quickly changes should unfold, giving the AI compositional control over temporal dynamics, not just target values.
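A minimal sketch of that merge step — only keys present in the delta overwrite the running state:

# Sketch of merging an incoming delta into the running state: only the keys
# present in the delta are overwritten, so unspecified parameters never reset.
def merge_delta(state: dict, delta: dict) -> dict:
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(state.get(key), dict):
            merge_delta(state[key], value)   # recurse into nested namespaces
        else:
            state[key] = value               # scalar, list, or new subtree: replace
    return state

running_state = {"cv": {"osc1_freq": 220, "filter_cutoff": 3200}}
merge_delta(running_state, {"cv": {"filter_cutoff": 1200}, "transition": {"duration_ms": 4000}})
# running_state keeps osc1_freq = 220 while filter_cutoff and transition are updated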
CV (Control Voltage) — Original Modular Synth + Software
The JSON "cv" namespace maps directly to parameters of an original custom modular synthesizer and software synthesizers via a DC-coupled audio interface (e.g., Expert Sleepers ES-9) and MIDI/OSC. The modular system — designed specifically for this work — provides raw analog signal paths for oscillator frequency, waveshape, filter cutoff/resonance, amplitude envelopes, LFO rates, and modulation depths. Software synths run in parallel for polyphonic and sample-based layers. Claude can specify musically meaningful targets ("a dark, resonant drone at 55 Hz") which the system prompt translates into precise CV values and software parameters simultaneously. Transition curves allow smooth portamento, sudden cuts, or evolving timbral morphs across both hardware and software domains.
DMX512 — 32 Spotlights + 32 Bar LEDs
The "dmx" namespace addresses 64 lighting fixtures across two DMX universes via Art-Net or sACN over Ethernet. The rig consists of 32 motorized spotlights mounted on the ceiling grid — providing focused, directional light that can track, isolate, and sculpt the performer and space — and 32 bar LED fixtures positioned at floor level and along walls, used to wash the floor plane and saturate the spatial volume with color. Each fixture's channel map (RGB/RGBW, intensity, pan/tilt, zoom, strobe) is pre-registered in the system configuration. Claude's directives are translated to per-channel values at 44 Hz refresh rate. The system supports both individual fixture addressing and group-based scene control: Claude can describe high-level states ("ceiling spots narrow to a single white point, floor bars deep blue breathing slowly") or specify per-fixture granularity. The dual-layer design — focused spotlights above, diffuse color below — allows Claude to compose lighting as a spatial material rather than mere illumination.
PosiStageNet (PSN) — Spatial Tracking
PosiStageNet is an open protocol for transmitting real-time 3D position data across a show control network. In this system, it serves a dual purpose: (1) receiving tracked positions of the performer and physical objects from cameras or IMU sensors, and (2) transmitting virtual source positions to the spatial audio engine. The JSON "posinet" namespace defines source positions in a shared 3D coordinate system (meters, stage-centered origin), with interpolation modes that determine how positions transition — cubic for smooth spatial movement, linear for direct jumps, or spline-based for choreographed trajectories.
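A sketch of how these interpolation modes could be applied between position updates, using a smoothstep-style cubic ease; the production curves may differ.

# Sketch of the "interpolation" modes applied to a tracked or virtual source position.
# Smoothstep-style cubic easing is shown; the production curves may differ.
from dataclasses import dataclass

@dataclass
class Position:
    x: float
    y: float
    z: float

def interpolate(a: Position, b: Position, t: float, mode: str = "cubic") -> Position:
    """t in 0..1; 'linear' moves directly, 'cubic' eases in and out."""
    if mode == "cubic":
        t = t * t * (3.0 - 2.0 * t)          # smoothstep easing
    return Position(
        a.x + (b.x - a.x) * t,
        a.y + (b.y - a.y) * t,
        a.z + (b.z - a.z) * t,
    )

# Source 1 gliding from its current position to the JSON target over a transition
current, target = Position(0.0, 0.0, 1.5), Position(-2.3, 0.8, 1.5)
midpoint = interpolate(current, target, 0.5)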
Spatial Audio — 32-Speaker Array (d&b Soundscape / Yamaha AFC)
A 32-speaker array surrounds the performance space, driven by either d&b Soundscape (DS100 signal engine with En-Scene/En-Space processing) or Yamaha AFC (Active Field Control / AFC Image), selected based on venue infrastructure. The JSON "spatial_audio" namespace defines per-object parameters: azimuth, elevation, distance, spread (source width), and reverb send levels. Sound objects from the modular and software synthesizers are spatialized individually, mapped to the physical loudspeaker array via the engine's own room model. Claude can describe spatial behaviors in natural language ("voices circling the audience at head height, slowly rising") which are decomposed into object trajectories with per-frame position updates. The 32-speaker configuration provides full-sphere coverage with sufficient angular resolution for precise object localization — essential for making the spatial dimension of the performance legible to the audience.
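A sketch of such a decomposition — an orbiting, slowly rising object streamed as per-frame updates. The OSC address, port, and update rate are placeholders: the actual DS100 and AFC control dialects differ and are wrapped by the spatial-audio agent.

# Sketch of decomposing "voices circling the audience at head height, slowly rising"
# into per-frame object positions. The OSC address and port are placeholders.
import math
import time
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("10.0.0.20", 50010)   # assumed address of the spatial engine

def circle(object_id: int, duration_s: float, rate_hz: float = 30.0,
           start_elev: float = 0.0, end_elev: float = 25.0) -> None:
    frames = int(duration_s * rate_hz)
    for i in range(frames):
        t = i / frames
        azimuth = (t * 360.0) % 360.0 - 180.0                  # one full orbit
        elevation = start_elev + (end_elev - start_elev) * t   # slow rise
        client.send_message("/object/position", [object_id, azimuth, elevation])
        time.sleep(1.0 / rate_hz)

circle(object_id=1, duration_s=20.0)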
Real-time Generative Graphics
The "video" namespace controls real-time generative visuals projected or displayed within the performance space. Graphics are generated by a GPU-accelerated rendering engine (GLSL / openFrameworks / TouchDesigner) that receives both Claude API directives and raw sensor data simultaneously. Claude sets the compositional framework — scene selection, color palette, particle density, geometric structure, motion behavior — while the performer's gesture (IMU angular velocity, skeletal tracking) and physiological state directly modulate visual parameters in real time. A spoken instruction like "dissolve into scattered light, cool tones" triggers Claude to output a scene change with palette and density targets, while the performer's hand movements continuously deform, scatter, and reshape the visual field. This dual-input model means the graphics are never purely algorithmic nor purely gestural — they exist in the same negotiated space as every other output domain.
While Claude API provides the compositional skeleton — the "what" and "where" of each moment — the sensor layer provides the "how much" and "how intensely." Multimodal sensor data continuously modulates the parameters that Claude has set, adding a physiological dimension that transforms discrete instructions into living, breathing expressions.
The sensor fusion module normalizes data from five streams: respiratory rate and depth (chest-band or impedance pneumography), heart rate and HRV (PPG or ECG), EEG power bands (alpha, beta, theta — via a dry-electrode headband), 9-axis IMU (accelerometer, gyroscope, magnetometer — for gesture and posture), and camera-based skeletal tracking (body pose, facial expression, movement velocity).
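A sketch of the per-stream normalization, mapping each smoothed value into 0..1 against an adaptive running range; the time constants are illustrative.

# Sketch of per-stream normalization in the sensor fusion module: each raw value
# is smoothed and mapped into 0..1 against an adaptive running range.
class StreamNormalizer:
    def __init__(self, smoothing: float = 0.1, adapt: float = 0.001):
        self.smoothing = smoothing   # EMA coefficient for de-noising
        self.adapt = adapt           # how fast the observed range tracks drift
        self.value = None
        self.lo = None
        self.hi = None

    def update(self, raw: float) -> float:
        if self.value is None:
            self.value, self.lo, self.hi = raw, raw, raw + 1e-6
        self.value += self.smoothing * (raw - self.value)
        self.lo = min(self.lo + self.adapt * (self.value - self.lo), self.value)
        self.hi = max(self.hi + self.adapt * (self.value - self.hi), self.value)
        return (self.value - self.lo) / (self.hi - self.lo)

breath = StreamNormalizer()
depth_normalized = breath.update(0.42)   # e.g. one chest-band expansion sample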
Each sensor stream is mapped to specific modulation targets. For example: breathing depth scales the filter cutoff range that Claude specified — deeper breath widens the sweep. Heart rate variability modulates the DMX strobe/pulse timing, creating rhythmic lighting that synchronizes with the performer's cardiac state. EEG alpha-band power (associated with relaxed attention) scales the spatial audio spread parameter — greater calm produces wider, more diffuse sound fields. IMU angular velocity maps to both CV modulation depth and visual particle dispersion — faster gesture produces more aggressive timbral modulation and more explosive visual dynamics. Camera-tracked body pose drives graphic deformation matrices, so the performer's silhouette and movement directly sculpt the projected imagery.
Critically, sensor values do not override Claude's directives — they modulate within the ranges that Claude has defined. If Claude specifies a filter cutoff of 3200 Hz with a modulation range of ±1600 Hz, the breathing sensor sweeps within that 1600–4800 Hz window. This ensures that the AI's compositional intent is preserved while the performer's body adds continuous, unpredictable nuance. The result is that even when the same verbal instruction is repeated, the output differs each time — shaped by the performer's evolving physiological state.
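Expressed as a minimal sketch, using the filter-cutoff example above:

# Sketch of the modulation rule: sensors sweep within the range Claude defined,
# never outside it. Center 3200 Hz, modulation range ±1600 Hz, breathing depth 0..1.
def modulate(center: float, mod_range: float, sensor_01: float) -> float:
    """Map a normalized sensor value onto the Claude-defined window."""
    return center - mod_range + 2.0 * mod_range * sensor_01

cutoff_shallow = modulate(3200.0, 1600.0, 0.0)   # 1600 Hz at minimal breath depth
cutoff_deep    = modulate(3200.0, 1600.0, 1.0)   # 4800 Hz at full breath depth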
The end-to-end latency budget targets under 300 ms from spoken word to audible/visible change — within the threshold of perceived musical responsiveness. The budget is distributed as: speech-to-text ~80 ms (local Whisper inference on GPU), Claude API streaming first-token ~120 ms, JSON validation and dispatch ~5 ms, protocol transmission ~10 ms, and device response ~30–80 ms depending on the output domain. For sensor modulation, the latency is significantly lower (~15 ms total) as data flows directly from the sensor fusion module to the output layer without API involvement.
A central control server (Node.js / Python) acts as the orchestration hub, receiving JSON from the Claude API, merging it with the running state, applying sensor modulation, and dispatching to each output protocol simultaneously via dedicated network interfaces. All inter-process communication uses shared memory or local UDP to minimize overhead.
Because the AI landscape is evolving rapidly, the architecture is intentionally modular so that we can incorporate the most suitable tools during production based on defined performance and safety criteria.
For live reliability, we implement a version freeze window prior to presentation, and robust fallback behavior — safe presets and manual override routes — to handle network issues, recognition failures, or sensor dropouts.
sonicPlanet is a company specializing in AI-driven audio and lighting control plugins and software development. They will develop the core technology for this project.
By leveraging and extending SPAT AI — their already-released AI-powered spatial audio software — we ensure that development is reliable, cost-effective, and rapid. A proven foundation means less risk and more time for creative exploration.
Sinan Bokesoy, my long-time collaborator on sonicPlanet and Sonic Lab, will also participate in this project.
SPAT AI →
The space listens.
The body speaks.
The negotiation continues.