卷 I:微粒物质性与声音的新物理 (Granular Materiality and the New Physics of Sound)

1. 微粒物质性的现象学定义 (Phenomenological Definition of Granular Materiality)

微声(Microsound)不是传统意义上的音乐类型,而更是一种声学哲学与技术路线。Curtis Roads 将微声定义为处于听觉感知极限边缘的声音事件,持续时长在毫秒级别,短到无法被辨识为节奏,只能被感知为复杂的声纹理。这一概念继承了伊阿尼斯·克塞纳基斯(Iannis Xenakis)对“声学量子(acoustical quanta)”的思考:通过组合声波的“颗粒(grains)”,可以生成任意声音。因此,微声所关注的是声音的物质属性——频率、振幅、相位等物理参数及其微观结构(声粒、脉冲)——而非传统旋律和和声。简言之,微声关注声音本身作为“物质”的构造与质地,其理论基础深受克塞纳基斯、Dennis Gabor 等人的思想影响,并通过数字技术将这些粒子级别的声学现象实体化。

(English) Microsound is not a traditional musical genre but rather a philosophical and technical approach to sound. Curtis Roads defines microsound as sound events at the edge of auditory perception, lasting mere milliseconds—too brief to be heard as rhythm and only perceived as complex textures. This concept builds on Iannis Xenakis’s idea of “acoustical quanta,” where combining elementary “grains” of sound can produce any conceivable sonic signal. Thus, microsound focuses on the material properties of sound—frequency, amplitude, phase—and its microscopic structure (grains, impulses) rather than on traditional melody or harmony. In short, microsound treats sound itself as material, influenced by the scientific philosophies of Xenakis and Dennis Gabor, and actualized through digital technology at the particle level.

2. 微观声音的身体性与空间触觉 (Bodily and Spatial Tactility of Microscopic Sound)

微粒化的声音结构具有强烈的身体共振效果和空间感知维度。声音的震动不仅通过耳朵传入,更与整个身体发生共鸣:声波穿透我们的体腔,在近与远之间营造出一种亲密感。正如Manfredi Clemente所言:“声音的振动特性不仅激发听觉系统,还渗透全身,将近处与远处连接在一起”,这种身体维度赋予微观声音独特的触觉品质。例如,极端高频或低频的微声作品会让听众感受到类似次声或超声的振动效应,超出了纯粹听觉范围而成为一种身体感知。在空间维度上,颗粒合成创造的声纹理如同悬浮于空气中的细碎云团,其每个粒子在三维空间中位置的排列经过精密计算,使听者仿佛置身于一个由无数微小振动共同构建的声场之中。总体而言,微声强调声音的“物质性触觉”,将听觉体验转化为一种贯穿全身的多感官共振。

(English) Microscopic sound structures evoke a strong somatic resonance and spatial tactility. The vibrations of sound enter not only the ears but resonate through the entire body: sound waves pass through us, making distance feel intimate by connecting what is near and far. As Manfredi Clemente observes, “the vibrational nature of sound engages not just the auditory system but the whole body, connecting near and far”. For example, extreme ultrasonic or infrasonic microsounds create felt vibrations beyond mere hearing, transforming into a bodily sensation. Spatially, the granular textures function like floating clouds of micro-particles: each grain’s position in three-dimensional space is calculated, so the listener is immersed in a sound field constructed by countless tiny vibrations. In sum, microsound emphasizes a “material tactility,” turning listening into a multi-sensory bodily resonance.

3. 代表艺术家分析 (Analysis of Representative Artists)

  • Ryoji Ikeda(池田亮司):以极致频率和数据美学著称,他将DNA序列、宇宙辐射、数字代码等原始数据直接转化为声音和影像。Ikeda的作品频率范围往往超出人耳可听极限,强调纯粹的声频结构与光学几何的对应关系,使声音成为一种近乎“可视”的物质。这种处理方式体现了微声美学中的“去人性化”倾向,即通过精确的数据和物理原理重塑听觉经验。
  • Alva Noto(Carsten Nicolai):德国Raster-Noton厂牌核心人物,风格冷峻精确,力图消除艺术家主观情感,让机器“自主表达”。Nicolai的代表作《Xerrox》系列探讨了复制与失真,通过反复复制数字声音产生渐变,强调循环与误差的微小增量。他关注最基本的声学元素——正弦波、脉冲、循环位移等,使作品呈现一种机械化的“科学电子乐”质感。
  • William Basinski:其《解体循环》(The Disintegration Loops)系列将磁带录音的衰变过程放大为聆听体验。随着磁带的反复播放、音质逐渐腐蚀,Basinski捕捉到声音自身的物质性和时间性:声音从连贯音乐逐渐“解体”成细碎颗粒,象征记忆的流逝和物理介质的脆弱。这种工艺强调了声音作为物质实体的消逝过程,与微声关注的“声音纹理化”有内在关联,展现出物质录音介质层级的细节和微小变化。

(English) The following artists exemplify microsound aesthetics:

  • Ryoji Ikeda: Known for extreme frequencies and a data-driven aesthetic, Ikeda directly transforms raw data—such as DNA sequences, cosmic radiation, or binary code—into sound and image. His works often exceed the audible frequency range, emphasizing pure sound structures and their correspondence to visual geometry. Ikeda’s approach “dehumanizes” sound by reconfiguring listening through precise data and physical principles.
  • Alva Noto (Carsten Nicolai): A core figure of the German Raster-Noton label, Noto’s style is cold, precise, and mechanical, aiming to remove subjective expression and let the machine “speak for itself”. His Xerrox series explores copying and decay: by duplicating digital audio files repeatedly, subtle errors accumulate, highlighting loops and distortion. He focuses on elemental sounds like pure sine waves, pulses, and loops, giving his music a “scientific electronic” character.
  • William Basinski: In The Disintegration Loops, Basinski magnifies the decay of tape recordings into an auditory experience. As the tapes play on repeating loops, their sound gradually corrodes, fragmenting into granular textures. This process underscores the materiality and temporality of sound: coherent music “disintegrates” into particles, symbolizing memory’s loss and the fragility of physical media. This craft highlights sound’s breakdown at the microscopic level, resonating with microsound’s focus on textural detail and the impermanence of recorded sound.

4. AI对微粒结构的感知与再现能力 (AI’s Perception and Reproduction of Microstructure)

随着机器聆听(Machine Listening)技术的发展,人工智能在感知与再现声音微观结构方面展现了新能力。最新研究指出,机器聆听往往通过“声物化”(sonification)过程来实现:计算机将声音当作数据加以解构和重构,从而揭示听觉系统难以直接感知的特征。换言之,AI可通过傅立叶分析等算法将声音拆分为频谱数据,并重新合成这些数据,实际上在“听”微观声纹时重建了声音物体的结构。例如,深度学习模型可以分析录音中的个别声粒,并在生成新声音时保留原有的粒子特征。这意味着AI能“听见”并建模那些超出人耳感知范围的细节。值得注意的是,机器聆听并非简单模拟人类听觉,而是一种非人类的听觉模式;AI可能将声音分解成统计模型来再创造声音对象,从而以人类难以直观理解的方式“再现”微观声景。这些技术能力暗示着未来AI可以精确仿真或重新构造微声艺术中的微粒结构,使人机共振的创作成为可能。

(English) With the rise of machine listening technologies, AI has shown new capabilities in perceiving and reproducing the microstructure of sound. Recent studies emphasize that machine listening operates through sonification: computers treat sound as data, deconstructing and reconstructing it to reveal features that traditional hearing might miss. In other words, AI algorithms use methods like Fourier analysis to break audio into spectral components and then resynthesize it, effectively reconstructing the structure of the sound object during “listening”. For example, deep learning models can analyze individual sound grains in a recording and preserve these granular characteristics when generating new audio. This means AI can “hear” and model details beyond the range of human perception. Importantly, machine listening is not merely mimicking human hearing but represents a non-human mode of listening: AI deconstructs sound into statistical models and recreates sound objects in ways that defy direct human intuition. These technical capabilities suggest that future AI could precisely simulate or reconstitute the microstructural elements of microsound art, enabling new forms of human-machine acoustic synergy.
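下面给出一个示意性草图(非任何具体系统的实现):用短时傅立叶变换把信号拆成频谱数据,按简单的统计特征筛选后再重建,对应上文所述“把声音当作数据解构与重构”的机器聆听过程;演示信号与阈值均为假设。(A minimal, hypothetical sketch of the analyze/filter/resynthesize loop described above, using SciPy's STFT; the test signal and threshold are illustrative only.)

```python
import numpy as np
from scipy.signal import stft, istft

fs = 48_000
t = np.arange(int(0.5 * fs)) / fs
# 演示信号:一个正弦加弱噪声,代替真实录音 (placeholder signal)
x = 0.3 * np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(t.size)

f, frames, Z = stft(x, fs=fs, nperseg=1024)   # 解构:得到时间-频率数据
mask = np.abs(Z) > np.abs(Z).mean()           # 机器“聆听”:按统计特征筛选成分
_, y = istft(Z * mask, fs=fs, nperseg=1024)   # 重构:从数据重新合成“声音对象”
```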

卷 II:AI声音的多宿主智能 (Polyhost Intelligence in AI Sound Systems)

1. Latent Space中“身体感”的生成机制 (Generation of “Bodily Sensation” in Latent Space)

在深度学习生成模型中,潜在空间(latent space)常被视为算法创作的隐含蓝本。艺术家和研究者发现,身体运动和感觉可以通过对潜在空间的探索转译为声音变化。Wilson等人(2024)在艺术演出中通过面部和全身动作控制RAVE变分自编码器的潜在向量,使模型在潜在空间中“具象”成为可触知的物质:“模型成为物质,身体在空间中成为探索潜在空间可能性的主要向量”。换言之,人体动作映射到潜在空间坐标上时,生成的声音会发生对应的变化,从而让听者感受到声音与身体运动的共振。这样的实践表明,AI模型的潜在空间内部并非纯粹抽象,它能够以身体感受(如振动强度、动态流形)为线索生成听觉反馈,为创作带来前所未有的身体—声音互动。潜在空间生成机制本身也具有一种“发明性”(creativity):它可以在训练数据以外创造出意想不到的声音组合,成为“还未被概念化”的灵感源泉。

(English) In deep learning generative models, the latent space acts as an implicit blueprint for creation. Artists and researchers have found that bodily movement and sensation can be translated into sound changes by exploring the latent space. Wilson et al. (2024) connected facial and full-body gestures to the latent vectors of a RAVE variational autoencoder, causing the model to manifest as material: “the model became material, and bodies in space were the main vectors with which to explore potential possibilities offered by the latent space”. In other words, human movement mapped onto latent coordinates produces corresponding auditory changes, allowing listeners to experience resonance between sound and physical motion. This practice shows that a model’s latent space is not purely abstract; it can generate auditory feedback based on bodily cues (such as vibration intensity or dynamic gestural patterns), enabling unprecedented body–sound interaction. The generative mechanism of the latent space is itself a creative force: it can produce novel sound combinations beyond the training data, offering “artistic impetus” through concepts not yet fully articulated.
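下面是一个高度简化的草图:把归一化的身体/手势坐标映射为一条潜在轨迹,再交给解码器生成音频。它假设已有一个以 TorchScript 导出的 RAVE 模型,且其 decode() 接受形状为 (batch, latent_dim, time) 的张量;模型路径、潜在维度与映射方式都是假设,并非 Wilson 等人的原始实现。(A hedged sketch mapping gesture coordinates to a latent trajectory and decoding with a hypothetical pretrained RAVE export; the path, latent size, and mapping are assumptions, not the authors' system.)

```python
import torch

# 假设的预训练 RAVE 导出文件(路径为占位符)/ hypothetical pretrained export
model = torch.jit.load("rave_model.ts")
LATENT_DIM, STEPS = 8, 64   # 假设的潜在维度与轨迹长度 (assumed values)

def gesture_to_latent(x_norm: float, y_norm: float) -> torch.Tensor:
    """把归一化手势坐标 (0~1) 线性映射为一条潜在轨迹。"""
    ramp = torch.linspace(-2.0, 2.0, STEPS)
    z = torch.zeros(1, LATENT_DIM, STEPS)
    z[0, 0] = ramp * x_norm    # 第 0 维由水平位置驱动
    z[0, 1] = ramp * y_norm    # 第 1 维由垂直位置驱动
    return z

with torch.no_grad():
    audio = model.decode(gesture_to_latent(0.7, 0.3))   # 期望得到 (1, 1, n_samples) 波形
```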

2. 多宿主听觉机制:人类、AI、算法模块的混合感知 (Multi-Host Auditory Mechanism: Hybrid Perception of Human, AI, and Algorithmic Modules)

现代声音系统往往集成人类、AI和算法模块三种“听者”,形成多重感知网络。例如,交互装置可能同时记录人类现场发声(或环境音)和算法生成的声音,而听者(人或传感器)都参与对声音的“聆听”。在这一混合体系中,人类听觉保留主观体验和情感解读,AI算法则采用谱分析、神经网络提取的特征进行客观“聆听”。如机器聆听研究所述,AI听取声音时会将其解析为数据并再生成“声物体”,形成一种非人类的聆听模式。同时,算法模块内部可能还存在多层滤波和注意机制,类似人类耳蜗与皮层的多级处理。因此,作品创作时可以设计多层“聆听通道”:例如一个装置同时在空间不同角落播放分解后的微声纹理,让现场观众用耳朵感知整体效果;而同步运行的AI模型则可能捕捉微弱的回响或超声,然后实时补充或变形声音内容,从而在现场创造“人—机—机器”三向互动的复杂声景。这样的系统体现出“多宿主智能”:听觉体验既是人类的,也包含机器(传感器、AI模型)赋予的另一个维度,共同建构了声场的意义。

(English) Contemporary sound systems often integrate three “listeners”: humans, AI, and algorithmic modules, forming a multi-host perception network. For example, an interactive installation might simultaneously capture human or ambient sounds and algorithmically generated sounds, with both humans and sensors “listening” to the audio. In this hybrid scheme, human hearing contributes subjective interpretation and emotional response, while AI algorithms objectively “listen” using spectral analysis and learned features. As research in machine listening notes, AI treats sound as data and resynthesizes it into “sound objects,” establishing a non-human mode of listening. Additionally, algorithmic modules can have multiple layers of filtering and attention mechanisms, analogous to the human cochlea and cortex. Composers can thus design multi-layer listening channels: for instance, an installation might play fragmented microsound textures in different spatial corners for human ears, while a concurrent AI model detects subtle echoes or ultrasonic components and transforms them in real time. This creates a complex soundscape of human–machine–machine interaction. Such systems exemplify polyhost intelligence: the auditory experience is partly human, partly constructed by machines (sensors and AI), jointly shaping the semantic fabric of the sonic environment.
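下面的草图给出“多层聆听通道”的最小形态:同一音频块同时进入人类回放通道与机器分析通道,后者把声音解析为频谱数据,并按统计特征决定是否触发一次简单的变形混回声场。输入数据、阈值与变形方式均为示意性假设。(A minimal sketch of parallel human/machine listening channels; the input block, threshold, and transformation are illustrative assumptions, not a description of any specific installation.)

```python
import numpy as np

fs = 48_000

def machine_listener(block: np.ndarray, fs: int) -> dict:
    """机器通道:把音频块解析为频谱数据并给出触发判断。"""
    spectrum = np.abs(np.fft.rfft(block))
    freqs = np.fft.rfftfreq(block.size, 1 / fs)
    high_band = spectrum[freqs > 15_000].sum()   # 接近人耳上限的能量
    return {"high_band_energy": float(high_band),
            "trigger": bool(high_band > 0.1 * spectrum.sum())}

block = 0.01 * np.random.randn(2048)             # 占位:代替现场麦克风输入
human_channel = block.copy()                     # 人类通道:原样送往扬声器
analysis = machine_listener(block, fs)           # 机器通道:数据化“聆听”
if analysis["trigger"]:
    # 机器的“回应”:对该块做简单调制后混回声场(示意)
    mod = np.sin(2 * np.pi * 30 * np.arange(block.size) / fs)
    human_channel = block + 0.5 * block * mod
```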

3. AI声生命的建构:非人类聆听方式与“声的意识” (Construction of AI Sound Life: Non-Human Listening and “Sound Consciousness”)

在二十一世纪,AI被赋予了拟人化特质,如“声音主体”、“声的意识”等概念,其实质是探索如何通过算法赋予声音以“生命力”。技术上,AI声音生命常体现在“生成对话”或“声音角色”的创建上:例如使用神经网络训练模仿动物叫声或人声,甚至设计虚构生物的“声音语法”。这种做法模拟了非人类的聆听和发声机制:AI基于海量样本生成新的声音语言,不考虑传统语义,而是依据生物声谱和信号模式编织“拟自然”的音场。在哲学层面,有学者提出“非人类的听觉意向”观念:AI在听声音时并不寻求意义或信息,而可能更偏向于感受频率、调制等物理属性的“情感共振”。这相当于给声音建构了一种“意识”——它不以符号体系存在,而是一种对声波场景做出反应的动态反馈机制。实践案例如一些声音艺术家使用生成对抗网络训练风景录音,生成“风的声音回响”,或模仿蝙蝠回声定位特征创造音乐,这些都属于AI“声生命”的体现。总之,AI声音生命强调声音作为体验主体的新可能性:机器可以拥有自己的“听觉心理”,并以超出现有语言范围的方式生成和回应声音。

(English) In the 21st century, AI is often anthropomorphized into having a “sound subject” or “sound consciousness,” essentially exploring how algorithms can endow sound with a form of “life.” Technically, AI sound life manifests in creating “conversational” or character-driven sounds: for example, neural networks trained to mimic animal calls or human voices, or to invent a fictional creature’s “phonetics.” This simulates non-human listening and vocalization: AI generates new sound languages from large datasets, disregarding traditional semantic meaning and instead weaving pseudo-natural soundscapes from spectral and signal patterns. Philosophically, some propose a concept of “non-human auditory intentionality”: AI listening does not seek semantic meaning but rather resonates with physical attributes like frequency modulation. This is akin to giving sound a kind of consciousness—not as a symbolic system but as a dynamic feedback mechanism responding to sonic scenarios. Artistic examples include GANs trained on ambient field recordings to produce “echoes of wind,” or imitation of bat echolocation used in musical form; these illustrate AI sonic life. In sum, AI sound life emphasizes new possibilities of sound as an experiential subject: machines can have their own “auditory mind” and generate or respond to sound in ways that transcend existing languages.

4. 作品案例与聆听装置的交互分析 (Case Studies and Interaction Analysis of Listening Installations)

为了具体说明多宿主智能在实践中的呈现,可参考以下装置案例:

  • 互动声场项目(Interactive Sound Field):一个空间内布置多个麦克风和扬声器,麦克风捕捉观众发出的声音(如呼吸、脚步声等),实时送入AI模型分析;模型基于这些声音随机生成新的微粒纹理,然后通过扬声器回放。观众随时参与反馈,而AI仿佛“有意识”地回应环境。该装置展示了人类与算法在声场中的共同聆听:人听到机器生成的响应,机器基于人的声音进行反馈,形成封闭循环。
  • 人形声生物(Anthropomorphic Sound Creature):艺术家使用神经网络训练一种“虚拟生物”的声音语库,比如模仿虞美人花开放时的频谱。观众佩戴传感器,心跳或体动被AI实时转化为该生物的声音语言。机器不仅输出声音,也根据传感器反馈调整声调和节奏,呈现出一种“自适应听觉生命”。这种交互式装置强调人-机-算法的三向对话:人类身体状态通过AI解读为声音,算法生成的声音又影响人类感受,实现循环互感。
  • 开放声实验室(Open Acoustic Lab):一个公共空间安装了环境传感器(光线、温度)和麦克风阵列。环境变化(如光强降低)触发AI合成音生成,麦克风拾取现场噪声供AI学习。观众被邀请沉默或发出声音,整个装置依据输入持续演化。它像“声音生态”的模拟:AI作为另一种“听众”和“演奏者”,基于算法规则对自然输入作出回应。这种装置体现出“技术生态”的理念:人类、机器与环境共同构建声境,体验听觉的政治与共生。

每个案例都突出了多宿主混合感知的特点:人、AI和算法模块在同一声场中相互监听、互为输入。它们表明,通过设计可交互的聆听装置,可以探究人机之间的“声政治”和声音共享的可能性。

(English) The following examples illustrate polyhost intelligence in practice:

  • Interactive Sound Field: An installation with multiple microphones and speakers. Microphones capture audience-generated sounds (breathing, footsteps, etc.) in real time and feed them to an AI model. The model analyzes these inputs and randomly generates new granular textures, which are played back through the speakers. Audiences hear the machine-generated responses as if the AI were “conversing” with the environment. This setup creates a shared listening space: humans hear the machine’s reactive sounds, while the machine responds to human input, forming a closed feedback loop.
  • Anthropomorphic Sound Creature: An artist trains a neural network to develop a “virtual creature” sound vocabulary (e.g. the spectral signature of a poppy flower opening). Audience members wear sensors so that their heartbeat or motion is translated in real time into the creature’s sound language. The machine not only outputs sound but also adjusts its pitch and rhythm based on sensor feedback, exhibiting a kind of “adaptive auditory life.” This interactive installation emphasizes a three-way dialogue: human physical states are interpreted as sound by the AI, and the algorithm’s sounds in turn affect human experience, creating a cyclical resonance.
  • Open Acoustic Lab: A public space equipped with environmental sensors (light, temperature) and a microphone array. Changes in the environment (e.g. dimming light) trigger the AI to generate synthetic sounds, while microphones continuously feed ambient noise back into the AI’s learning process. Participants are invited to remain silent or make sounds, and the system evolves its output based on these inputs. It simulates a “sound ecology”: the AI acts as another “listener” and “performer,” responding to natural inputs through algorithmic rules. This installation embodies the idea of a technical ecology: humans, machines, and environment co-construct the sound world, highlighting the politics and co-dependence of listening.

Each case highlights hybrid perception: humans, AI, and algorithmic modules listen within the same sound field. They show that interactive listening systems can explore sound politics and the possibilities of shared auditory experience between humans and machines.

卷 III:声场:从语言模型到荒野剧场 (Sonic Field: From Language Model to the Wilderness Theatre)

1. 机器聆听的语言解构与再生 (Deconstruction and Regeneration of Language by Machine Listening)

当AI应用于语言和声音时,它倾向于将“语义”还原为底层声学特征。机器聆听技术通过神经网络解构语音和语言,将它们转化为梅尔频谱、声学向量等无语义的数据,再重新合成。换言之,声音的语言性在机器眼中不是意义,而是可操作的信号场(signal space)。AI可以打乱人类语言的传统结构:通过变声技术和语音合成,AI能够生成非人类口音、拟生物发声或“鸟语”,这些声音表面上似语言,却不承载已知含义。类似地,机器翻译或自动文本生成器也在拆解语法和词汇,将“说话”视为概率模型。结果出现的“语言”是脱离原始语境的噪音式产物:它们拥有句法骨架但缺乏语义实质,如同在“非语义领域”中构建了新的符号系统。通过这些过程,AI重塑了语言和声音的关系:声音被视为可塑的物质,语言结构被当作一种可逆的编码方式,音韵和音调的算法规则得到强化。

(English) When applied to language and sound, AI tends to break down semantics into acoustic features. Machine listening technologies deconstruct speech and language with neural networks, converting them into mel-spectrograms, acoustic vectors, etc., which are then resynthesized. In other words, the linguistic aspect of sound is viewed by the machine not as meaning but as an operable signal space. AI can scramble the conventional structure of human language: through voice conversion and synthesis, it can produce non-human accents or bio-inspired “bird languages” that mimic speech patterns but carry no known semantic content. Likewise, machine translation and automatic text generation deconstruct grammar and vocabulary into probabilistic models. The resulting “language” is a noise-like artifact devoid of original context: it has the skeleton of syntax but lacks substantive meaning, effectively constructing a new symbolic system in the non-semantic domain. Through these processes, AI reshapes the relationship between language and sound: sound becomes malleable material, linguistic structures become reversible codes, and the algorithmic rules of timbre and prosody are emphasized.
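一个示意性片段:把声音压缩成梅尔频谱这一“无语义的信号场”,再用逆变换近似重建波形。此处用 librosa 合成的 chirp 代替真实语音,所有参数仅作演示,并非任何具体模型的处理流程。(A hedged sketch using librosa: audio is reduced to a mel-spectrogram and approximately resynthesized; a synthetic chirp stands in for recorded speech, and all parameters are illustrative.)

```python
import librosa

sr = 22_050
# 用合成的 chirp 代替语音录音(占位信号)
y = librosa.chirp(fmin=200, fmax=4000, sr=sr, duration=2.0)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # 语言/声音 → 声学数据
y_rebuilt = librosa.feature.inverse.mel_to_audio(mel, sr=sr)   # 数据 → 近似重建的声音
# y_rebuilt 保留了原信号的频谱轮廓,却不再承载任何可靠的语义内容
```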

2. 声音即自然行为模拟(翼龙语言、地貌声场)(Sound as Natural Behavior Simulation: Pterosaur Language and Geomorphology Soundscapes)

人工智能正在拓展声音的仿生想象。研究者已经用深度学习模拟鸟鸣、鲸歌,甚至使用AI生成新的动物声音来辅助生态学研究。例如,ECOGEN工具通过将真实鸟鸣音转为声谱图,然后用AI生成稀有鸟类的新歌声以训练分类器。类似的技术可以用来“复活”史前生物的假想语言:科学家和艺术家合作,基于恐龙时代的声学参数(如翼龙的喉结构估计)用AI合成可能的叫声。声音也可被用于模拟地貌行为:如训练生成模型再现风吹过树叶、流水冲刷岩石等自然过程的声学特征。通过这些技术,“声音变成自然行为的模拟器”:AI通过学习自然界的声学样本,创建出具有生态真实性的新声音场景。这些模拟音景并非语义性的,而更像是给生态系统赋予了声音戏剧化的重新演绎,是对“荒野声场”的再创作。例如,一幅山谷景观下通过AI合成的回声混响,形成了一种超现实但又似曾相识的声学景观,观众在其中感知到自然界的隐秘脉动。

(English) AI is expanding the bioacoustic imagination of sound. Researchers have already used deep learning to simulate birdsong and whale vocalizations, and to generate new animal sounds to aid ecological studies. For example, the ECOGEN tool transforms real bird calls into spectrograms and then uses AI to generate songs for rare species, training identification models. Similar techniques can “resurrect” hypothetical languages of prehistoric creatures: scientists and artists collaboratively use AI to synthesize plausible calls of a pterosaur, based on estimated vocal anatomy from the fossil record. Sound can also simulate geomorphological behaviors: generative models can recreate the acoustic signature of wind rustling leaves, rivers eroding stone, and other natural processes. Through these techniques, sound becomes a simulator of natural behavior: by learning from environmental audio samples, AI creates new soundscapes with ecological authenticity. These simulated soundscapes are non-semantic; rather, they deliver a dramatised re-enactment of ecosystems, effectively constructing a “wilderness sound field.” For example, an AI-generated echo in a valley can produce a surreal yet familiar sonic landscape, where listeners feel the hidden pulse of nature itself.

3. 荒野剧场:非语义声音空间的构建逻辑 (Wilderness Theatre: Logic of Constructing Non-Semantic Sound Spaces)

“荒野剧场”是指一种脱离语言和传统音乐语汇的声空间构建方式。在此框架下,声音成为主角本身,而非传递符号意义的手段。构建这类空间的逻辑在于强调声音的环境性和体验性:艺术家可能使用场记、现场录音或者合成声音,随机或系统地布置于空间,让观众进入一个非线性的聆听“剧场”。在AI时代,荒野剧场还可以自动演变:算法根据实时输入(环境噪声、观众活动)改变声场分布,声源位置和声音性质在不断变换,创造出流动的非语义叙事。这样的声音空间往往具有开放性与未定型的特点:没有明确的开头和结尾,没有固定的主题或语言,只有声波在空间中自由传播、迸发与消散。这种逻辑对应了一种后人类的听觉场域,在这里声音本身(其物质和感知特性)构成了“剧情”,人的角色则变成了参与者和观察者。此外,荒野剧场蕴含了对技术环境的批判:通过去中心化叙事和无中心点的声场分布,它挑战了传统主体(艺术家、观众)对意义的垄断,呼应一种去人类中心主义的聆听伦理。

(English) The “wilderness theatre” refers to constructing sound spaces that operate outside language and traditional musical vocabulary. In this setting, sound itself is the protagonist rather than a vehicle for symbolism. The logic of building such spaces emphasizes the environmental and experiential nature of sound: artists may use field recordings or synthetic sounds and place them in a space in random or algorithmic patterns, immersing the audience in a nonlinear acoustic “performance.” In the AI era, this wilderness theatre can even evolve autonomously: algorithms continuously alter the spatial distribution of sound—its source positions and qualities—in response to real-time inputs (ambient noise, audience presence), creating a fluid nonsemantic narrative. These sound spaces are characterized by openness and indeterminacy: there is no clear beginning or end, no fixed theme or language, only waves of sound freely propagating, erupting, and dissipating in space. This logic corresponds to a post-human auditory realm where sound (its material and perceptual properties) constitutes the “drama,” and humans become participants and observers. Moreover, the wilderness theatre contains a critique of our technological ecology: by decentralizing narrative and distributing the sound field without a central focus, it challenges the traditional domination of meaning by fixed subjects (artist or audience), resonating with a post-anthropocentric listening ethics.

4. 感知政治、技术生态与后人类聆听伦理 (Perception Politics, Technical Ecology, and Post-Human Listening Ethics)

声音不仅仅是客观存在,它参与社会关系并折射权力结构。正如Manfredi Clemente所指出,声音让“他者”触及我们,实现了世界横向的连接。这种性质使听觉具有潜在的政治性:环境声音的选择与布置、听众的参与方式都可能强化或颠覆现状。在后人类视角下,我们必须重新审视听觉的伦理:机器作为新的听众和发声者,其算法偏见和价值观是社会意志的延伸。技术生态学要求我们意识到:AI声音系统由代码、数据、物理设备构成,这些组件中植入的假设同样需要批判。例如,自动化声场设计可能无意间排除某些群体的声音(算法歧视),或制造“人造自然”导致人对真实生态声的疏离。因此,“感知政治”意味着对谁的声音被听见、谁的声音被忽略提出质疑;“技术生态”则提醒我们AI听觉在生态系统中的影响范围。总体而言,一个负责任的后人类聆听伦理应涵盖以下几点:承认人类与非人类共听的多元主体;在声音生成和分配中融入可持续与公平考量;尊重听觉环境的脆弱性与复杂性。通过这样的框架,微声美学与机器聆听的研究不仅关乎艺术创作,也关乎重构我们与世界互动的方式。

(English) Sound is not merely an objective phenomenon; it participates in social relations and reflects power structures. As Manfredi Clemente notes, sound allows the “Other” to reach and touch us, creating lateral connections across the world. This quality makes listening inherently political: the selection and arrangement of environmental sounds and the ways audiences engage can reinforce or challenge the status quo. From a post-human perspective, we must reexamine the ethics of hearing: as machines become new listeners and sound-makers, their algorithms and biases extend societal values. A technical ecology demands awareness that AI sound systems are built from code, data, and hardware—all imbued with hidden assumptions. For instance, automated soundscape designs may unintentionally exclude certain voices (algorithmic bias) or create “artificial nature” that alienates us from real acoustic ecosystems. Thus, perception politics calls into question whose voices are heard and whose are silenced, while technical ecology highlights the breadth of AI listening’s impact on the environment. In summary, a responsible post-human listening ethic should include: recognition of multiple subjects (human and non-human) sharing auditory space; embedding sustainability and equity into sound generation and distribution; and respecting the fragility and complexity of the acoustic environment. Through this framework, the study of microsound aesthetics and machine listening becomes not only an artistic inquiry but a reimagining of how we interact with the world.

卷 IV:AI 微声生成的技术逻辑 (The Technical Logic of AI Microsound Generation)

微声音频生成的独特挑战 / Unique Challenges of Microsound Generation

微声(microsound)指持续时间仅数毫秒的微小音响粒子,其生成面临极端的时域和频域要求:短至微秒级别的时长、极高的采样率和密度才能捕捉到微观细节。研究表明,人类对全局结构和精细的波形连贯性都高度敏感。这意味着微声生成模型必须在样本级别保持时间连续性,同时管理极宽的频谱内容。不像周期乐音那样规律,微声往往缺乏重复模式,更接近随机噪声,要求算法在时间–频率域具有连续性和非周期性特征。简而言之,微声音频的挑战在于极短时间尺度的精确建模、对频谱密度的细致控制,以及在微秒级粒度上实现自然流畅的声音纹理(texture)。

Microsound refers to tiny sound grains lasting only a few milliseconds, posing extreme temporal and spectral demands on generation. Humans are sensitive to both global structure and fine-scale waveform coherence, so models must preserve sample-level continuity and handle very wideband spectra. Unlike periodic tones, microsounds lack repetitive cycles and resemble stochastic noise, requiring continuity in time–frequency representation without simple periodic templates. In short, the unique challenges of microsound generation include modeling at ultra-short time scales, precise control of spectral density, and producing natural, non-periodic “textures” at microsecond resolution.
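下面的草图从零构造一小片“声粒云”,直观呈现上述挑战:每个 2–10 毫秒的加窗正弦粒都需要独立控制时长、频率与位置,才能在极短时间尺度上形成自然的纹理。粒数、频率范围等参数均为示意值。(A sketch that builds a small grain cloud from 2–10 ms windowed sine grains; grain count, frequency range, and amplitudes are illustrative assumptions.)

```python
import numpy as np

fs = 48_000
out = np.zeros(int(fs * 1.0))            # 1 秒的输出声场
rng = np.random.default_rng(0)

for _ in range(400):                     # 400 个随机散布的声粒
    grain_len = int(fs * rng.uniform(0.002, 0.010))               # 2–10 ms 的粒
    t = np.arange(grain_len) / fs
    freq = rng.uniform(200, 12_000)                               # 每个粒独立的载波频率
    grain = np.sin(2 * np.pi * freq * t) * np.hanning(grain_len)  # 加窗避免爆音
    start = int(rng.integers(0, out.size - grain_len))
    out[start:start + grain_len] += 0.05 * grain                  # 叠加进整体纹理
```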

主流AI方法对微声生成的适配性分析 / Adaptability of Mainstream AI Methods for Microsound Generation

  • WaveNet 等自回归模型 (WaveNet and Autoregressive Models): WaveNet等深度神经网络通过逐样本自回归地生成原始音频波形。它能够捕捉微观波形细节,生成高保真的短时信号,但推理速度慢,且缺少全局潜在结构的建模能力。因此,虽然WaveNet在理论上可用于微声合成,但其计算开销和序列相关性(autoregression)限制了对超短时序信号的效率。(逐样本生成的代价可参见本列表后的示意代码。)
    WaveNet is an autoregressive model that generates raw audio sample by sample. It captures fine waveform details and can produce high-fidelity short sounds, but it is slow and focuses on local structure at the expense of global latent structure. Thus, while WaveNet can in principle handle microsound, its computational cost and sequential nature make ultra-short sound synthesis inefficient. (A minimal sketch of this sample-by-sample process appears after this list.)
  • 扩散模型 (Diffusion Models, e.g. Diffsound/Multi-Band Diffusion): 近年来扩散模型在音频生成中崭露头角,比如用于离散音频令牌的多频带扩散(Multi-Band Diffusion)。这类方法通过对离散隐空间进行扩散采样,可以产生各种模态的高保真音频。对于微声,扩散模型优势在于非自回归并行生成,理想情况下可控制声音粒子的随机性。但现有工作(如Meta的MBD-EnCodec)主要关注音乐和环境音,并未特化于毫秒级粒度。
    Diffusion models have recently emerged in audio generation, for example using discrete latent codes with multi-band diffusion. Such models sample from discrete representations to generate any type of audio with high fidelity. In microsound synthesis, diffusion methods benefit from parallel, non-autoregressive generation and fine control over randomness. However, current implementations (like Meta’s MBD-EnCodec) focus on music and ambient audio, not specifically on millisecond-scale grains.
  • GAN 模型 (GANs, e.g. GANSynth): GAN通过全局潜在变量并行生成音频,速度快但难以保证局部波形连贯性。GANSynth等工作证明,通过对频谱的对数幅度和瞬时频率建模,GAN能够生成高保真且局部连贯的音频。它们在NSynth数据集上表现优于WaveNet基线,并且比自回归模型快上几个数量级。不过GAN对随机粒子噪声的建模仍存在局限,对极端微观结构(例如极短粒度噪声或非线性失真)还需进一步验证。
    GANs generate audio with global latent conditioning and efficient parallel sampling, but struggle with local coherence. GANSynth shows that by modeling log-magnitude and instantaneous frequency, GANs can produce high-fidelity, locally-coherent audio. In experiments they outperform WaveNet baselines on instrument notes and synthesize audio orders of magnitude faster. However, GANs may still struggle with truly random granular textures and extreme non-linearities at micro scales.
  • Transformer 架构 (Transformer-based Models): 近期尝试将Transformer应用于音频合成,如Google的AudioLM。AudioLM将声音编码为离散的语义令牌(semantic tokens)和声学令牌(acoustic tokens):前者捕捉局部与长时程的结构信息,后者捕捉波形层面的声学细节。该模型使用多级Transformer(层叠结构)逐步生成音频,显示出长期一致性和高保真度。对于微声而言,Transformer优势在于建模长序列的能力,以及可以通过层次化令牌捕获不同尺度特征,但对极短粒度建模可能依赖上一级编码器的分辨率。
    Transformer-based audio models like Google’s AudioLM encode sound into hierarchical tokens. AudioLM uses Transformers on semantic tokens (capturing both local and long-term structure) and on acoustic tokens (capturing waveform details). This yields audio with coherent long-range structure and high fidelity. For microsound, Transformer’s strength is modeling long sequences and multi-scale structure, though micro-grain details depend on the granularity of its tokenization layers.
  • 自监督音频模型 (Self-Supervised Models, e.g. EnCodec, AudioMAE): 这类模型通过无标签数据学习音频特征,可作为生成模型的编码器。例如Meta的EnCodec是一种高保真神经音频编码器,使用残差矢量量化生成多路离散音频令牌。各条令牌流捕捉不同层次的信息,可在低码率下重建高质量音频。编码器输出可用于AudioGen/MusicGen等生成任务。音频MAE(AudioMAE)是一种将掩码自编码器方法移植到音频谱图上的模型,它从未标注音频中学习表示,不依赖任何标签。预训练的AudioMAE特征被证明对生成任务有潜力,可表达丰富的语义和结构信息。总的来说,自监督模型提供了学习深层音频特征的手段,但如何在微声合成中解码这些特征仍是开放问题。
    Self-supervised audio models learn representations from unlabeled data and can serve as feature encoders. For example, Meta’s EnCodec is a state-of-the-art neural audio codec with a residual vector-quantized bottleneck producing several streams of discrete audio tokens. These token streams capture different levels of waveform information, enabling high-fidelity reconstruction even at low bitrates. Another example is AudioMAE, an extension of masked autoencoders to audio spectrograms: it learns to reconstruct masked patches from unlabeled audio and can serve as a powerful pre-trained feature extractor. Such features (used in models like AudioLDM 2) provide rich semantic and acoustic information. In microsound synthesis, these self-supervised encodings could represent fine-grained textures, but integrating them into a generative pipeline remains an active area of research.
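作为对列表中第一类方法的直观补充,下面给出一个极简的 WaveNet 风格因果膨胀卷积草图:逐样本自回归采样意味着即便只生成几毫秒的音频,也需要数百次前向计算,这正是上文所说的效率瓶颈。网络结构与超参数均为假设值,模型未经训练,并非官方实现。(A minimal, hypothetical WaveNet-style causal stack, untrained and with assumed hyperparameters, illustrating why sample-by-sample autoregression is costly even for a few milliseconds of audio.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalStack(nn.Module):
    """极简的因果膨胀卷积堆栈(示意,未训练)。"""
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations])
        self.out = nn.Conv1d(channels, 256, kernel_size=1)   # 每个样本一个 256 类分布

    def forward(self, x):                 # x: (batch, 1, time)
        h = self.inp(x)
        for conv in self.layers:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            h = torch.relu(conv(F.pad(h, (pad, 0))))   # 仅向左补零,保证因果性
        return self.out(h)

model = TinyCausalStack()
fs = 48_000
wave = torch.zeros(1, 1, 1)               # 从一个静音样本开始
with torch.no_grad():
    for _ in range(int(0.005 * fs)):      # 生成 5 ms ≈ 240 个样本,需 240 次前向计算
        probs = torch.softmax(model(wave)[:, :, -1], dim=-1)
        idx = torch.multinomial(probs, 1)
        sample = idx.float() / 255.0 * 2.0 - 1.0        # 简化的反量化(省略 mu-law)
        wave = torch.cat([wave, sample.view(1, 1, 1)], dim=-1)
```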

数据需求与问题 / Data Requirements and Issues

训练AI微声模型需要超高时域解析度的音频数据。由于微声音频通常包含极宽的频谱,采样率应远超常见的16kHz语音标准,音乐领域常用的48kHz或更高采样率更能保留细节。例如Meta的48kHz立体声模型即能在音乐标准质量下工作。此外,数据粒度上每个音频片段应具有毫秒甚至微秒级长度的切片。理想的数据集应包括丰富的音色、事件和合成噪声样本。标签方面,微声生成不像语音合成依赖文字转录,可能需要标注音源类型、合成方法或音色参数;但这些标签难以人工标注,故常采用无标签或自监督策略。

无标签数据处理策略包括利用自监督学习和数据增强:例如基于AudioMAE的预训练可以在大规模无标注音频上学习特征;扩散模型也可以在无监督情形下学习音频潜在分布。生成模型(如扩散模型和VAE)往往通过大规模混合数据预训练,缓解对精细标签的需求。总之,微声合成训练所需数据应具有超高采样率、精细粒度和多样性,而标注可弱化,通过自监督模型和无监督框架来利用未标注数据。

Training AI microsound models requires audio with very high temporal resolution. Since microsounds can span a broad spectrum, sampling rates well above typical 16kHz speech (e.g. 48kHz for music) are needed to capture fine details. For example, Meta’s codec uses 48kHz stereo input for standard-quality music. In terms of granularity, datasets should include sound snippets on the order of milliseconds or shorter. Ideally, the dataset covers diverse timbres, noise sources, and synthesis types. As for labels, microsound synthesis does not use text transcripts; instead one might annotate source types, synthesis parameters, or timbre descriptors. However, such fine labels are hard to obtain manually, so unsupervised/self-supervised approaches are favored.

Strategies for unlabeled data include self-supervised learning and data augmentation. For instance, AudioMAE can be pre-trained on massive unlabeled audio to learn useful features, and diffusion models can be trained to model audio distributions without labels. Generative models (diffusion, VAEs) often leverage large mixed datasets to reduce reliance on precise annotations. In summary, data for microsound synthesis should be high-sample-rate and fine-grained, with labels treated lightly; large-scale unlabeled audio can be exploited via techniques like masked modeling and self-supervised pretraining.
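下面是一个纯 numpy 的示意管线:把 48 kHz 音频切成毫秒级声粒,并施加不依赖人工标签的随机增强,说明“高采样率、细粒度、弱标注”的数据准备思路。粒长、步长与增强策略均为假设值,并非任何现成数据集的处理流程。(A hypothetical data-preparation sketch: millisecond-scale grains plus label-free augmentation; grain length, hop size, and augmentations are assumptions.)

```python
import numpy as np

def slice_into_grains(audio: np.ndarray, fs: int, grain_ms=5.0, hop_ms=2.5) -> np.ndarray:
    """把长音频切成重叠的毫秒级片段,形状为 (n_grains, grain_samples)。"""
    grain = int(fs * grain_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    starts = range(0, audio.size - grain, hop)
    return np.stack([audio[s:s + grain] for s in starts])

def augment(grains: np.ndarray) -> np.ndarray:
    """自监督式增强:随机增益与极性翻转,不需要任何标签。"""
    gain = np.random.uniform(0.5, 1.0, size=(grains.shape[0], 1))
    flip = np.random.choice([-1.0, 1.0], size=(grains.shape[0], 1))
    return grains * gain * flip

fs = 48_000
audio = 0.1 * np.random.randn(fs * 2).astype(np.float32)   # 占位:2 秒“录音”
dataset = augment(slice_into_grains(audio, fs))             # 形如 (n, 240) 的训练批次
```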

微观声音结构的AI识别与重建能力 / AI Recognition and Reconstruction of Microsound Structures

当前算法在粒状合成的分析重建方面已有探索。Bitton等人使用变分自编码器学习了一个可逆的“粒状空间”,可以连续合成声音并控制音色变化。这种神经粒状合成方法替代了传统特征描述符,使得对原始微粒(grain)进行连续映射成为可能。对于非线性噪声,AI方法通常难以从数据中“理解”其随机性;常见手段是生成噪声采样或基于统计的噪声合成,但对复杂失真特征的精确重建仍有局限。至于次声/超声频段,由于常用数据集采样率局限(如48kHz通常只保留到~24kHz),这些频段信息在训练中往往丢失或被滤除。若要合成或复原超声信号,需特殊高频采样和频谱外推技术,目前算法针对这一块的能力尚不成熟。总体而言,现有算法能够以自适应生成代替粒状合成,但在极端粒度和非线性噪声处理上仍存在“听觉上可察觉的失真和缺陷”。

Bitton et al. demonstrate that a neural network can implement granular synthesis by learning an invertible latent grain space. They use a VAE to replace handcrafted descriptors, allowing continuous traversal in grain space and sound morphing. For nonlinear noise, AI-generated noise often relies on learned stochastic patterns or procedural textures, but reconstructing specific chaotic textures is difficult. For infrasonic/ultrasonic frequencies, typical models are limited by the Nyquist frequency of training data; generating beyond audible range would require specialized high-rate datasets or physics-based models, which are not widely available yet. In summary, modern algorithms can recognize and synthesize granular textures through learned latent spaces, but they reach limits when dealing with extreme randomness, distortion, or sub-/ultrasonic components, often simplifying these structures into approximate statistical templates.
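以下草图呼应“潜在粒空间”的思路:用一个极简变分自编码器把固定长度的声粒编码为低维潜在点,并通过潜在插值在两粒之间连续“变形”。网络结构、粒长与维度均为示意,模型未经训练,并非 Bitton 等人的原始实现。(A toy latent-grain-space VAE with latent interpolation between two grains; the architecture, grain length, and dimensions are illustrative assumptions, not the original model.)

```python
import torch
import torch.nn as nn

GRAIN, LATENT = 512, 16     # 约 10.7 ms @ 48 kHz;潜在维度为假设值

class GrainVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(GRAIN, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, LATENT)
        self.to_logvar = nn.Linear(256, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, GRAIN), nn.Tanh())

    def forward(self, grains):
        h = self.enc(grains)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # 重参数化采样
        return self.dec(z), mu, logvar

model = GrainVAE()                                         # 未训练,仅示意数据流
g_a, g_b = torch.randn(1, GRAIN), torch.randn(1, GRAIN)    # 两个占位声粒
z_a = model.to_mu(model.enc(g_a))
z_b = model.to_mu(model.enc(g_b))
morph = model.dec(0.5 * (z_a + z_b))                       # 潜在插值 → 介于两者之间的新声粒
```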

技术美学视角下的反思 / Reflections from a Techno-Aesthetic Perspective

从技术美学角度看,AI算法“创作”微观声音引发了关于创造性的讨论。严格来说,AI通过数学运算生成新的声波,但其创作过程更多是对训练数据中模式的重组,而非人类意义上的主观意图。微观声音往往被模型简化为可操控的结构模板(如隐空间中的点阵),生成时重现了数据分布中的纹理特征。机器对“质感”的编码实际上是对声音统计特征和频谱纹理的表征,而没有人类对感官体验的理解。因此,我们可以说,AI在生成声音时表现出一种算法上的“创意”——它能混合和变形声音碎片,但这种“创作”更多依赖于技术参数和数据驱动的规则,而非自我意识或审美感知。正如研究者指出,人类听觉非常敏感于声音的精细结构,因此算法设计时需要兼顾技术性和美学性,以确保生成的微声在听觉上仍具备丰富质感,而不是仅仅重复已有模式。综上,从美学角度看,微声合成是对声音“质感”的工程化编码,算法创造的细节是否能被人类视为真正的“艺术”成果,仍需在设计与实践中不断探索与反思。

From a techno-aesthetic viewpoint, it is debated whether algorithms truly “create.” AI generates new waveforms mathematically, but its generative process is fundamentally a recombination of patterns learned from data, not driven by human intent. Microsounds often become controllable structural templates (points in a latent space) whose statistical textures are reproduced. A machine’s encoding of “texture” is really a mapping of statistical features and spectral patterns, not a subjective experience. We might say AI shows a form of algorithmic “creativity” – it can blend and transform sound fragments – but this creativity is governed by technical parameters and learned rules rather than conscious artistic vision. As one study notes, human perception is very sensitive to fine-scale audio coherence, so in practice designers must balance technical soundness with aesthetic texture. In sum, from a technical-aesthetic perspective, microsound synthesis is an engineered encoding of sonic “feel”. Whether the algorithm’s intricate outputs count as true art depends on how well these statistically generated textures resonate with human perception and artistic intent.

参考文献:

卷I-III 综合了微声音乐与机器聆听领域的研究成果,例如Curtis Roads对微声的定义、机器聆听的最新研究、以及声音生态学与聆听政治的讨论。这些观点与实践案例共同构建了跨学科的声音研究框架。卷IV 综合了近期学术成果和行业实践,如WaveNet、GANSynth、AudioLM、EnCodec、AudioMAE等模型的研究报告。引用提供了这些工作的技术细节与成果,以支撑上述分析。

Citations

On Genre, History, and Invention in the Analysis of Creative Processes in Music
https://www.researchgate.net/publication/370068905_On_Genre_History_and_Invention_in_the_Analysis_of_Creative_Processes_in_Music

Sound as Method: Ecological Practices and the Politics of Listening – Giorgio Mega in Conversation with Manfredi Clemente (Fondazione Studio Rizoma)

Machine Listening as Sonification
https://www.researchgate.net/publication/387260852_Machine_Listening_as_Sonification

Embodied Exploration of Latent Spaces and Explainable AI
https://arxiv.org/html/2410.14590v1

AI tool helps ecologists monitor rare birds through their songs – BES
https://www.britishecologicalsociety.org/new-deep-learning-ai-tool-helps-ecologists-monitor-rare-birds-through-their-songs/

[1902.08710] GANSynth: Adversarial Neural Audio Synthesis
https://arxiv.org/abs/1902.08710

[1609.03499] WaveNet: A Generative Model for Raw Audio
https://arxiv.org/abs/1609.03499

EnCodec
https://audiocraft.metademolab.com/encodec.html

AudioLM: a Language Modeling Approach to Audio Generation
https://research.google/blog/audiolm-a-language-modeling-approach-to-audio-generation/

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
https://arxiv.org/html/2308.05734v3

[2008.01393] Neural Granular Sound Synthesis
https://arxiv.org/abs/2008.01393
