Tutorials

T1: Connecting Music Audio and Natural Language

Presenters: SeungHeon Doh, Ilaria Manco, Zachary Novack, Jong Wook Kim, and Ke Chen

Abstract: Language serves as an efficient interface for communication between humans as well as between humans and machines. Through the integration of recent advancements in deep learning-based language models, music understanding, search, and creation are becoming capable of catering to user preferences with better diversity and control. This tutorial will start with an introduction to how machines understand natural language, alongside recent advancements in language models and their application across various domains. We will then shift our focus to MIR tasks that incorporate these cutting-edge language models. The core of our discussion will be segmented into three pivotal themes: music understanding through audio annotation and beyond, text-to-music retrieval for music search, and text-to-music generation to craft novel sounds. In parallel, we aim to establish a solid foundation for the emergent field of music-language research and to encourage participation from new researchers by offering comprehensive access to 1) relevant datasets, 2) evaluation methods, and 3) coding best practices.
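
To give a concrete flavor of the retrieval theme, the sketch below shows how a free-text query can be matched against a music catalogue through a shared embedding space and cosine similarity. The encoders here are hypothetical placeholders standing in for the pretrained joint audio-text models discussed in the tutorial.

```python
import torch
import torch.nn.functional as F

def embed_text(queries):
    # Placeholder (hypothetical): in practice, a pretrained joint text-audio encoder.
    return F.normalize(torch.randn(len(queries), 512), dim=-1)

def embed_audio(n_tracks):
    # Placeholder (hypothetical): one embedding per catalogue track.
    return F.normalize(torch.randn(n_tracks, 512), dim=-1)

queries = ["mellow acoustic guitar", "energetic synth pop"]
text_emb = embed_text(queries)       # (2, 512)
audio_emb = embed_audio(1000)        # (1000, 512)

# Cosine similarity between each query and every track, then rank.
similarity = text_emb @ audio_emb.T  # (2, 1000)
top5 = similarity.topk(5, dim=-1).indices
print(top5)                          # indices of the 5 best-matching tracks per query
```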

Bios

SeungHeon Doh is a Ph.D. student at the Music and Audio Computing Lab, KAIST, under the guidance of Juhan Nam. His research focuses on conversational music annotation, retrieval, and generation. SeungHeon has published papers on music and language models at ISMIR and ICASSP and in IEEE TASLP. He aims to enable machines to comprehend diverse modalities during conversations, thus facilitating the understanding and discovery of music through dialogue. SeungHeon has interned at Adobe Research, Chartmetric, Naver Corp, and ByteDance, applying his expertise to real-world scenarios.

Ilaria Manco is a Ph.D. student at the Centre for Doctoral Training in Artificial Intelligence and Music (Queen Mary University of London), under the supervision of Emmanouil Benetos, George Fazekas, and Elio Quinton (UMG). Her research focuses on multimodal deep learning for music information retrieval, with an emphasis on audio-and-language. Her contributions to the field have been published at ISMIR and ICASSP and include the first captioning model for music, and representation learning approaches to connect music and language for a variety of music understanding tasks. Previously, she was a research intern at Google DeepMind, Adobe and Sony, and obtained an MSci in physics from Imperial College London. 

Zachary Novack is a Ph.D. student at the University of California San Diego, where he is advised by Julian McAuley and Taylor Berg-Kirkpatrick. His research is primarily aimed at controllable music and audio generation. Zachary seeks to build generative music models that allow for arbitrary musically salient control mechanisms and enable stable multi-round generative audio editing, publishing such work at ICML, ICLR, and NeurIPS. Zachary has interned at Adobe Research, contributing works such as DITTO that are being deployed in end-user applications. Outside of academics, Zachary is passionate about music education and teaches percussion in the Southern California area.

Jongwook Kim is a Member of Technical Staff at OpenAI, where he has worked on multimodal deep learning models such as Jukebox, CLIP, Whisper, and GPT-4. He has published at ICML, CVPR, ICASSP, IEEE SPM, and ISMIR, and he co-presented a tutorial on self-supervised learning at the NeurIPS 2021 conference. He completed a Ph.D. in Music Technology at New York University with a thesis focusing on automatic music transcription, and he has an M.S. in Computer Science and Engineering from the University of Michigan, Ann Arbor. He interned at Pandora and Spotify during his Ph.D. studies, and he worked as a software engineer at NCSOFT and Kakao.

Ke Chen is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of California San Diego. His research interests span music and audio representation learning, with a particular focus on downstream applications such as music generative AI, audio source separation, multi-modal learning, and music information retrieval. He has interned at Apple, Mitsubishi, Tencent, ByteDance, and Adobe to further explore his research directions. During his Ph.D. studies, Ke Chen has published more than 20 papers at top-tier conferences in the fields of artificial intelligence, signal processing, and music, such as AAAI, ICASSP, and ISMIR. Outside of academics, he enjoys various music-related activities, including piano performance, singing, and music composition.

T2: Exploring 25 Years of Music Information Retrieval: Perspectives and Insights

Presenters: Masataka Goto, Jin Ha Lee, and Meinard Müller

Abstract: This tutorial reflects on the journey of Music Information Retrieval (MIR) over the last 25 years, offering insights from three distinct perspectives: research, community, and education. Drawing from the presenters' personal experiences and reflections, it provides a holistic view of MIR's evolution, covering historical milestones, community dynamics, and pedagogical insights. Through this approach, the tutorial aims to give attendees a nuanced understanding of MIR’s past, present, and future directions, fostering a deeper appreciation for the field and its interdisciplinary and educational aspects.

The tutorial is structured into three parts, each based on one of the aforementioned perspectives. The first part delves into the research journey of MIR. It covers the inception of query-by-humming and the emergence of MP3s, discusses the establishment of standard tasks such as beat tracking and genre classification, and highlights significant advancements, applications, and future challenges in the field. The second part explores the community aspect of ISMIR. It traces the growth of the society from a small symposium to a well-recognized international community, emphasizing core values such as interdisciplinary collaboration and diversity, and invites the audience to imagine the future of the ISMIR community together. Lastly, the third part discusses the role of music as an educational domain. It examines the broad implications of MIR research, the value of pursuing a PhD in MIR, and the significant educational resources available.

Each part invites audience interaction, aiming to provide attendees with a deeper appreciation of MIR's past achievements and insights into its potential future directions. This tutorial is not just a historical overview but also a platform for fostering a deeper understanding of the interplay between technology and music.

Bios

Masataka Goto received the Doctor of Engineering degree from Waseda University, Tokyo, Japan, in 1998. He is currently a Principal Researcher at the National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan. In 1992 he was one of the first to start working on automatic music understanding and has since been at the forefront of research in music technologies and music interfaces based on those technologies. Over the past 32 years he has published more than 300 papers in refereed journals and international conferences and has received 68 awards, including several best paper awards, best presentation awards, the Tenth Japan Academy Medal, and the Tenth JSPS PRIZE. He has served as a committee member of over 120 scientific societies and conferences, including as General Chair of ISMIR 2009 and 2014, Program Chair of ISMIR 2022, and Member-at-Large of the ISMIR Board from 2009 to 2011. As research director, he began the OngaACCEL project in 2016 and the RecMus project in 2021, five-year JST-funded research projects (ACCEL and CREST) related to music technologies. He has given tutorials at major conferences, including ISMIR 2015, ACM Multimedia 2013, ICML 2013, ICPR 2012, and ICMR 2012.

Jin Ha Lee is a Professor and the Founder and Director of the GAMER (GAME Research) Group at the University of Washington Information School. She holds an M.S. (2002) and a Ph.D. (2008) in Library and Information Science from the University of Illinois at Urbana-Champaign. Her research focuses on exploring new ideas and approaches for organizing and providing access to popular music, multimedia, and interactive media, understanding user behavior related to the creation and consumption of these media, and using these media for informal learning in venues such as libraries and museums. She has been actively engaged with the ISMIR community since its early days and was at the forefront of user-centered MIR research at ISMIR, contributing a number of papers on user perception of music similarity and mood, music listening and sharing behavior, cross-cultural aspects of MIR, and human-AI collaboration. She served as Secretary of the ISMIR Board from its inception until 2015, as General Co-Chair of ISMIR 2021, and as Scientific Program Co-Chair of ISMIR 2014, 2020, and 2024. She also serves as an Editorial Board Member for the Transactions of the International Society for Music Information Retrieval.

Meinard Müller received the Diploma degree (1997) in mathematics and the Ph.D. degree (2001) in computer science from the University of Bonn, Germany. Since 2012, he has held a professorship for Semantic Audio Signal Processing at the International Audio Laboratories Erlangen, a joint institute of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS. His recent research interests include music processing, music information retrieval, audio signal processing, and motion processing. He was a member of the IEEE Audio and Acoustic Signal Processing Technical Committee (2010-2015), a member of the Senior Editorial Board of the IEEE Signal Processing Magazine (2018-2022), and a member of the Board of Directors of the International Society for Music Information Retrieval (2009-2021), serving as its president in 2020/2021. In 2020, he was elevated to IEEE Fellow for contributions to music signal processing. Currently, he also serves as Editor-in-Chief of the Transactions of the International Society for Music Information Retrieval (TISMIR). Besides his scientific research, Meinard Müller has been very active in teaching music and audio processing. He has given numerous tutorials at major conferences, including ICASSP (2009, 2011, 2019) and ISMIR (2007, 2010, 2011, 2014, 2017, 2019, 2023). Furthermore, he wrote a monograph titled “Information Retrieval for Music and Motion” (Springer 2007) as well as a textbook titled “Fundamentals of Music Processing” (Springer-Verlag 2015).

T3: From White Noise to Symphony: Diffusion Models for Music and Sound

Presenters: Chieh-Hsin Lai, Koichi Saito, Bac Nguyen Cong, Yuki Mitsufuji, and Stefano Ermon

Abstract: This tutorial will cover the theory and practice of diffusion models for music and sound. We will explain the methodology, explore its history, and demonstrate music- and sound-specific applications such as real-time generation and various other downstream tasks. By transferring techniques and models from computer vision, we aim to spark further research interest and democratize access to diffusion models for the music and sound domains.

The tutorial comprises four sections. The first provides an overview of deep generative models and delves into the fundamentals of diffusion models. The second section explores applications such as sound and music generation, as well as utilizing pre-trained models for music/sound editing and restoration. In the third section, a hands-on demonstration will focus on training diffusion models and applying pre-trained models for music/sound restoration. The final section outlines future research directions.
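
As a taste of the fundamentals covered in the first section, here is a minimal sketch of one denoising-diffusion training step using the common noise-prediction objective. The toy denoiser and data shapes are illustrative assumptions only; the models used in the hands-on demonstration are far larger and operate on audio representations.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Toy denoiser (assumption): real audio models use large U-Nets or transformers.
denoiser = nn.Sequential(nn.Linear(128 + 1, 256), nn.ReLU(), nn.Linear(256, 128))

x0 = torch.randn(16, 128)                             # clean training samples (dummy data)
t = torch.randint(0, T, (16,))                        # a random diffusion step per sample
noise = torch.randn_like(x0)
a_bar = alphas_cumprod[t].unsqueeze(-1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward (noising) process

# The network predicts the added noise from the noisy sample and the step index.
pred = denoiser(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))
loss = ((pred - noise) ** 2).mean()                   # simple noise-prediction objective
loss.backward()
```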

We anticipate that this tutorial, emphasizing both the foundational principles and practical implementation of diffusion models, will stimulate interest among the music and sound signal processing community. It aims to illuminate insights and applications concerning diffusion models, drawn from methodologies in computer vision.

Bios

Chieh-Hsin Lai earned his Ph.D. in Mathematics from the University of Minnesota in 2021. Currently, he is a research scientist at Sony AI and a visiting assistant professor in the Department of Applied Mathematics at National Yang Ming Chiao Tung University, Taiwan. His expertise is in deep generative models, especially diffusion models and their application to media content restoration. He organized an EXPO workshop at NeurIPS 2023 on “Media Content Restoration and Editing with Deep Generative Models and Beyond”. For more information, see https://chiehhsinjesselai.github.io/.

Koichi Saito is an AI engineer at Sony AI. He has been working on deep generative models for music and sound, especially solving inverse problems for music signals with diffusion models and diffusion-based text-to-sound generation. He has extensive experience in showcasing advanced diffusion model technologies to businesses and industries related to music.

Bac Nguyen Cong earned his M.Sc. degree (summa cum laude) in computer science from Universidad Central de Las Villas in 2015, followed by a Ph.D. from Ghent University in 2019. He joined Sony in 2019, focusing his research on representation learning, vision-language models, and generative modeling. With four years of hands-on industry experience in deep learning and machine learning, his work spans application domains such as text-to-speech and voice conversion.

Yuki Mitsufuji holds dual roles at Sony, leading two departments, and is a specially appointed associate professor at Tokyo Tech, where he lectures on generative models. He is an IEEE Senior Member and serves on the IEEE AASP Technical Committee (2023-2026). He chaired “Diffusion-based Generative Models for Audio and Speech” at ICASSP 2023 and “Generative Semantic Communication: How Generative Models Enhance Semantic Communications” at ICASSP 2024. For more information, see https://www.yukimitsufuji.com/.

Stefano Ermon is an associate professor at Stanford, specializing in probabilistic data modeling with a focus on computational sustainability. He has received Best Paper Awards from ICLR, AAAI, UAI, and CP, as well as an NSF CAREER Award. He also organized a course on diffusion models at SIGGRAPH 2023. For more information, see https://cs.stanford.edu/~ermon/.

Afternoon Session

T4: Humans at the Center of MIR: Human-subjects Research Best Practices

Presenters: Claire Arthur, Nat Condit-Schultz, David R. W. Sears, John Ashley Burgoyne, and Joshua Albrecht

Abstract: In one form or another, most MIR research depends on the judgment of humans. Humans provide our ground-truth data, whether through explicit annotation or through observable behavior (e.g., listening histories); humans also evaluate our results, whether in academic research reports or in the commercial marketplace. Will users like it? Will customers buy it? Does it sound good? These are all critical questions for MIR researchers, and they can only be answered by asking people. Unfortunately, measuring and interpreting the judgments and experiences of humans in a rigorous manner is difficult. Human responses can be fickle, changeable, and inconsistent—they are, by definition, subjective. Many factors influence human responses: some can be controlled or accounted for in experimental design, while others must be tolerated but can be ameliorated through statistical analysis. Fortunately, researchers in behavioral psychology have amassed extensive expertise and institutional knowledge related to the practice and pedagogy of human-subjects research, yet MIR researchers receive little exposure to research methods involving human subjects. This tutorial, led by MIR researchers with training (and publications) in psychological research, aims to share these insights with the ISMIR community. The tutorial will introduce key concepts, terminology, and concerns in carrying out human-subjects research, all in the context of MIR. Through the discussion of real and hypothetical human research, we will explore the nuances of experiment and survey design, stimuli creation, sampling, psychometric modeling, and statistical analysis. We will review common pitfalls and confounds in human research and present guidelines for best practices in the field. We will also cover fundamental ethical and legal requirements of human research. Any and all ISMIR members are welcome and encouraged to attend: it is never too early, or too late, in one’s research career to learn (or practice) these essential skills.
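
As one small, concrete example of the kind of statistical check discussed in the tutorial, the sketch below computes Cohen's kappa between two annotators to quantify how consistently humans label the same items; the labels are invented, and kappa is only one of several agreement measures used in practice.

```python
from sklearn.metrics import cohen_kappa_score

# Invented emotion labels from two annotators for the same six music clips.
annotator_a = ["happy", "sad", "happy", "calm", "sad", "happy"]
annotator_b = ["happy", "sad", "calm", "calm", "sad", "sad"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```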

Bios

Claire Arthur is an assistant professor in the School of Music and co-director of the Computational and Cognitive Musicology Lab at the Georgia Institute of Technology, and adjunct faculty in the School of Psychology. She received her PhD in music theory and cognition from Ohio State University under David Huron. Her research largely focuses on modeling musical structure from a statistical perspective, as well as examining the cognitive and behavioral correlates of those structures, especially as they relate to musical expectations and emotional responses. Her MIR-related research interests lie at the intersection of music perception, computational musicology, and emotion prediction, with an emphasis on melody, voice-leading, and harmony.

Nat Condit-Schultz is a Lecturer and the Director of the Graduate Program for the Georgia Tech School of Music. Nat is a musician, composer, and scientist, specializing in the statistical modeling of musical structure. Nat directs the Georgia Tech rock and pop bands, and teaches courses in research methodology, music psychology, and music production. Nat’s research interests include rhythm and tonality in popular music, the perceptual and structural roles of language and lyrics in music, and the music theory of hip-hop. As a performer, Nat plays electric and classical guitar; as a composer, he specializes in imitative counterpoint and complex rhythmic/metric ideas like polyrhythm, “tempo spirals,” and irama, realized through classical guitar, rock instrumentation, and Indonesian gamelan.

David Sears is Associate Professor of Interdisciplinary Arts and Co-Director of the Performing Arts Research Lab at Texas Tech University, where he teaches courses in arts psychology, arts informatics, and music theory. His current research examines the structural parallels between music and language using both behavioral and computational methods, with a particular emphasis on the many topics associated with pitch structure, including scale theory, tonality, harmony, cadence, and musical form. He also has ancillary interests in music on global radio, music and emotion, and cross-cultural research. His recent publications are listed in his Google Scholar profile.

John Ashley Burgoyne is Assistant Professor in Computational Musicology at the University of Amsterdam, teaching in the Musicology and Artificial Intelligence programmes and conducting research in the Language and Music Cognition unit at the Institute for Logic, Language, and Computation. His current research focuses on using psychometric approaches in combination with representations and embeddings from deep learning models to improve the interpretability of AI models and flexibility in the design of musical stimuli and experiments. As director of the Amsterdam Music Lab, he is also interested in citizen science and online experimentation, and leads a team developing the MUSCLE infrastructure for facilitating online experiments requiring fine control of audio and music.

Joshua Albrecht is an Assistant Professor of Music Theory at the University of Iowa and directs the Iowa Cognitive and Empirical Musicology Lab. His current research blends statistical and computational musical analysis with behavioral studies to model listeners’ perception of musical affect, melodic and harmonic complexity, and intonation. Working in a traditional School of Music, his research also focuses on applying computational methods to traditional historical and analytical problems, using compositional output as a proxy for investigating the cognition of historical compositional practices.

T5: Deep Learning 101 for Audio-based MIR

Presenters: Geoffroy Peeters, Gabriel Meseguer Brocal, Alain Riou, and Stefan Lattner

Abstract: Audio-based MIR (MIR based on the processing of audio signals) covers a broad range of tasks, including analysis (pitch, chord, beats, tagging), similarity/cover identification, and processing/generation of samples or music fragments. A wide range of techniques can be employed for solving each of these tasks, spanning from conventional signal processing and machine learning algorithms to the whole zoo of deep learning techniques.

This tutorial aims to review the various elements of this deep learning zoo commonly applied in Audio-based MIR tasks. We review typical audio front-ends (such as waveform, Log-Mel-Spectrogram, HCQT, SincNet, LEAF, and quantization using VQ-VAE or RVQ), as well as projections (including 1D-Conv, 2D-Conv, Dilated-Conv, TCN, WaveNet, RNN, Transformer, Conformer, U-Net, VAE), and examine the various training paradigms (such as supervised, self-supervised, metric learning, adversarial, encoder-decoder, and diffusion). Rather than providing an exhaustive list of all of these elements, we illustrate their use within a subset of (commonly studied) Audio-based MIR tasks such as multi-pitch/chord estimation, cover detection, auto-tagging, source separation, music translation, and music generation. This subset of Audio-based MIR tasks is designed to encompass a wide range of deep learning elements. For each task we address a) the goal of the task, b) how it is evaluated, c) popular datasets for training a system, and d) how it can be solved using deep learning (explained with slides and PyTorch code).
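
To give a flavor of the kind of material in the accompanying notebooks, here is a minimal PyTorch sketch pairing a log-mel front-end with a small 2D-CNN for auto-tagging. The shapes, tag count, and dummy audio are illustrative assumptions, not code from the tutorial itself.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel front-end followed by a small 2D-CNN tagger (all sizes are illustrative).
frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=512, n_mels=96)

tagger = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 50), nn.Sigmoid())           # 50 tags, multi-label output

waveform = torch.randn(8, 1, 22050 * 5)        # a batch of 5-second dummy clips
log_mel = torch.log1p(frontend(waveform))      # (8, 1, 96, frames)
tag_probs = tagger(log_mel)                    # (8, 50) per-tag probabilities
```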

The objective is to provide a 101 (introductory) lecture on deep learning techniques for Audio-based MIR. It does not aim to be exhaustive in terms of either Audio-based MIR tasks or deep learning techniques, but rather to give newcomers to Audio-based MIR an overview of how to solve the most common tasks using deep learning. It will provide a portfolio of code (Colab notebooks and a Jupyter book) to help newcomers tackle the various Audio-based MIR tasks.

Bios

Geoffroy Peeters is a full professor in the Image-Data-Signal (IDS) department of Télécom Paris. Before that (from 2001 to 2018), he was a Senior Researcher at IRCAM, leading research related to Music Information Retrieval. He received his Ph.D. in signal processing for speech processing in 2001 and his Habilitation (HDR) in Music Information Retrieval in 2013 from the University Paris VI. His research topics concern signal processing and machine learning (including deep learning) for audio processing, with a strong focus on music. He has participated in many national and European projects, published numerous articles and several patents in these areas, and co-authored the ISO MPEG-7 audio standard. He has been co-general-chair of the DAFx-2011 and ISMIR-2018 conferences, a member and president of the ISMIR society, and is the current AASP review chair for ICASSP. At Télécom Paris, he created the 40-hour program "Audio and Music Information Retrieval" for the Master-2 level "Data Science", which deals mostly with deep learning applied to MIR and which inspired this tutorial.

Gabriel Meseguer Brocal is a research scientist at Deezer with over two years of experience at the company. Before joining Deezer, he completed postdoctoral research at Centre National de la Recherche Scientifique (CNRS) in France. In 2020, he earned his Ph.D. in Computer Science, Telecommunications, and Electronics with a focus on the Sciences & Technologies of Music and Sound at IRCAM. His research interests include signal processing and deep learning techniques for music processing, with a focus on areas such as source separation, dataset creation, multi-tagging, self-supervised learning, and multimodal analysis.

Alain Riou is a Ph.D. student working on self-supervised learning of musical representations at Télécom Paris and Sony CSL Paris, under the supervision of Stefan Lattner, Gaëtan Hadjeres, and Geoffroy Peeters. Before that, he obtained a master's degree in mathematics for machine learning at École Normale Supérieure de Cachan (2020) and another in signal processing and computer science applied to music at IRCAM (2021). His main research interests are related to deep representation learning, with a strong focus on self-supervised methods for music information retrieval and controllable music generation. His work "PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective" received the Best Paper Award at ISMIR 2023.

Stefan Lattner is a research leader on the music team at Sony CSL Paris, where he focuses on generative AI for music production, music information retrieval, and computational music perception. He earned his Ph.D. in 2019 from Johannes Kepler University (JKU) in Linz, Austria, following research at the Austrian Research Institute for Artificial Intelligence in Vienna and the Institute of Computational Perception in Linz. His studies centered on the modeling of musical structure, encompassing transformation learning and computational relative pitch perception. His current interests include human-computer interaction in music creation, live staging, and information theory in music. He specializes in generative sequence models, computational short-term memories, (self-supervised) representation learning, and musical audio generation. In 2019, Lattner received the Best Paper Award at ISMIR for his work “Learning Complex Basis Functions for Invariant Representations of Audio.”

T6: Lyrics and Singing Voice Processing in Music Information Retrieval: Analysis, Alignment, Transcription and Applications

Presenters: Daniel Stoller, Emir Demirel, Kento Watanabe, and Brendan O’Connor

Abstract: Singing, a universal human practice, intertwines with lyrics to form a core part of profound musical experiences, conveying emotions, narratives, and real-world connections. This tutorial explores the commonly used techniques and practices in lyrics and singing voice processing, which are vital in numerous music information retrieval tasks and applications.

Despite the importance of song lyrics in MIR and the industry, high-quality paired audio & transcript annotations are often scarce. In the first part of this tutorial, we'll delve into automatic lyrics transcription and alignment techniques, which significantly reduce the annotation cost and enable more performant solutions. Our tutorial provides insights into the current state-of-the-art methods for transcription and alignment, highlighting their capabilities and limitations while fostering further research into these systems.
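
As an illustration of the training objective behind many transcription and alignment systems, the sketch below evaluates a CTC loss on dummy acoustic frames and character targets; the character vocabulary and tensor shapes are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

vocab = ["-", " "] + list("abcdefghijklmnopqrstuvwxyz")  # index 0 is the CTC blank
ctc = nn.CTCLoss(blank=0)

batch, frames, n_classes = 4, 200, len(vocab)
# Frame-wise log-probabilities from a hypothetical acoustic model, shape (T, N, C).
log_probs = torch.randn(frames, batch, n_classes, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, n_classes, (batch, 30))        # dummy character indices of the lyrics
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```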

Moreover, we present "lyrics information processing", which encompasses lyrics generation and leveraging lyrics to discern musically relevant aspects such as emotions, themes, and song structure. Understanding the rich information embedded in lyrics opens avenues for enhancing audio-based tasks by incorporating lyrics as supplementary input. 

Finally, we discuss singing voice conversion as one such task, which involves the conversion of acoustic features embedded in a vocal signal, often relating to timbre and pitch. We explore how lyric-based features can facilitate a model's inherent disentanglement between acoustic and linguistic content, which leads to more convincing conversions. This section closes with a brief discussion on the ethical concerns and responsibilities that should be considered in this area.

This tutorial caters especially to new researchers with an interest in lyrics and singing voice modeling, or those involved in improving lyrics alignment and transcription methodologies. It can also inspire researchers to leverage lyrics for improved performance on tasks like singing voice separation, music and singing voice generation, and cover song and emotion recognition.

Bios

Daniel Stoller is a research scientist at MIQ, the music intelligence team at Spotify. He obtained his Ph.D. from Queen Mary University of London in 2020, before researching causal machine learning at the German Center for Neurodegenerative Diseases (DZNE). Experienced in audio source separation as well as generative modeling and representation learning, he develops machine learning models and techniques that scale to high-dimensional data such as raw audio signals, publishing in both machine learning and audio-related venues. With a special passion for music, he has also worked extensively on lyrics alignment and on singing voice processing, including separation, detection, and classification.

Emir Demirel is a Senior Data Scientist at Music.ai / Moises, leading projects on lyrics and vocal processing. He obtained his Ph.D. at Queen Mary University of London as a fellow of the "New Frontiers in Music Information Processing" project under the EU’s Marie Skłodowska-Curie Actions. After completing his Ph.D., he joined Spotify’s Music Intelligence team, enhancing his expertise before moving to Music.ai. His research interests span lyrics transcription and alignment, speech recognition, and natural language processing, along with generative AI models.

Kento Watanabe is a senior researcher at the National Institute of Advanced Industrial Science and Technology (AIST), Japan. He received his Ph.D. from Tohoku University in 2018, and his work focuses on Lyrics Information Processing (LIP), natural language processing, and machine learning. He aims to bridge the gap between humans and computers in the field of music and language, and to improve interactions through advanced algorithms.

Brendan O’Connor has worked in music as a performer, composer, producer, teacher, and sound installation artist. He earned his Bachelor’s in classical music at the MTU Cork School of Music (Ireland), followed by his Master’s in music technology at the University of West London, specialising in the voice as the principal instrument in electroacoustic compositions. He then worked towards his Ph.D. in the field of singing voice conversion via neural networks at Queen Mary University of London. His research interests include the disentanglement of scarcely labelled vocal attributes, such as singing techniques. After completing his Ph.D., Brendan began working for a startup company in voice conversion, allowing him to continue working in his area of expertise with other researchers in the same field using state-of-the-art machine learning techniques.