Before we explain anything about how AI music apps work, we should tell you about Berk, because he's the reason the team finally understood it.

Berk is one of our backend engineers. Last winter he dropped a message in our team Slack at 2am that just read, "I think the model is hallucinating bridges."

He meant the song-bridge kind, not the Brooklyn kind. The output had a verse, a chorus, and then 12 seconds of audio that wandered off into a chord progression nobody on the team had asked for. We laughed about it for a week. Then it stopped being funny. The model was doing exactly what we asked. Generate music. It just had no real concept of a song.

That night we started understanding what we'd actually built. It isn't one model that "makes music." It's three of them, stitched together like a cover band where everyone learned their parts separately and met for the first time onstage. If you've ever pressed generate in Sonx or Suno or Udio, you've felt the result. The first time it works, your brain does a small double-take. After you see the wiring, though, the apps stop being magic and turn into what they actually are. A few research breakthroughs, duct-taped together, with a UX that politely hides the seams.

This is what we wish someone had told us three years ago, when we were still figuring it out.

Why this turned out to be such a brutal problem

Making music from a text description is, weirdly, harder than making an image. A high-resolution picture has more pixels in it than a single second of a song has audio samples, but pixels and samples aren't the same kind of difficulty. We had to have this explained to us a few times before it clicked.

A picture is one moment frozen flat. A song is a thousand of them in sequence, and the sequence has to mean something. The chorus has to come back, and you have to feel it coming back. The snare needs to land in the same spot every bar, like a dripping tap that never misses. The vocal melody has to resolve at the end of a phrase, not somewhere in the middle. When music doesn't do these things, listeners can rarely tell you what's wrong. They just say "this is weird," and they bounce.

Then there's the stack-of-pancakes problem. A pop song isn't one sound. It's vocals, drums, bass, keys, plus a fistful of effects on top, all of it in the same key and tempo. The model has to invent pieces that lock into each other. Last summer we shipped an internal build where the bassline was four cents flat against the vocal. Four cents is barely audible. Nobody on the team could put it into words. They just said "this song feels weird," which by then was a running joke. Our audio engineer caught it after about half a day with spectrograms. The kind of patience you only get from people who also tune pianos in their spare time.
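For the curious, a cent is one hundredth of a semitone, and the math for "how far off is this note" is one line. The frequencies below are examples, not the actual offending bassline.

```python
import math

def cents_between(f1_hz: float, f2_hz: float) -> float:
    """Pitch offset between two frequencies, in cents (100 cents = 1 semitone)."""
    return 1200 * math.log2(f1_hz / f2_hz)

# A bass note that should be A1 (55 Hz) but lands 4 cents flat:
detuned = 55 * 2 ** (-4 / 1200)              # ~54.873 Hz
print(round(cents_between(detuned, 55), 2))  # -4.0 -- roughly a tenth of a hertz at this pitch
```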

And then there's the legal layer. Boring topic. Probably the biggest constraint on the whole field. Almost every commercial song belongs to someone with lawyers, and the RIAA is paying close attention. The U.S. Copyright Office puts out policy reports on generative AI that quietly shift every few months, and each shift slightly reshapes what apps are allowed to ship. We probably spend something like 15% of our engineering time just keeping up. Which is too much. But also necessary.

The breakthrough wasn't a single model. It was learning to break the problem into stages, the way a producer would.

What's actually inside the box

Here's what's living inside almost every serious AI music app today, ours included. We're going to skip the math. Partly to keep this readable, partly because we'd embarrass ourselves trying to get it exactly right in public.

First: the model has to read your mind

Your prompt goes to a language model before it goes anywhere near audio. Not to make sound. To make a plan. Tempo, key, which instruments belong in the room, what the lyrics should be about, how the vocal should sit on top, and the song's structure. Verse-chorus-verse-chorus is the default, although hip-hop and ambient and certain dance subgenres want different shapes. (Some of us keep wanting to call this "Stage Zero." The rest of the team won't allow it.)

The output isn't audio. It's a kind of structured brief, the way a producer might scribble notes on a napkin before a session. The closer that brief is to what a real human producer would have written, the better the song lands. Last quarter we tested a handful of competing models on the same 50 prompts, mostly to figure out where we sat against the field. The weakest read prompts like a tax form and missed the subtext entirely. Lo-fi about missing someone came back upbeat, like a jingle for a credit card. The strongest caught the mood. We've put more engineering time into this part than into any other piece of the system, and we still don't think it's anywhere near solved. The prompt the team personally struggles with most is "make it good." We have no idea how to operationalize that. If you ever figure it out, our contact is in the footer.
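If it helps to picture the handoff, here's roughly the shape of thing we mean by a brief. The field names below are invented for illustration, not our real schema; the point is that everything downstream reads from one shared plan.

```python
from dataclasses import dataclass

@dataclass
class SongBrief:
    """Illustrative planner output -- the fields and values here are made up."""
    tempo_bpm: int
    key: str
    structure: list[str]        # section order the later stages will follow
    instruments: list[str]
    mood: str
    lyric_theme: str
    vocal_style: str

brief = SongBrief(
    tempo_bpm=82,
    key="F# minor",
    structure=["intro", "verse", "chorus", "verse", "chorus", "bridge", "chorus"],
    instruments=["lo-fi drums", "electric piano", "sub bass", "vinyl crackle"],
    mood="wistful, late-night",
    lyric_theme="missing someone who moved away",
    vocal_style="soft, close-mic, minimal vibrato",
)
```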

The same prompt template plays out wildly differently across different genres, and that's mostly the work of this first model.

Then: words, and figuring out how to sing them

If your song has vocals, a separate model takes over and writes the words. It also decides how each line gets sung. Pitch, rhythm, how each phrase shapes itself in the air. These are usually language models that someone took and pointed at a giant pile of song lyrics, until the model started catching things prose models don't bother with. The fact that a verse and a chorus do different jobs. Internal rhyme. How to land an emotion in eight syllables when you have forty syllables' worth of feeling. We ended up building our own dataset for this, partly because the public ones were too small, partly because lyrics datasets have, let's call them, complicated copyright situations.

The melody side is its own little world. The model picks which notes the singer hits and how long the singer holds each one. All of it has to fit the chord progression that came out of the first model. When the melody misses by a little, the vocal floats above the music in a way that's hard to articulate. It sounds like singing. It just doesn't sound like singing this song.
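A toy version of what "fitting the chord progression" means: given the chord the planner put under a bar, which melody notes feel resolved? Real models learn this statistically rather than from a lookup table; the note and chord tables here are just for illustration.

```python
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

CHORD_SHAPES = {"maj": [0, 4, 7], "min": [0, 3, 7], "7": [0, 4, 7, 10]}

def chord_tones(root: str, quality: str) -> set[int]:
    """Pitch classes that belong to a chord, e.g. chord_tones('A', 'min') -> {9, 0, 4}."""
    base = NOTE_TO_PC[root]
    return {(base + step) % 12 for step in CHORD_SHAPES[quality]}

def lands_on_chord(melody_note: str, root: str, quality: str) -> bool:
    """True if a melody note is a chord tone -- the notes that feel 'resolved' on a strong beat."""
    return NOTE_TO_PC[melody_note] in chord_tones(root, quality)

print(lands_on_chord("E", "A", "min"))   # True: E is the fifth of A minor
print(lands_on_chord("F", "A", "min"))   # False: F wants to resolve somewhere else
```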

Why early AI music sounded weird: in 2022 most models tried to do all of this in one pass. Vocals would drift in and out of tune, because the same model that was choosing the notes was also rendering the audio, and it didn't really know where the beat was. We pulled the two jobs apart into different models. Took two rebuilds and a long argument with a former colleague, but it was the right call.

Finally: making air actually move

Now the planning is done. Now we have to build sound.

There are basically two ways the field does this in 2026, and most apps mix them like ingredients in a blender. Diffusion models start with a bath of pure noise (literally just random samples, the audio equivalent of static on an old TV) and slowly denoise it into a waveform that matches the plan. Each step nudges a little randomness out and a little structure in. It's the same family of models that draws images in Stable Diffusion; Riffusion's early experiments literally fine-tuned Stable Diffusion on spectrograms, and the technique has gotten a lot more polished since. Diffusion sounds warm, the way a vinyl pressing sounds warmer than the same master streamed on Spotify. The downside is it's slow and expensive to run.
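If you've never seen the skeleton of a diffusion loop, it's shorter than the papers make it look. This is a deliberately naive sketch: the denoiser here is a stand-in function, and real systems work in a compressed latent space with a trained network rather than on raw samples.

```python
import numpy as np

def denoiser(noisy_audio, step, plan):
    """Stand-in for the trained network: predicts the noise present in the input.
    In a real system this is a large neural net conditioned on the plan/brief."""
    return noisy_audio * 0.1  # placeholder so the sketch runs

def generate(plan, num_samples=44_100, steps=50, rng=np.random.default_rng(0)):
    audio = rng.standard_normal(num_samples)      # start from pure noise ("TV static")
    for step in reversed(range(steps)):           # walk from noisy to clean
        predicted_noise = denoiser(audio, step, plan)
        audio = audio - predicted_noise           # nudge a little randomness out
        if step > 0:                              # keep a touch of noise until the last step
            audio = audio + 0.01 * rng.standard_normal(num_samples)
    return audio

waveform = generate(plan={"tempo_bpm": 82, "key": "F# minor"})
```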

Transformer models go the other way around. They generate audio one tiny chunk at a time, the way a language model generates text one word at a time, except each "word" here is a small slice of compressed audio. Google's research team published an early version called MusicLM, and Meta's MusicGen works the same way; between them they shaped a lot of what came after. Transformers are better at long-range structure. The chorus feels like a chorus, instead of a slightly different chorus each time it comes back.
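The transformer path, in the same toy spirit: audio gets compressed into a vocabulary of discrete tokens, and the model predicts the next one over and over, exactly like next-word prediction. Everything below is a placeholder except the shape of the loop.

```python
import random

def next_token_probs(token_history, plan):
    """Stand-in for the transformer: returns a probability for each audio token in the codebook.
    The real model attends over the whole history, which is what buys long-range structure."""
    codebook_size = 1024
    return [1 / codebook_size] * codebook_size  # uniform placeholder

def generate_tokens(plan, length=1500, seed=0):
    random.seed(seed)
    tokens = []
    for _ in range(length):  # each token covers tens of milliseconds in typical audio codecs
        probs = next_token_probs(tokens, plan)
        tokens.append(random.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens  # a separate decoder turns these tokens back into a waveform

token_sequence = generate_tokens(plan={"structure": ["verse", "chorus", "verse", "chorus"]})
```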

Most production apps in 2026 use a hybrid. Diffusion for the texture of sound, transformer logic for structural memory. We do. The math gets ugly fast, but if you want to go deeper, Wikipedia's page on diffusion models is, weirdly, one of the better-written things on the topic.

A simplified view of the three-stage pipeline: prompt understanding, then lyrics and vocal melody planning, then audio synthesis. Each stage is a different specialized model, run in tight succession.

About voice cloning, briefly

Voice cloning is the feature people have the most questions about, and also the most worry, and we think both reactions make sense.

The idea is almost embarrassingly simple. You record 10 to 30 seconds of a voice. The model learns the timbre, the pitch range, the small grain that makes a particular voice sound like a particular person. Then it sings lyrics it has never heard. Same family of architecture that lets a chatbot speak in a custom voice, adapted for singing, which turns out to be a brutal jump in difficulty. Singing involves sustained pitch, vibrato, and the kind of emotional shaping that nobody puts into a sentence like "your meeting starts in five minutes." (Real talk. The first version of our voice clone did fine on speech and produced something on the singing side that the team genuinely christened "drunk choir." We tore it down and started over.)

If you want a deeper primer on the underlying tech, the Wikipedia entry on speech synthesis is solid. Or just take our word for it.

We treat this carefully. On Sonx you can only clone your own voice. Consent is recorded inside the app, with a verification step that took longer to build than we expected. We won't accept uploads of celebrity voices, or anyone you can't prove you are. This isn't only ethics, though that's plenty by itself. It's also App Store policy. Apple has been getting stricter about voice-cloning since 2024 and we'd be off the store within a week if we got something wrong.

A deeper post about how the consent flow actually works (what's stored, what isn't, on-device verification, the audit trail) is something we'll publish in a few weeks. For now: voice cloning is a tool, the same way a knife is a tool. Whether the tool is useful or dangerous depends almost entirely on the gates you put around it.

Why two AI music apps don't sound the same

If you've fed identical prompts into two AI music apps and gotten different songs back, you've already done half this section yourself. Here's what makes the gap, in the order our team usually notices it during competitive testing.

Training data is the loudest variable. Apps trained on cleaner, bigger, and more genre-diverse libraries make more believable music. This is also the part of the stack with the most lawyers attached, which is why none of us publish exactly what we trained on. We have very good reasons. So does every competitor.

Model size matters too, but with the kind of diminishing returns that haunt every AI team. Bigger models, all else equal, sound more natural. They're also slower and cost more to run. Almost every app has settled on the same compromise: a smaller, fast model for instant generations, a larger one for the premium tier or longer renders. The arithmetic is grim. Doubling the size of a model buys you a 10 to 15 percent perceived-quality bump and roughly doubles the compute bill. Anyone selling you a clean win there is selling you something.

Post-processing is the part nobody talks about, and probably the easiest tell of who actually has audio engineers on staff. Mixing, mastering, noise reduction, the final EQ pass. All of it happens after the model is done. A great-sounding AI track often owes a quarter of its perceived quality to these final touches. This is where a lot of new AI music apps get caught. Decent model, mediocre output, because nobody on the team knows what a low-shelf cut at 80Hz does.
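For anyone wondering what a low-shelf cut at 80Hz actually does: it turns down everything below a corner frequency by a few decibels rather than removing it outright. Here's a minimal version using the standard audio-EQ-cookbook biquad formulas; the gain and frequency are example values, not our mastering chain.

```python
import numpy as np
from scipy.signal import lfilter

def low_shelf(audio, sample_rate, freq_hz=80.0, gain_db=-3.0, q=0.707):
    """Apply a low-shelf filter: boost or cut everything below freq_hz by gain_db."""
    a = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * freq_hz / sample_rate
    alpha = np.sin(w0) / (2 * q)
    cos_w0 = np.cos(w0)

    # Biquad coefficients from the widely used audio EQ cookbook (low-shelf case).
    b0 = a * ((a + 1) - (a - 1) * cos_w0 + 2 * np.sqrt(a) * alpha)
    b1 = 2 * a * ((a - 1) - (a + 1) * cos_w0)
    b2 = a * ((a + 1) - (a - 1) * cos_w0 - 2 * np.sqrt(a) * alpha)
    a0 = (a + 1) + (a - 1) * cos_w0 + 2 * np.sqrt(a) * alpha
    a1 = -2 * ((a - 1) + (a + 1) * cos_w0)
    a2 = (a + 1) + (a - 1) * cos_w0 - 2 * np.sqrt(a) * alpha

    return lfilter([b0 / a0, b1 / a0, b2 / a0], [1.0, a1 / a0, a2 / a0], audio)

# Tame muddiness below 80 Hz on a rendered track (toy signal here, not a real mix):
mix = np.random.default_rng(0).standard_normal(44_100)
cleaned = low_shelf(mix, sample_rate=44_100, freq_hz=80.0, gain_db=-3.0)
```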

And then there's a fourth factor that almost nobody talks about. Prompt UX. The apps that quietly upgrade your prompt by adding genre tags you didn't type, suggesting tempo, translating vague moods like "moody" or "epic" into specific musical choices, end up making better songs than apps that pass your raw text straight to the model. The user thinks the model is smarter. Mostly it's the app doing more homework before the model ever sees the request.
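A stripped-down example of what we mean by "doing more homework." The mood-to-parameter table below is invented for illustration; the point is that vague mood words become concrete musical choices before the model ever sees the request.

```python
# Hypothetical mood-to-parameter table; real apps use much richer (and learned) mappings.
MOOD_HINTS = {
    "moody": {"tempo_bpm": 78,  "key_quality": "minor", "tags": ["reverb-heavy", "sparse drums"]},
    "epic":  {"tempo_bpm": 140, "key_quality": "minor", "tags": ["orchestral", "wide stereo", "big drums"]},
    "chill": {"tempo_bpm": 90,  "key_quality": "major", "tags": ["lo-fi", "soft percussion"]},
}

def enrich_prompt(user_text: str) -> dict:
    """Turn raw user text into a fuller request: the original words plus inferred musical hints."""
    request = {"raw_prompt": user_text, "hints": {}}
    for mood, hints in MOOD_HINTS.items():
        if mood in user_text.lower():
            request["hints"].update(hints)
    return request

print(enrich_prompt("something moody about leaving a city"))
# {'raw_prompt': ..., 'hints': {'tempo_bpm': 78, 'key_quality': 'minor', 'tags': [...]}}
```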

Try the pipeline yourself

Sonx runs the full three-stage pipeline on your phone. Text-to-song, optional voice cloning, lyrics, music video. Free on iOS and Android.

If you're picking one

We do not envy anyone trying to choose between AI music apps right now. There are too many, and the marketing copy is mostly identical. So here's roughly what we look at when we're testing competitors internally, in the order we tend to bump into the answers.

How long does it take to hear a song. The first time you open an app, you should be playing back something inside 90 seconds. If the onboarding flow makes you create an account, watch a tutorial, and pick from 30 different genres before anything plays, the team optimized for the wrong thing. (We've made this mistake too, and felt the bounce in our analytics within a day.)

How well it follows your prompt. Type something specific and a little tricky. "Synthwave with female vocals about driving home at 3am." Did you get synthwave or just generic 80s? Female vocals or whatever default voice the app had loaded? This one prompt tells you most of what you need to know about how good the planner model is.

How natural the vocal sounds. Listen for the things that are hard to fake. Sustained notes. Breath between phrases. Consonants at the end of words, especially soft S sounds. Robotic vocals show their seams on long held notes the way a budget hairpiece shows itself in good lighting.

Whether you can edit. Can you regenerate just the chorus? Swap one instrument? Extend a section without nuking the rest of the track? Apps that let you iterate produce songs you'd actually keep. Apps that force you to start from zero produce one-time party tricks. We pushed our editable outputs into beta last month, partly because we were tired of re-rolling our own demos eight times.

Read the terms of service. Nobody reads terms of service. We know. Read this one anyway. Some apps grant you full commercial rights to whatever you make. Others quietly retain a license. If you have any plans to put the song anywhere outside the app, this matters more than people expect.

We'll do a head-to-head walkthrough of how the major apps actually stack up on each of these in our next post, with audio you can compare side by side.

What we think happens next

A few things we're watching for over the next year, in roughly the order we think they'll land.

Real-time generation. Today a two-minute song renders in 30 to 60 seconds. By the end of 2026, the gap will close enough that generation can keep pace with playback. AI music that responds to what you're doing while you're doing it, the way a film score reacts to a scene. The research is already there. Engineering is catching up.

Editable outputs. Right now, generating a song is mostly a one-shot affair. The next wave of apps will let you regenerate just the chorus, swap an instrument, or extend a section without disturbing what's around it. The underlying technique, a diffusion-based form of inpainting (regenerate a masked region, keep the rest pinned in place), is ready to ship. We're working on a version that's a little better than ready.
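A sketch of the masked-regeneration idea, with the same caveats as the diffusion loop earlier: a placeholder denoiser and raw samples instead of a learned latent space. The only new ingredient is the mask, which pins the frozen regions back to the original after every step so that only the chorus actually changes.

```python
import numpy as np

def denoiser(noisy_audio, step, plan):
    """Placeholder for the trained network, as in the earlier sketch."""
    return noisy_audio * 0.1

def regenerate_section(original, keep_mask, plan, steps=50, rng=np.random.default_rng(0)):
    """keep_mask is 1 where the track is frozen, 0 where we want a new take (e.g. the chorus)."""
    audio = np.where(keep_mask == 1, original, rng.standard_normal(len(original)))
    for step in reversed(range(steps)):
        audio = audio - denoiser(audio, step, plan)
        audio = np.where(keep_mask == 1, original, audio)  # pin the frozen parts every step
    return audio

track = np.random.default_rng(1).standard_normal(44_100 * 10)  # toy 10-second "song"
mask = np.ones_like(track)
mask[44_100 * 4 : 44_100 * 7] = 0                               # re-roll seconds 4 through 7 only
new_track = regenerate_section(track, mask, plan={"section": "chorus"})
```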

Vocals that no longer sound like vocals from an AI. The vocal quality in 2026 is good. It still has tells, like slight robotic edges, occasional wrong syllable stresses, breaths in places no human singer would put a breath. Newer architectures specifically tuned for singing voice synthesis are closing this gap quickly. Our guess is that by mid-2027, the average listener won't pick the AI vocal out of a lineup more often than chance.

And the bigger shift, which is the one we're most uncertain about. Multimodal generation. Apps that produce a song and a music video together, where the video is timed to the song's structure instead of pasted on top after the fact. We already ship this in beta. The rest of the field is moving the same direction. The harder question, which honestly we don't have a confident answer to, is whether the video model and the music model will share enough information for the result to feel like one piece of art instead of two things stapled together by a deadline. We lean optimistic but our track record on timing predictions is bad enough that you should heavily discount us.

TL;DR (with feelings)

The field is in a strange spot, if we're being honest about it. The pipeline we described above is settled. The improvements over the next year are incremental. Vocals get less robotic. Latency gets shorter. You can edit instead of starting over. After that, the question stops being whether AI can make a believable song. That fight is over. The question becomes what people will use the tool for once making a song stops being the bottleneck on creating something good. We don't have a confident answer for that. Nobody we trust does either.

If you want to play with the result, Sonx is free on iOS and Android. If you want to keep nerding out about this stuff, our journal will keep going deeper. Coming up is a head-to-head comparison of the big AI music apps, and a guide to writing prompts that produce songs you'd actually want to listen to twice.

FAQ

Is AI-generated music copyrighted?
In the US, you can't copyright a song the AI made on its own. The U.S. Copyright Office has been pretty firm about that one. What you actually own depends on the app you used. Some grant you full commercial rights, some quietly keep a license on what you make. Read the fine print before you stick anything on Spotify.
How long does AI music take to generate?
Usually 20 to 60 seconds for a 2 to 3 minute song. Length matters most. Server load and which model tier you're on also factor in. We've seen our own renders come back in 15 seconds on a quiet weekday evening, and closer to 90 on a Friday night when half the planet is also pressing generate.
Can AI clone my voice?
Yes, with most modern apps. The catch: any app worth using will only let you clone your own voice. You provide the sample, you record consent on the spot. Don't trust an app that lets you upload someone else's voice without checks. We don't, and the App Store wouldn't let us if we tried.
What's the difference between diffusion and transformer-based music models?
Diffusion starts with noise and chips away at it until music falls out. It sounds warmer. Transformers build the song one tiny chunk of audio at a time, which gives them better structural memory (the chorus actually feels like a chorus, not a slightly different chorus). Most apps in 2026 use a hybrid of both.