I can usually tell in the first week whether an AI transformation will work. Not from the technology. From the setup around it.
By RAND's account, more than 80 percent of AI projects fail, roughly twice the rate of IT projects that do not involve AI, and the failures are rooted in organizational and leadership decisions, not model performance. If the failures are organizational, then the signals that predict them are organizational too. That means they are visible before a single model is chosen, in the way the work is set up.
There are five I look for. None of them is about the technology. Each is a specific question I can answer in a thirty-minute conversation, and each separates the initiatives that compound from the ones that stall.
1. Is there a named sponsor, and is there calendar time against it?
The wrong question is "do you have executive support?" Everyone has executive support. Support is free to claim and means nothing.
The question I actually ask is narrower. Who is the named sponsor, and when do they next have time on their calendar for this? I am listening for a name and a recurring slot. No name, and there is no sponsor, only "leadership is behind it." A name but no calendar time, and the support is nominal. Nominal support evaporates the first time the work competes with something urgent.
The strong answer is specific. A named person, a standing slot, and a decision that already cost something: a senior person reassigned, a competing priority killed to make room. Sponsorship is measured in calendar time and hard choices, not statements of belief. If you cannot name the sponsor and the slot, you do not have a sponsor. You have encouragement.
2. Can they define success as a number, not a feeling?
If the only word for success is "faster," there is no definition of success.
So I ask: if this works, what number moves, by how much, by when, and who notices? The weak answers are the abstract ones. More efficient. More productive. Faster delivery. Those are feelings, not targets, and an initiative measured in feelings will be declared a success no matter what it delivers, because nothing was ever specified that could disprove it. That is not an accident. "Faster" is the answer that lets everyone avoid accountability for an outcome.
The strong answer names a metric at the level of delivery, not individual time saved, with a target, a date, and an owner. Cycle time on a specific workflow, from x to y, by Q3, owned by a named lead. That is something the team can be wrong about, which is exactly why it predicts success. A metric can be wrong. "Faster" never can.
3. Was the first use case chosen to learn or to be seen?
The first use case tells you whether the organization is trying to learn something or trying to look good. Those choices lead to different places.
I ask two questions. Why this use case first, and what will you know after it that you do not know now? The answer reveals the motive. Optics use cases get chosen because they will demo well to the board, or because a senior person is excited, or because they are visible. They tend to be high-stakes, hard to measure, and politically loaded, which is the worst possible profile for a first attempt. When an optics use case underperforms, and a first attempt usually does, the visibility that made it attractive becomes the thing that gets it cancelled.
The strong answer picks for learnability. A use case that is contained, measurable, run often enough to produce real signal, and safe to get wrong. It will not impress anyone in a board deck. It will teach the team what good looks like and how to know it, which is the only thing a first use case is for. You can pick a use case to learn from, or one to be seen with. Rarely both.
4. Can the team reach their own production data this week?
A team cannot evaluate AI against reality if reality is out of reach. That is what this signal tests. Can the people doing the work get to the actual conditions the system has to perform in, or only a sanitized version of them?
The concrete test is access to production data. Can the team pull the real data for this use case this week, themselves, without a multi-week access request? Then I watch for friction. If the data sits behind tickets, approval chains, a data team's roadmap, or a legal review that has not started, the work stalls in week two, and the delay gets blamed on the AI rather than on the access problem that caused it. Toy data and synthetic samples are the same failure in a friendlier form. They let the team build against a version of reality that does not exist, and an evaluation that does not reflect reality is worse than none.
The strong answer is unglamorous. The team already has hands-on access to representative production data, or there is a named owner who can grant it inside the week. This is not a technical detail. A team that cannot reach reality cannot tell whether the AI works, and an organization that keeps reality that far from the people building against it has told you how it operates. It is one of the most reliable signals precisely because it is hard to fake.
5. Is there a plan for the first version failing?
If there is no plan for the first version failing, the first failure will end the program. Not because the failure was fatal, but because no one expected it.
So I ask: when the first version underperforms, what happens next, and who decides? The answer I do not want is silence, or "we will evaluate then," or the unspoken assumption that version one will work. AI work is iterative. The first version is supposed to be wrong. An organization that has not internalized that will read the first weak result as proof the whole thing does not work, and cancel it at the exact moment it would have started to improve.
The strong answer treats failure as scheduled. A defined iteration loop. A decision-maker named in advance. A budget and a timeline that assume more than one version. And a sponsor who has already told leadership this will look rough before it looks good, so the first weak demo is a milestone, not a crisis. The question is never whether the first version fails. It is whether the failure was planned for. A program that cannot survive its first bad result was never a program. It was a bet.
None of them is hard to fix
What surprised me, after watching this play out again and again, is that none of these signals is difficult to fix. None requires a larger AI budget. None requires a better model. None requires waiting for the technology to mature. They are leadership decisions, made or ducked in the first week. That is why they predict. A constraint tells you what an organization cannot do. A choice tells you what it will do, and these are all choices.
What the five have in common
They are all organizational, and the setup they describe is what the technology amplifies. Point a model at a strong setup (a clear sponsor, a real metric, a use case chosen to learn from, access to reality, a plan for failure) and it compounds. Point it at a weak one and it amplifies that instead, faster and more expensively than before. The model does not change the setup it landed in. It only magnifies it.
So before you start, ask the five. If four of them are weak, you have not found a reason to wait. You have found the actual work. The technology is rarely the leading indicator. The setup always is.