AI in internal processes: guide for mid-market companies
Where AI has clear ROI, where it is marketing, and how to avoid paying OpenAI to do what a bash script does better.
The question "where should I apply AI in my company?" has two bad answers and one good one. The first bad answer is "everywhere": you will spend thousands on LLMs to replace tasks a 20-line script does better. The second bad answer is "nowhere because it is a bubble": you will miss real opportunities while competitors move. The good answer is: in problems with three specific characteristics.
When AI has clear ROI
Processes where AI pays for itself have these 3 characteristics:
- Unstructured input (free text, documents, audio, images). If the input is structured (database rows), you do not need AI; you need SQL.
- Output that tolerates a human-level error rate. If you need 99.99% transactional precision, AI does not apply yet.
- High volume or high human cost. AI carries a fixed engineering cost; at 50 executions per month, it does not pay for itself.
Cases where it works (with ROI examples)
- Support ticket classification and routing. 1,500 tickets/month; a first-level human agent costs US$1,800/month, while the AI costs ~US$80/month in API usage, plus one-off setup. Running costs are covered from month 1.
- Data extraction from invoice and contract PDFs. 800 documents/month handled manually = 60h of work; AI does 95% of it, with 2 hours of human review. Savings: ~58h/month.
- RFP and proposal response generation. Template + client context = first draft in 5 min vs 4 hours manually. Watch quality: always have a human review the draft before sending.
- Fraud or anomaly detection in data. Hard-coded rules do not scale to changing patterns; models do.
- Internal semantic search (intranet, documentation, code). Any company with > 200 internal documents benefits.
- Technical onboarding. A chatbot over internal documentation cuts tickets from new developers by 40%.
- Executive report generation from BI data. Monthly summary of 3 dashboards in one readable paragraph.
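The arithmetic behind the first two examples can be sketched as follows. The figures are the ones quoted in the bullets; the function names are illustrative, and the ticket calculation assumes the classifier absorbs the full first-level routing workload, which is a simplification.

```python
def monthly_net_saving(human_cost: float, ai_running_cost: float) -> float:
    """Monthly saving once the system is live; one-off setup cost is excluded."""
    return human_cost - ai_running_cost


def hours_saved(manual_hours: float, review_hours: float) -> float:
    """Human hours freed per month after accounting for review time."""
    return manual_hours - review_hours


# Ticket routing: US$1,800/month agent vs ~US$80/month in API usage.
ticket_saving = monthly_net_saving(human_cost=1800, ai_running_cost=80)

# Invoice/contract extraction: 60h of manual work vs 2h of human review.
extraction_hours = hours_saved(manual_hours=60, review_hours=2)

# Implied manual handling time per document: 60h spread over 800 documents.
minutes_per_document = 60 * 60 / 800

print(f"Ticket routing saves US${ticket_saving:,.0f}/month in running costs")
print(f"Extraction frees {extraction_hours:.0f}h/month")
print(f"Manual baseline: {minutes_per_document} min per document")
```

These are running-cost figures only. Setup cost has to be recovered separately before the case is net positive.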
Cases where it does NOT work (or not yet)
- Autonomous financial decisions. Any action that moves money requires final human validation.
- Replacing advanced technical support. AI does tier 1 well, tier 2 poorly, tier 3 terribly.
- Strict compliance processes without traceability. If a regulator can ask "explain this decision", pure AI is not enough; you need an auditable trail.
- "Human" email personalization. Customers recognize AI patterns after 3 emails. It damages the relationship.
- Structured tabular data analysis. Excel, SQL or a Python script do the same work more reliably and cheaply.
Real cost of an internal AI project
| Component | Typical cost |
|---|---|
| Discovery + solution design | US$3,000-6,000 (1-2 weeks) |
| Functional MVP development | US$8,000-18,000 (3-5 weeks) |
| API tokens (typical mid-market volume) | US$80-400/month |
| Maintenance and improvements | US$600-2,000/month |
| Observability setup (Helicone, Langfuse) | US$0-80/month |
A typical project breaks even in 4-7 months if it attacks a real process. If you do not break even within 9 months, the use case selection was wrong.
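A rough way to check the break-even claim against your own numbers is sketched below. The function name and the example inputs are assumptions chosen to sit inside the table's ranges; they are not measurements from a real project.

```python
import math
from typing import Optional


def break_even_months(
    setup_cost: float,
    monthly_gross_saving: float,
    monthly_running_cost: float,
) -> Optional[int]:
    """Whole months needed to recover setup cost, or None if it never pays back."""
    net_per_month = monthly_gross_saving - monthly_running_cost
    if net_per_month <= 0:
        return None  # running costs eat the whole saving
    return math.ceil(setup_cost / net_per_month)


# Hypothetical mid-range scenario:
#   setup   = discovery + MVP             -> US$15,000
#   running = tokens + maintenance + obs. -> US$1,000/month
#   saving  = labour freed by the system  -> US$4,000/month (assumed)
months = break_even_months(
    setup_cost=15_000,
    monthly_gross_saving=4_000,
    monthly_running_cost=1_000,
)
print(f"Break-even after {months} months")
```

If the result is above 9, or `None`, the rule of thumb above says to revisit the use case rather than the implementation.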
Which model should you choose?
- Claude Sonnet / GPT-4o: default for 90% of cases. High quality, reasonable latency, middle price.
- Claude Haiku / GPT-4o-mini: high volume, simple tasks (classification, basic extraction). 10-20x cheaper.
- Open-source models (Llama, Mistral): when data cannot leave your infrastructure. Self-hosted GPU = US$500-2,000/month floor.
- Embedding models (Voyage, OpenAI): for semantic search and RAG. Negligible cost.
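The selection rules in this list can be written down as a small decision function. This is only a sketch of the guidance above: the tier names, the task labels and the `pick_tier` function are illustrative, and embedding models are left out because they serve a different purpose (search and RAG) from generation.

```python
from enum import Enum


class Tier(Enum):
    SELF_HOSTED = "open-source model, self-hosted (Llama / Mistral class)"
    SMALL = "small hosted model (Claude Haiku / GPT-4o-mini class)"
    DEFAULT = "default hosted model (Claude Sonnet / GPT-4o class)"


# Tasks the list describes as simple enough for the small tier.
SIMPLE_TASKS = {"classification", "basic_extraction"}


def pick_tier(task: str, data_must_stay_on_prem: bool, high_volume: bool) -> Tier:
    """Apply the rules in order: data residency first, then volume and task type."""
    if data_must_stay_on_prem:
        return Tier.SELF_HOSTED
    if high_volume and task in SIMPLE_TASKS:
        return Tier.SMALL
    return Tier.DEFAULT


print(pick_tier("classification", data_must_stay_on_prem=False, high_volume=True).value)
```

Treat the result as a starting point: an eval set, not the rule, decides whether the chosen tier is good enough.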
Common mistakes
- Starting with the most expensive model. Test with Haiku/mini first; 60% of cases do not need more.
- Not measuring output quality. You need an eval set (50-100 cases) to run before every prompt or model change.
- No guardrails. LLMs hallucinate. Validate JSON output against a schema (e.g. Zod), retry with a corrected prompt, and fall back to a human if it fails twice.
- Bad RAG. RAG quality depends on chunking and embedding strategy, not the model. If your RAG is bad, it is not the LLM's fault.
- Ignoring compliance. If your data contains PII, sign a DPA with Anthropic/OpenAI, or route requests through a masking proxy.
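The guardrail pattern from the list (validate, retry with a corrected prompt, hand over to a human after two failures) might look like the sketch below. Zod is a TypeScript library; to stay dependency-free this version uses a hand-written check in plain Python instead. `call_model` is a hypothetical stand-in for whatever client you use, and the category labels are made up.

```python
import json
from typing import Callable, Optional

ALLOWED_CATEGORIES = {"billing", "technical", "account"}  # hypothetical labels


def validate(raw: str) -> dict:
    """Parse the model output and check its shape; raise ValueError if it is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON ({exc.msg})") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {sorted(ALLOWED_CATEGORIES)}")
    return data


def classify_with_guardrails(
    ticket: str,
    call_model: Callable[[str], str],
    max_attempts: int = 2,
) -> Optional[dict]:
    """Return validated output, or None so the caller can route the ticket to a human."""
    prompt = (
        'Classify this ticket. Reply with JSON like {"category": "billing"}.\n\n'
        + ticket
    )
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return validate(raw)
        except ValueError as err:
            # Feed the validation error back so the next attempt can correct it.
            prompt += f"\n\nYour previous reply was rejected: {err}. Reply with valid JSON only."
    return None
```

A `None` result is the signal to queue the item for human review instead of guessing. Logging every rejected reply also gives you new cases for the eval set.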
Recommended base architecture
For 80% of internal cases: FastAPI/Express -> LangChain/Vercel AI SDK -> Anthropic/OpenAI -> Postgres with pgvector for RAG -> lightweight frontend. Add Langfuse or Helicone for tracking. Effort: 3-5 days of a senior developer's time for the base architecture, then weeks depending on the case.
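The retrieval step of the RAG part of this stack can be illustrated in a few lines. In the architecture above the ranking would run inside Postgres (pgvector provides a cosine distance operator, `<=>`); the sketch below does the same ranking in memory so it runs without a database. The chunk texts and three-dimensional vectors are toy values, not real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy document chunks with made-up embeddings.
CHUNKS = [
    ("How to request VPN access", [0.9, 0.1, 0.0]),
    ("Expense report policy", [0.0, 0.2, 0.9]),
    ("Setting up the dev environment", [0.7, 0.6, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user question
context = top_k(query_vec, CHUNKS, k=2)
print(context)
```

The selected chunks are what gets pasted into the prompt as context, which is why chunking and embedding choices matter more to answer quality than the generation model.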
Should I train my own model?
Almost never. Fine-tuning only pays off if you have > 50,000 high-quality examples, a very specific problem, and the large model's latency or cost is blocking you. For 99% of mid-market companies, good prompting + RAG beats fine-tuning.
Will AI replace my operations team?
No. It will redistribute their time. The team will do fewer mechanical tasks (classify, transcribe, summarize) and more judgment tasks (review exceptions, improve prompts, optimize workflows). The company that understands this grows. The one that naively fires people loses the person who understands context when AI fails.
How long does the first case take?
From discovery to production: 4-6 weeks for a simple case (classification, extraction), 8-12 weeks for an agent with tools and memory. If someone promises less, you will get a demo, not an operational product.
What we recommend
Start with one case with clear, measurable ROI. Ideally classification or extraction over high-volume unstructured data. Budget: US$12-18k. Timeline: 6 weeks. Measure ROI at 3 months. If it pays, expand to 2-3 more cases. If it does not pay, do not insist; the case was wrong and stopping is better than doubling down.
The big trap right now is chasing autonomous agents before you have a simple case in operation. Companies that start with a "fully autonomous agent" spend 6 months realizing the simple classification bot would have delivered 80% of the value with 20% of the effort.