How I Built an AI Testing Tool from a ChatGPT Experiment
I never planned to build a product. My team was rewriting the same test cases every sprint, and I wanted to fix that. Not with a new process document. Not with another template. I wanted to see if AI could handle the repetitive part for us.
That experiment became Sarah — a Custom GPT that generates test plans, test cases, and automation code from API documentation. It took 13 versions to get right. No backend, no code, no development team. Just prompt engineering, knowledge files, and a lot of iteration.
Here's how it happened.
The Problem
If you've worked in QA for any real amount of time, you know what test case writing looks like. You sit down with a new feature. You already know the structure before you start — positive path, negative path, boundary conditions, validation rules, error handling. The categories don't change. The format doesn't change. The thinking patterns don't change.
What changes is the specific feature. Login today, payment flow tomorrow, user profile next week. But the underlying work — mapping requirements to test scenarios, writing preconditions, defining expected results — follows the same structure every time.
My team was spending hours on this every sprint. Good engineers, doing repetitive work that didn't need their full expertise. The test cases were solid, but the process was slow. And the slower it went, the more pressure built up on the testing timeline.
I'd been watching LLMs closely. When ChatGPT became available, I didn't think about products or startups. I thought: can this thing write a test case that my team wouldn't have to throw away?
First Attempt: Raw Prompting
The first version had no structure at all. It was ChatGPT in a browser tab and a prompt I kept editing in a text file.
I started simple — described a feature, asked for test cases, looked at what came back. The first results were generic. The kind of test cases you'd find in a textbook: "Verify that the login button works." Not useful for real work.
But when I started providing actual context — the requirements, the validation rules, the API contract, the response codes — the output improved. It wasn't ready to use as-is. But I was editing instead of writing from scratch, and that cut the time roughly in half.
The first version was this: a prompt template in a text file, ChatGPT in a browser, and copy-paste into a spreadsheet. No code. No automation. No infrastructure.
Here's what an early prompt looked like:
You are a senior QA engineer specializing in API testing.
Generate test cases for the following endpoint.
Endpoint: POST /api/users
Description: Creates a new user account
Request body:
- email (string, required, valid email format)
- password (string, required, min 8 chars, uppercase + number)
- name (string, required, max 100 chars)
Response codes:
- 201: User created
- 400: Validation error
- 409: Email already exists
Generate test cases covering:
1. Positive scenarios (valid inputs, successful creation)
2. Negative scenarios (missing fields, invalid formats)
3. Boundary conditions (min/max lengths, special characters)
4. Security considerations (injection attempts in fields)
5. Error handling (duplicate email, malformed requests)
Format: ID | Title | Preconditions | Steps | Expected Result | Priority

The pattern was clear: the more specific the input, the more useful the output. Vague prompts produced vague test cases. Detailed prompts — with actual field names, actual validation rules, actual response codes — produced something my team could work with.
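For the endpoint above, a useful generated row looked something like this (a representative example, not verbatim model output):

```
TC-users-neg-03 | Reject password without uppercase | API available; no user
with the test email exists | 1. POST /api/users with email=qa@example.com,
password=abcdef12, name=Test User | 400 Validation error: password must
contain an uppercase letter | High
```

The difference from the textbook-style output was in the specifics: real field names, a real violated rule, and a precise expected response.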
After a few weeks of this, I had a collection of prompt templates covering common API patterns: CRUD operations, authentication flows, file uploads, pagination, error handling. My team started using them. The feedback was consistent: the output saves time, but the manual workflow is painful.
Building Sarah as a Custom GPT
That feedback is what pushed me to build something more structured. OpenAI had released Custom GPTs — the ability to create a specialized ChatGPT assistant with its own instructions and knowledge files. Instead of copying prompt templates back and forth, I could package the whole approach into a single tool.
The first version of Sarah as a Custom GPT was rough. I took my best prompt templates, combined them into a system instruction, and uploaded a few reference documents about test case standards and API testing patterns.
It worked, but barely. The output was inconsistent. Sarah would forget formatting rules mid-conversation. The test cases would drift from the original requirements. Edge case coverage was spotty.
So I started iterating.
Building a Custom GPT is not a one-time setup. It's a feedback loop: use it, find where it fails, adjust the instructions, test again. Each version should fix specific problems from the previous one.
13 Versions of Getting It Right
Each version of Sarah addressed specific failures from the one before. Here's what that iteration looked like:
Early versions (v0.1–v0.3) focused on getting the basic output format right. Test cases needed consistent structure — ID, title, preconditions, steps, expected results. The system instructions had to be extremely explicit about formatting, or the output would vary between conversations.
Mid versions (v0.4–v0.7) tackled coverage quality. I added knowledge files with API testing patterns, common vulnerability types, boundary condition categories. Sarah started generating edge cases I hadn't explicitly asked for — things like Unicode characters in input fields, concurrent request handling, and timeout scenarios.
Later versions (v0.8–v0.11) worked on supporting different input types. I wanted Sarah to handle Swagger specifications, Postman collections, plain text descriptions, and even curl commands. Each input type needed different parsing instructions in the system prompt.
Recent versions (v0.12–v0.13) refined the modification workflow. Instead of regenerating everything when something needed to change, Sarah could update specific sections — add more negative scenarios for one endpoint, change the test framework, adjust priority levels. This made the tool practical for real iterative work.
# How Sarah's instructions evolved over 13 versions
## v0.2 — basic formatting
"Generate test cases in table format with columns:
ID, Title, Steps, Expected Result"
## v0.7 — structured categories
"For each API endpoint, generate test cases in categories:
- Positive: All valid input combinations
- Negative: Each validation rule violated individually
- Boundary: Min/max values, empty strings, null values
- Security: Injection attempts, auth bypass, rate limiting
- Error: Server errors, timeout, malformed responses
Use format: TC-{endpoint}-{category}-{number}"
## v0.13 — full analysis pipeline
"Analyze the provided API documentation completely before
generating any output. For each endpoint:
1. Extract: HTTP method, path, parameters, request body,
response codes, authentication requirements
2. Identify: All validation rules, business logic constraints,
and implicit requirements
3. Generate test cases with full traceability to source docs
4. Include: Test data recommendations, automation hints,
and risk-based priority assignments
5. Flag: Any gaps or ambiguities in the provided documentation"

The knowledge files grew with each version too. I added documents covering REST API testing standards, common HTTP status code scenarios, authentication testing patterns, and file upload edge cases. Each document gave Sarah more context to draw from without me having to repeat it in every conversation.
What Sarah Does Today
Sarah v0.13 is a Custom GPT that takes API documentation and produces structured testing artifacts. You provide a Swagger spec, a Postman collection, or a plain description of your API — and Sarah generates:
A test plan — organized by endpoint, with test strategies for each HTTP method and recommendations for coverage depth based on risk.
Detailed test cases — covering positive scenarios, negative scenarios, boundary conditions, security checks, and edge cases. Each test case includes preconditions, steps, expected results, and priority.
Executable test code — automation scripts formatted for common frameworks, ready to adapt to your project setup.
Modification support — ask Sarah to adjust specific sections, add scenarios, change formats, or increase coverage for particular endpoints. The conversation keeps context, so you don't start over each time.
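To make the test code artifact concrete, here is a minimal sketch in pytest style for the POST /api/users endpoint from earlier. This is my own illustration, not verbatim Sarah output; real generated scripts call the API with an HTTP client, but this version checks the documented password rules locally so it runs standalone.

```python
# Sketch of the style of automation code Sarah generates for POST /api/users.
# The helper mirrors the documented rules: min 8 chars, uppercase + number.
import re

def is_valid_password(password: str) -> bool:
    """Check the documented password rules locally."""
    return (
        len(password) >= 8
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"[0-9]", password) is not None
    )

# Negative test data: each case violates exactly one validation rule,
# matching the "one rule violated per negative case" convention.
NEGATIVE_PASSWORDS = {
    "too_short": "Ab1",
    "no_uppercase": "abcdef12",
    "no_digit": "Abcdefgh",
}

def test_negative_passwords_rejected():
    for name, pwd in NEGATIVE_PASSWORDS.items():
        assert not is_valid_password(pwd), name

def test_valid_password_accepted():
    assert is_valid_password("Passw0rd")
```

The generated scripts follow the same shape: one assertion per violated rule, so a failing test points directly at the rule that broke.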
The whole process takes a few minutes for a typical API with 10-15 endpoints. My team still reviews and adjusts the output — but they're reviewing, not writing from scratch. That's the difference.
Sarah is publicly available as a Custom GPT: Sarah - Smart Test Management Framework v0.13. You can try it with your own API documentation.
What I Learned Building This
A few observations from 13 versions of iteration.
Prompt engineering is a QA discipline. You're defining inputs, expected outputs, edge cases, and validation criteria — the same work we do when writing test cases. The difference is that the system under test is the AI model instead of an application. Every version of Sarah's instructions went through the same process: define what it should do, test it, find where it fails, fix it.
Knowledge files matter more than instructions. The system prompt tells Sarah how to behave. The knowledge files tell her what to know. When I added a structured document about REST API security testing patterns, the quality of security-related test cases improved across every conversation — without changing a single line of instructions.
Version control applies to prompts too. I kept every version of Sarah's system instructions and knowledge files. When a new version introduced a regression — like suddenly producing worse output for Postman collections — I could compare the changes and find what broke. Treating prompts like code isn't a metaphor. It's a practical requirement.
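In practice this meant diffing instruction files the same way I would diff code. A minimal sketch with Python's difflib — the file names and contents here are illustrative:

```python
# Compare two prompt versions to find what changed when hunting a regression.
import difflib

v12_instructions = [
    "Generate test cases in table format.",
    "Use format: TC-{endpoint}-{category}-{number}",
]
v13_instructions = [
    "Analyze the provided API documentation completely before generating output.",
    "Use format: TC-{endpoint}-{category}-{number}",
]

diff = list(difflib.unified_diff(
    v12_instructions, v13_instructions,
    fromfile="sarah_v0.12.txt", tofile="sarah_v0.13.txt",
    lineterm="",
))
print("\n".join(diff))
```

A unified diff of two instruction files is usually enough to spot the change that made Postman-collection output worse.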
The "good enough to edit" bar is the right target. I spent the first few versions trying to make Sarah's output perfect — ready to use without any changes. That was the wrong goal. The right goal was output that's good enough to review and adjust. That's where the time savings come from. My team went from writing test cases in 3-4 hours to reviewing and adjusting them in 30-45 minutes.
If you're building a Custom GPT for your work: keep a log of every failure you see. Not just "the output was bad" — write down exactly what was wrong and what the output should have been. That log becomes your specification for the next version.
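One way to keep that log structured — an illustrative schema I'd suggest, not something Custom GPTs require:

```python
# Each entry records what failed and what the output should have been,
# so the next version's instructions can target it directly.
failure_log = [
    {
        "version": "v0.6",
        "input": "Swagger spec for POST /orders",
        "problem": "No boundary cases generated for the quantity field",
        "expected": "Cases for quantity = 0, 1, max, max + 1, and negative",
    },
    {
        "version": "v0.9",
        "input": "Postman collection with auth flows",
        "problem": "IDs drifted from the TC-{endpoint}-{category}-{number} format",
        "expected": "All IDs follow the convention in the system instructions",
    },
]

# Group problems by version to see which change introduced which failures.
by_version = {}
for entry in failure_log:
    by_version.setdefault(entry["version"], []).append(entry["problem"])
```

Grouping by version is what turns a complaint list into a specification: every entry under a version becomes an explicit fix for the next one.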
What This Approach Doesn't Solve
I want to be direct about the limitations.
Sarah generates test cases based on the documentation you provide. If the documentation is incomplete — missing validation rules, unclear error handling, undocumented side effects — the test cases will have the same gaps. AI doesn't invent requirements. It works with what you give it.
The output also lacks business context. Sarah doesn't know that your payment endpoint is the highest-risk part of the system and needs deeper coverage. She doesn't know that a specific edge case caused an incident last quarter. She doesn't know your team's risk tolerance or your release schedule.
That context — the judgment about what matters, what's risky, and what to prioritize — is still the QA engineer's job. Sarah handles the volume and the structure. The engineer handles the decisions.
This is how I see AI fitting into QA work: less time on repetitive output, more time on analysis and judgment. The tool handles the parts that follow patterns. The people handle the parts that require thinking.
Why I Stopped Working on Sarah
I haven't touched Sarah in about six months. Not because it stopped working — it still does what it was built to do. I stopped because AI moved faster than my original idea.
As I worked with LLMs more, I realized that one general-purpose tool trying to handle everything — test plans, test cases, automation code, modifications — wasn't the strongest approach. What worked better was smaller, specialized agents, each doing one thing with high accuracy. One agent that extracts requirements from code. Another that writes Playwright tests from those requirements. Another that validates coverage. Another that analyzes failures.
An army of small agents, each focused on a single task, working together in a pipeline. That architecture produces better results than one tool trying to be everything at once.
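As a rough sketch of the idea — each agent here is a plain function with hypothetical names, where a real pipeline would make a specialized model call at each step:

```python
# Toy pipeline: extract requirements, write tests, validate coverage.
# The stubs only show the handoffs between agents, not real LLM calls.

def extract_requirements(api_doc: str) -> list[str]:
    # Stub agent: treat each bulleted line as one requirement.
    return [line.strip("- ").strip() for line in api_doc.splitlines()
            if line.strip().startswith("-")]

def write_tests(requirements: list[str]) -> list[str]:
    # Stub agent: emit one test stub per requirement.
    return [f"def test_{i}(): ...  # covers: {req}"
            for i, req in enumerate(requirements)]

def validate_coverage(requirements: list[str], tests: list[str]) -> bool:
    # Stub agent: coverage passes when every requirement has a test.
    return len(tests) >= len(requirements)

doc = """
- email must be a valid address
- password must be at least 8 characters
"""
requirements = extract_requirements(doc)
tests = write_tests(requirements)
assert validate_coverage(requirements, tests)
```

The point of the architecture is the narrow contract at each handoff: every agent can be tested, versioned, and replaced on its own.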
But that's another story.
Sarah taught me how to think about AI tooling for QA. The small agents approach is where that thinking led next.
If You're Considering Something Similar
I didn't start with a product vision. I started with a specific problem my team had and a question: can this tool help?
If you have a repetitive process in your work — something where you already know the pattern and you're executing it again and again — try building a Custom GPT for it. Start with your best prompt. Upload your reference documents as knowledge files. Test it against real work, not sample data. Write down every failure. Fix them one version at a time.
That's the whole method. Sarah is at version 0.13 not because I planned 13 versions, but because each version showed me something the previous one missed.
Start with version 0.1. See where it breaks. Fix it. That's version 0.2.
Pick one specific, repeatable task from your daily work. Build a Custom GPT for it. Test it against your last five real tasks — not hypothetical ones. If the output saves your team even an hour per week, you have something worth developing further.
Want to Learn More?
I write about AI in QA, test automation, and practical tool building on this blog. If you want to follow along — including future updates on Sarah — check the blog regularly or reach out directly.
Have you tried building a Custom GPT for your testing work? I'd be interested to hear what worked and where it fell short. Reach out on the contact page to share your experience.