Claude vs ChatGPT for Insurance Underwriting Memos
Claude's 200K context window makes it better for analyzing full submission packages in one pass. ChatGPT's Custom GPTs let you build reusable underwriting assistants. Both hallucinate coverage terms and make up numbers. We tested both on three real submission scenarios.
Underwriting memos are one of the most promising use cases for AI in insurance. The format is predictable: risk summary, loss history, coverage analysis, pricing rationale, conditions, and recommendation. Every memo follows roughly the same structure whether you’re writing up a small BOP or a $10M commercial property account. That structural predictability is exactly what large language models are good at.
We tested Claude (3.5 Sonnet and 3 Opus) and ChatGPT (GPT-4o) on three real submission scenarios to see how each handles the core sections of an underwriting memo. The results were informative: both models produce passable first drafts, both hallucinate in predictable ways, and each has specific strengths that matter depending on your workflow.
This is not a general “which AI is better” article. We’re focused specifically on underwriting memo drafting, because the requirements are narrow enough to make an honest comparison.
Why Underwriting Memos Work as an AI Use Case
Underwriting memos have three properties that make them AI-friendly:
- Structured format. Most carriers use a template with fixed sections. AI excels at filling in structured templates when you provide the raw facts.
- Predictable inputs. You’re working from ACORD applications, loss runs, inspection reports, and financial statements. The data types don’t change much between submissions.
- High volume. A commercial lines underwriter might write 15-20 memos per week, so shaving 30 minutes per memo adds up to meaningful time savings.
The catch is that underwriting memos contain numbers, coverage terms, and risk assessments that must be accurate. A memo that reads well but states the wrong loss ratio or mischaracterizes an exclusion is worse than no memo at all.
Test Methodology
We tested both models on three submission types that represent different levels of complexity:
- BOP (Business Owners Policy): A small restaurant, $500K building, $200K BPP, $1M GL. Straightforward risk with a clean loss history.
- Commercial property: A 120,000 sq ft warehouse, $15M TIV, sprinklered, with two prior water damage claims totaling $340K.
- Professional liability: A 50-person accounting firm, $5M/$5M limits, with one prior claim (settled for $180K) involving tax preparation errors.
For each scenario, we provided the same input data to both models: the ACORD application data (typed out, not a PDF scan), loss run summaries, and a prompt asking for a standard underwriting memo with specific sections.
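For context, here is the shape of that prompt. The wording below is an illustrative reconstruction, not the verbatim test prompt:

```text
Draft a standard commercial underwriting memo from the submission data below.
Use exactly these sections, in this order: Risk Summary, Loss History,
Coverage Analysis, Pricing Rationale, Conditions, Recommendation.
Use only facts stated in the submission. If the data needed for a section
is missing, say so rather than estimating.

[ACORD application data]
[Loss run summary]
```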
Head-to-Head Comparison Table
| Feature | Claude (3.5 Sonnet / 3 Opus) | ChatGPT (GPT-4o) |
|---|---|---|
| Context window | 200K tokens | 128K tokens |
| Full submission in one pass | Yes, handles 40-60 page packages | Yes for most, tight on large packages |
| Memo structure quality | Excellent; follows section templates precisely | Good; sometimes adds unsolicited sections |
| Risk characteristic accuracy | High when facts are supplied; does not infer | Moderate; tends to add “typical” risk factors not in the data |
| Loss history analysis | Accurate on stated facts; avoids speculation | Adds trend language even with only 2 data points |
| Pricing suggestions | Refuses unless given rate data | Offers “typical range” pricing (often wrong) |
| Custom reusable assistants | Projects feature (beta) | Custom GPTs (mature, shareable) |
| File upload handling | Direct paste or file upload | File upload with browsing |
| Monthly cost (Pro tier) | $20/month | $20/month |
| API cost per memo | ~$0.02-0.08 (Sonnet); ~$0.15-0.60 (Opus) | ~$0.03-0.10 (GPT-4o) |
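Those per-memo figures are easy to sanity-check with back-of-envelope math. A minimal sketch in Python, assuming roughly 10K input tokens for a mid-size submission, a 2K-token memo, and per-token rates as published at the time of testing (treat the rates as assumptions and check the current pricing pages):

```python
# Back-of-envelope per-memo API cost: token counts x per-token rates.
# Rates are assumptions based on published pricing at the time of testing.
RATES_PER_MTOK = {
    "claude-3-5-sonnet": (3.00, 15.00),  # (input, output) dollars per million tokens
    "gpt-4o": (2.50, 10.00),
}

def memo_cost(model: str, input_tokens: int = 10_000, output_tokens: int = 2_000) -> float:
    """Dollar cost of one memo: input charge plus output charge."""
    in_rate, out_rate = RATES_PER_MTOK[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(memo_cost("claude-3-5-sonnet"))  # 0.06 for a mid-size submission
print(memo_cost("gpt-4o"))             # 0.045
```

Input length dominates the cost, so a full 40-60 page package costs several times a small BOP submission, and Opus's higher per-token rates account for the wider Opus range in the table.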
Claude Strengths for Underwriting
Full Submission in One Context Window
Claude’s 200K token context window is the single biggest advantage for underwriting work. A typical commercial submission package (ACORD apps, loss runs, inspection report, financial statements, supplemental questionnaires) runs 40-60 pages. Pasted as text, that’s roughly 30K-50K tokens.
With Claude, you paste the entire package plus your memo template and get a single, coherent output. No chunking, no “here’s part 1, now here’s part 2.” The model sees every piece of data simultaneously, which means it can cross-reference the loss run against the application data without you manually pointing out connections.
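If you work through the API rather than the chat UI, the one-pass pattern is a single message containing the template plus the package. A minimal sketch using the Anthropic Python SDK; the model ID, file names, and system prompt are placeholders, not our exact setup:

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

memo_template = open("memo_template.txt").read()    # your carrier's section template
submission = open("submission_package.txt").read()  # ACORD data, loss runs, inspection report

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder; use your current Sonnet/Opus model ID
    max_tokens=4000,
    system="You draft preliminary underwriting memos. Use only facts stated in the submission.",
    messages=[{
        "role": "user",
        "content": f"Template:\n{memo_template}\n\nSubmission package:\n{submission}\n\n"
                   "Draft the memo. Flag any section where data is missing.",
    }],
)
print(response.content[0].text)
```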
In our commercial property test, Claude correctly noted that the two prior water damage claims (from the loss run) were consistent with the “flat roof, built 1998” detail from the application, and flagged roof condition as a key underwriting concern. We didn’t prompt it to make that connection.
Follows Formatting Instructions Precisely
When we gave Claude a 12-section memo template with specific formatting requirements (bullet points for risk characteristics, paragraph form for recommendation, table format for loss history), it followed every instruction. No extra sections, no reformatting to “improve” the layout.
ChatGPT, in contrast, added an “Executive Summary” section we didn’t ask for and converted our bullet-point specification into numbered lists in two of three tests.
Better at Structured Output
For underwriting memos specifically, Claude produces cleaner structured output. When asked for a table of loss history with columns for date, type, incurred, paid, and status, Claude formatted it consistently across all three tests. ChatGPT formatted the table correctly twice and used a different column order once.
ChatGPT Strengths for Underwriting
Custom GPTs for Reusable Workflows
ChatGPT’s Custom GPTs are the most practical feature for an underwriting team. You can build a “Commercial Property Underwriting Memo” GPT with your carrier’s specific template, standard language, appetite guidelines, and formatting preferences baked in. New underwriters can use it immediately without learning prompt engineering.
We built a Custom GPT for BOP underwriting that included our memo template, standard risk factors by class code, and formatting instructions. After setup (about 45 minutes), every memo came out in the right format without repeating instructions. Claude’s Projects feature offers something similar but is less mature and not as easily shared across a team.
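Custom GPTs live in the ChatGPT UI rather than the API, but the scripted equivalent is a fixed system prompt reused on every call. A sketch with the OpenAI Python SDK; the instruction text is an illustrative condensation of what went into our GPT's configuration, not the verbatim setup:

```python
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()

# Reusable instructions: the API-side analogue of a Custom GPT's configuration.
BOP_MEMO_INSTRUCTIONS = """You draft BOP underwriting memos.
Use exactly these sections: Risk Summary, Loss History, Coverage Analysis,
Pricing Rationale, Conditions, Recommendation. Bullet points for risk
characteristics; paragraph form for the recommendation. Use only facts
stated in the submission; never estimate premium or loss ratios."""

def draft_memo(submission_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": BOP_MEMO_INSTRUCTIONS},
            {"role": "user", "content": submission_text},
        ],
    )
    return response.choices[0].message.content
```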
Web Browsing for Market Context
ChatGPT can browse the web during a conversation. For underwriting, this means it can pull current industry data: OSHA violation records, news about the applicant, county property records, and similar context that enriches a memo.
Claude cannot browse the web. Any market context or industry data has to be pasted in manually.
In our professional liability test, ChatGPT found a relevant news article about regulatory changes affecting the accounting firm’s practice area and flagged it as a consideration. That’s genuinely useful context that would have taken 10 minutes of manual research.
Broader User Familiarity
This sounds trivial, but it matters for adoption. Most insurance professionals who’ve used AI have used ChatGPT. Training time is lower, resistance is lower, and there are more YouTube tutorials and community guides. If you’re rolling out AI memo drafting to a team of 20 underwriters, familiarity reduces implementation friction.
Head-to-Head: Specific Memo Sections
Risk Summary
Claude: Produced a concise, factual risk summary that stuck to the data we provided. For the warehouse test, the summary was 180 words covering occupancy, construction, protection class, TIV, and key exposures. No embellishment.
ChatGPT: Produced a longer risk summary (280 words) that included “industry-typical” risk factors we didn’t provide, such as “warehouses of this type commonly face inventory shrinkage and forklift damage exposures.” Those may be true, but they weren’t in our submission data, and stating them without evidence in a memo is a problem.
Winner: Claude. Underwriting memos should state what’s in the submission, not what’s “typical.”
Loss History Analysis
Claude: Presented the loss data in a clean table and wrote a two-sentence analysis: the two water damage claims totaling $340K over five years, with a note that both were related to the flat roof and that roof age (28 years) warranted inspection or replacement as a condition.
ChatGPT: Presented the same table but added trend language: “The loss trend suggests an escalating pattern of water intrusion.” With only two data points three years apart ($140K and $200K), calling this an “escalating pattern” is a stretch. It also calculated a loss ratio without us providing premium data (it estimated premium, which was wrong by about 35%).
Winner: Claude. Loss history analysis needs to be precise, and fabricating trend narratives from two data points is exactly the kind of error that makes AI output dangerous in underwriting.
Pricing Recommendation
Claude: Stated that it could not recommend pricing without rate tables, loss costs, or target loss ratios. It offered to structure a pricing section if we provided the numbers.
ChatGPT: Offered a “suggested premium range” of $45,000-$55,000 for the commercial property risk based on “comparable risks in the market.” We have no idea where those numbers came from. The actual market rate for this risk was approximately $38,000.
Winner: Claude. Refusing to guess is better than guessing wrong. A fabricated pricing recommendation in a memo could mislead a pricing actuary or create E&O exposure.
Declination Rationale
We asked both models to draft a declination memo for a submission that fell outside appetite (a restaurant with three fire losses in five years).
Claude: Wrote a professional, three-paragraph declination that referenced the specific loss history, stated the appetite guideline (we provided it), and recommended the broker resubmit if the insured implements specific risk improvements. Clean and appropriate.
ChatGPT: Wrote a similar declination but included language about “our commitment to helping the insured find appropriate coverage” and suggested specific alternative markets. The alternative market suggestions were real carrier names, but we have no idea if those carriers would actually write this risk. Recommending specific competitors in a declination memo is unusual and potentially problematic.
Winner: Claude. The declination was professional and stayed in scope.
Common Failure Modes
Both models fail in predictable ways when drafting underwriting memos. Knowing these patterns lets you catch errors in review.
Hallucinated Loss Ratios
Both models will calculate loss ratios if you provide loss data, even without premium data. They estimate premium based on “typical” rates, which are often wrong. ChatGPT does this more aggressively; Claude does it only when prompted.
Fix: Never include loss ratio calculations in AI-drafted memos unless you provide the actual premium figure.
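If any part of your workflow is scripted, make premium an explicit required input so a missing denominator fails loudly instead of being estimated. A minimal sketch; the figures in the example are illustrative:

```python
def loss_ratio(incurred_losses: float, earned_premium: float | None) -> float | None:
    """Incurred loss ratio, or None when premium is unknown.

    Never let the model (or this code) estimate premium: a guessed
    denominator produces a fabricated ratio that looks authoritative.
    """
    if earned_premium is None or earned_premium <= 0:
        return None  # omit the loss-ratio line from the memo entirely
    return incurred_losses / earned_premium

# $340K incurred against $425K earned premium -> 0.80 (80%)
assert loss_ratio(340_000, 425_000) == 0.8
assert loss_ratio(340_000, None) is None
```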
Made-Up Industry Statistics
Both models cite industry statistics that sound plausible but are unverifiable. “Restaurants experience an average of 1.2 fire incidents per 100 policy years” is the kind of statistic that appears in AI output but doesn’t appear in any published source we could find.
Fix: Delete any industry statistics from AI output unless you can verify them against a published source (NFPA, ISO, AM Best).
Incorrect Coverage Terms
Both models occasionally use coverage terms loosely. In our professional liability test, ChatGPT referred to a “per-claim deductible” when the policy had a “self-insured retention,” which has different legal implications. Claude made a similar error in one test, writing “aggregate limit” when the policy carried separate per-occurrence and aggregate limits.
Fix: Verify every coverage term against the actual policy language. AI models treat these terms as interchangeable; underwriters and claims departments do not.
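A cheap guardrail is a keyword pass that flags coverage terms in the draft for manual verification against the policy. It cannot tell you which term is correct, only where to look. A rough sketch; extend the watch list per line of business:

```python
import re

# Terms AI drafts tend to swap for one another; extend per line of business.
TERMS_TO_VERIFY = [
    "deductible",
    "self-insured retention",
    "per-occurrence",
    "aggregate limit",
    "retroactive date",
]

def flag_coverage_terms(memo: str) -> list[tuple[str, int]]:
    """Return (term, count) for every watch-list term found in the memo."""
    hits = []
    for term in TERMS_TO_VERIFY:
        count = len(re.findall(re.escape(term), memo, flags=re.IGNORECASE))
        if count:
            hits.append((term, count))
    return hits

# Every hit gets checked against the actual policy language before filing.
```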
Over-Confident Recommendations
Both models write recommendations with more confidence than the data supports. “We recommend binding this risk at the quoted terms” appeared in ChatGPT’s output even though the prompt was “draft a preliminary underwriting memo for review.” Claude was better about hedging but still used “this risk meets our underwriting guidelines” without qualification.
Fix: Add explicit instructions about the memo’s purpose (preliminary review, not binding authority) and the appropriate level of confidence.
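A reusable way to do this is a fixed boilerplate block appended to every memo prompt. The wording below is illustrative, not a tested incantation:

```text
PURPOSE: This is a PRELIMINARY draft for underwriter review. It carries
no binding authority. Phrase the recommendation as "subject to review,"
and list open questions rather than asserting that guidelines are met.
```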
Practical Workflow Recommendations
Use Claude When
- You need to analyze a full submission package (40+ pages) in one pass
- Formatting precision matters (carrier-specific templates)
- The memo will be reviewed by compliance or legal (Claude’s conservative output reduces risk)
- You’re working on complex risks where sticking to stated facts is critical
Use ChatGPT When
- You’re building reusable workflows for a team (Custom GPTs)
- You need supplemental market research during memo drafting
- You’re training junior underwriters who are already familiar with ChatGPT
- The submission is straightforward and you want speed over precision
Use Both When
- Draft with Claude for the core memo, then use ChatGPT to research market context or competitor appetite
- Build the Custom GPT template in ChatGPT, but validate complex risks by pasting the same data into Claude for a second opinion (a minimal sketch of this pattern follows the list)
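A minimal sketch of that second-opinion pattern, assuming both SDKs are installed and API keys are set; it sends the same prompt to each model and returns both drafts for side-by-side human review:

```python
import anthropic
from openai import OpenAI

def second_opinion(prompt: str) -> dict[str, str]:
    """Send the same memo prompt to both models; a human compares the drafts."""
    claude = anthropic.Anthropic().messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model ID
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}],
    )
    gpt = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "claude": claude.content[0].text,
        "gpt-4o": gpt.choices[0].message.content,
    }
```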
Always Verify
Regardless of which model you use, every AI-drafted underwriting memo needs human review of the following (a rough automated pre-check is sketched after the list):
- All dollar amounts and percentages
- Coverage terms and policy references
- Loss ratios and any calculated metrics
- Industry statistics and market claims
- The recommendation section (does it match your actual assessment?)
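None of that review can be automated away, but a crude pre-check helps triage: flag every dollar figure in the draft that never appears in the source submission. A sketch, assuming both documents are available as plain text:

```python
import re

DOLLAR = re.compile(r"\$[\d,]+(?:\.\d+)?(?:\s?[KMB])?", re.IGNORECASE)

def unverified_amounts(memo: str, submission: str) -> list[str]:
    """Dollar figures in the memo that never appear verbatim in the submission.

    Verbatim matching is deliberately strict: "$340K" vs "$340,000" will be
    flagged, which is fine, because the goal is to force a human look.
    """
    source_amounts = set(DOLLAR.findall(submission))
    return sorted(set(DOLLAR.findall(memo)) - source_amounts)

# Anything returned here was either reformatted or invented by the model;
# both cases deserve a check against the loss runs and application.
```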
Cost Comparison for Underwriting Teams
For a team writing 50 memos per week:
| Cost Factor | Claude | ChatGPT |
|---|---|---|
| Pro subscription (per seat) | $20/month | $20/month |
| API cost (50 memos/week) | $1-4/week (Sonnet) | $1.50-5/week (GPT-4o) |
| Custom GPT setup | N/A (Projects, limited) | One-time, ~45 min per template |
| Training time | 2-4 hours per underwriter | 1-2 hours (more familiar UI) |
The subscription costs are identical. API costs vary based on memo length and model choice. The real cost difference is in training and workflow setup, where ChatGPT’s Custom GPTs provide a meaningful advantage for teams.
The Bottom Line
Neither model replaces an underwriter. Both produce first drafts that save 20-40 minutes per memo, and both require careful review before anything goes into a file.
Claude is the better choice for complex risks and situations where accuracy matters more than speed. Its larger context window, conservative output style, and willingness to say “I don’t have enough data” make it the safer tool for underwriting work.
ChatGPT is the better choice for teams that need reusable workflows, supplemental research, and faster onboarding. Custom GPTs are a genuinely useful feature that Claude hasn’t matched yet.
If your budget allows it, the best approach is to have access to both and use each where it’s strongest. At $20/month per seat, the cost of both is trivial compared to the time savings on memo drafting.