
AI News

30 Oct 2025


How a prompt politeness LLM accuracy study improves answers

A prompt politeness LLM accuracy study shows that tone can raise answer accuracy and improve prompt design

A new prompt politeness LLM accuracy study tested how tone changes model answers on multiple-choice tasks. Researchers rewrote 50 questions into five tones, from very polite to very rude, and ran each set through ChatGPT-4o ten times. Results showed rude phrasing scored higher than polite phrasing by up to four percentage points, with statistically significant differences across several tone pairs.

What the prompt politeness LLM accuracy study tested

The study asked a clear question: does a polite or rude tone change how well a large language model answers? The team built a small but focused dataset of 50 base questions in math, science, and history. Each question had four answer choices, only one of which was correct. The team then wrote five versions of each question in different tones:
  • Very Polite
  • Polite
  • Neutral
  • Rude
  • Very Rude
This produced 250 prompts. All prompts used short tone prefixes before the same core question. For example, Very Polite used “Would you be so kind as to…” while Very Rude used “You poor creature, do you even know how to solve this?” The aim was to change tone only, not content.

    Dataset and tone design

    The base questions were made to be moderate to hard. Many required two or more steps to solve. This helped reduce ceiling effects. Each question retained the same structure and answer choices across tones. The tone was only a short prefix added to the question text. Examples of tone prefixes:
  • Very Polite: “Would you be so kind as to solve the following question?”
  • Polite: “Please answer the following question:”
  • Neutral: no prefix
  • Rude: “If you’re not completely clueless, answer this:”
  • Very Rude: “Hey gofer, figure this out.”
    The team kept the rest of the prompt instructions the same across trials. They asked for only the letter of the answer (A, B, C, or D) and no explanation. They also added a reset line to start each prompt fresh.
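    To make the setup concrete, here is a minimal sketch of how the 250 prompt variants could be assembled. The prefix strings follow the paper's examples, but the reset line, the base-question format, and all function and variable names are placeholders for illustration, not the authors' code.

```python
# Minimal sketch of assembling the 250 prompt variants (50 questions x 5 tones).
# Prefix strings follow the paper's examples; everything else is assumed.

TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to solve the following question?",
    "polite": "Please answer the following question:",
    "neutral": "",
    "rude": "If you're not completely clueless, answer this:",
    "very_rude": "Hey gofer, figure this out.",
}

OUTPUT_RULE = "Respond with only the letter of the correct answer (A, B, C, or D). No explanation."
RESET_LINE = "Ignore any previous conversation and start fresh."  # assumed wording

def build_prompts(base_questions):
    """base_questions: list of strings, each a full question with its four choices."""
    prompts = []
    for tone, prefix in TONE_PREFIXES.items():
        for i, question in enumerate(base_questions):
            parts = [RESET_LINE, prefix, question, OUTPUT_RULE]
            text = "\n".join(p for p in parts if p)  # drop the empty neutral prefix
            prompts.append({"tone": tone, "question_id": i, "text": text})
    return prompts  # 250 prompt variants in total
```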

    How the team measured accuracy

    The researchers tested ChatGPT-4o through the API. They ran the 50-question set ten times for each tone. They computed accuracy per run. Then they averaged the accuracy and measured the range. Finally, they used paired sample t-tests to compare tones. This matters because the same 50 questions were used under each tone. So a paired test fits the design and checks if differences go beyond random noise.
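    A rough sketch of that measurement loop is below, assuming a placeholder `ask_model` function that wraps whatever API call you use (for example, a ChatGPT-4o chat completion that returns a single letter). It illustrates the procedure described above, not the authors' code.

```python
# Run the full question set several times for one tone and record per-run accuracy.
# `ask_model` and `answer_key` are placeholders supplied by the caller.

def run_accuracy(prompts_for_tone, answer_key, ask_model, n_runs=10):
    """prompts_for_tone: list of {'question_id', 'text'}; answer_key: id -> 'A'..'D'."""
    per_run_accuracy = []
    for _ in range(n_runs):
        correct = 0
        for p in prompts_for_tone:
            reply = ask_model(p["text"]).strip().upper()
            letter = reply[:1]  # tolerate replies like "B." or "B)"
            if letter == answer_key[p["question_id"]]:
                correct += 1
        per_run_accuracy.append(correct / len(prompts_for_tone))
    return per_run_accuracy  # e.g., ten values to average and range-check
```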

    Key results: rudeness gave a small edge

    The results showed a clear pattern. Impolite prompts performed better than polite prompts. The difference was not huge, but it was consistent and statistically significant in many pairwise tests.

    Accuracy by tone

    Average accuracy across ten runs:
  • Very Polite: 80.8% (range 80–82)
  • Polite: 81.4% (range 80–82)
  • Neutral: 82.2% (range 82–84)
  • Rude: 82.8% (range 82–84)
  • Very Rude: 84.8% (range 82–86)
    The study found a near-stepwise rise in accuracy from polite to rude. The Very Rude tone had the highest average and the widest score range, topping out at 86%.

    Statistical significance

    The researchers ran paired sample t-tests between every pair of tones. For eight pairs, the p-value was below 0.05. In those pairs, polite or very polite tones were worse than neutral, rude, or very rude. Neutral beat polite, but lost to very rude. Rude lost to very rude as well. The most consistent winner was very rude language, which beat all other tones with strong significance.
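    For readers who want to reproduce this kind of comparison, here is a small sketch of the pairwise tests using scipy. It pairs per-run accuracy scores across tones, which is one reasonable reading of the paper's setup; pairing per-question correctness is another. The function name and data layout are illustrative, not taken from the paper.

```python
# Pairwise paired t-tests over per-run accuracy scores for each pair of tones.
from itertools import combinations
from scipy.stats import ttest_rel

def compare_tones(accuracy_by_tone, alpha=0.05):
    """accuracy_by_tone: dict mapping tone name -> list of per-run accuracies (equal lengths)."""
    results = []
    for tone_a, tone_b in combinations(accuracy_by_tone, 2):
        stat, p_value = ttest_rel(accuracy_by_tone[tone_a], accuracy_by_tone[tone_b])
        results.append((tone_a, tone_b, stat, p_value, p_value < alpha))
    return results  # ten tone pairs; the flag marks the ones below alpha
```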

    Why might tone change answers?

    This result feels odd. Why would a rude prefix improve answers? The study does not claim a full cause. But it offers ideas and points to prior work. Possible reasons:
  • Shorter, sharper instructions may reduce noise. Polite phrases add extra words that do not help the model solve the task. Fewer filler words may lower confusion.
  • Direct or forceful tones may cue a “do the task now” mode. The model may compress its internal steps when the prompt sounds firm.
  • Token likelihood and perplexity may matter. Words like “please” and “kindly” may create linguistic patterns that the model treats as small talk. Rude imperatives may align better with command-style prompts.
  • Model alignment may have shifted. Newer models may react differently to tone than older ones. Safety layers may also shape how tone is handled.
  • Instruction clarity likely dominates tone. The prompt also asked for only the letter and no explanation. This narrow output rule may amplify small tone effects.
    These are hypotheses. The study calls for more tests on prompt length, token choice, and cultural variation.

    How does this compare to earlier research?

    The findings differ from some past reports that linked rudeness to worse results. A 2024 paper by Yin and colleagues found that impolite prompts often reduced accuracy for older models like ChatGPT-3.5 and Llama2-70B. But even in their tests on ChatGPT-4, the rudest prompt was not always the worst. Their accuracy range across tones was tight, and not strictly monotonic. There are key differences between the two works:
  • Model generation: This study used ChatGPT-4o. Prior work often used earlier generations or open models.
  • Rudeness strength: The phrases differ. Yin et al. included harsher insults in some prompts. The new study used rude phrasing, but not the most extreme forms.
  • Task set: Both works used multiple-choice, but with different questions and variability.
  • Sampling: This study ran ten repeated trials per tone and used paired t-tests across the same items.
    The conclusion: tone effects exist, but they depend on the model, wording, and setup. The prompt politeness LLM accuracy study at hand suggests that newer models may be less hurt by rudeness, and may even benefit from it in narrow testing.

    Practical takeaways for prompt writers

    The lesson is not “be rude.” The lesson is “be direct.” You want clarity without harm. Here is what to try.

    Use firm, clear tone without insults

    Give short, assertive instructions. Avoid filler. Skip “Would you kindly…” and similar phrases. Use active verbs.
  • Say: “Answer with A, B, C, or D only.”
  • Say: “Solve in three steps, then give the final letter.”
  • Avoid: “Could you please, if you don’t mind, provide…”
    Keep instructions tight and specific

    The study asked for letter-only output and no explanation. This can reduce off-topic text. It may also prime the model to focus on the decision.
  • Define required output format once.
  • Avoid extra chat or small talk in the prompt.
  • Use one task per prompt when you can.
    Control for randomness

    Models have some variance between runs. If you test prompts, do multiple runs. Log seeds if possible.
  • Repeat runs per prompt.
  • Use the same items for fair comparisons.
  • Run paired tests to compare tones or formats.
    Evaluate across models

    Tone sensitivity is not universal. One model may show an effect; another may not.
  • Test at least two models for important tasks.
  • Weigh costs, latency, and accuracy together.
  • Expect the effect size to be small but noticeable.
    Limitations and open questions

    The authors note several limits.

    Small dataset

    The study used 50 base questions and five tone variants. This is solid for a pilot, but small for broad claims. More items across more domains would help.

    Single task type

    The test used multiple-choice accuracy. That is clean to score, but it does not measure reasoning quality, safety, or faithfulness. Future work should test free-form answers, coding, and step-by-step math.

    Model coverage

    The main results are for ChatGPT-4o. Early checks with Claude and another GPT version suggested a tradeoff between cost and accuracy. But the paper does not present full cross-model statistics yet.

    Tone definitions

    “Polite” and “rude” are cultural. The phrases used capture only a slice of how tone appears in real life. Cross-lingual tests and more nuanced tone tags would be useful.

    Ethics and product design

    We should not promote hostility. Demeaning language can harm users, teams, and brands. The safe path is to borrow the performance benefits of clarity without the harm of insults. Use direct, crisp instructions. Avoid toxic words. If you simulate tone in testing, keep it in the lab—not in production. Consider guardrails:
  • Set style guides for prompts shipped to users.
  • Block toxic tokens in user-visible text.
  • Test “firm but respectful” prompts that keep the gains without the risk.
    How to run your own tone test

    If you want to check tone effects in your workflow, try a simple plan.

    Step 1: Build a small, tough test set

    Create 40–100 items for your real task. Make sure the answers are known. Mix difficulty levels. Keep the same content across tones.

    Step 2: Write three tone variants

    Use direct prefixes only. For example:
  • Polite: “Please answer the following question.”
  • Neutral: no prefix.
  • Firm: “Answer the question now.”
    Avoid insults. This keeps tests ethical and closer to production.

    Step 3: Fix output rules

    Specify the output format. For example: “Reply with A/B/C/D only. No explanation.”
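    As an illustration (not taken from the paper), a strict output rule can be paired with a tolerant parser when you grade replies; the rule text and function name here are assumptions.

```python
import re

# Illustrative output rule plus a tolerant parser for grading model replies.
OUTPUT_RULE = "Reply with A, B, C, or D only. No explanation."

def extract_choice(reply: str):
    """Return 'A'-'D' if the reply contains exactly one standalone choice letter, else None."""
    matches = re.findall(r"\b([ABCD])\b", reply.strip().upper())
    return matches[0] if len(matches) == 1 else None
```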

    Step 4: Run repeated trials

    Run each variant several times. Change the order of items. If you can, fix a seed to reduce randomness.
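    A minimal sketch of such a trial loop, assuming the same placeholder `ask_model` as above and a fixed seed so the shuffled item order is reproducible across reruns:

```python
import random

def run_trials(items, ask_model, n_runs=5, seed=42):
    """items: list of {'question_id', 'text'}. Returns the raw replies from each run."""
    rng = random.Random(seed)  # fixed seed: orders differ per run but are reproducible overall
    all_runs = []
    for _ in range(n_runs):
        order = list(items)
        rng.shuffle(order)  # present items in a different order each run
        replies = [(item["question_id"], ask_model(item["text"])) for item in order]
        all_runs.append(replies)
    return all_runs
```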

    Step 5: Analyze

    Compute accuracy per run and per tone. Use paired t-tests across tones on the same items. If the difference is significant and practical, adopt the winning tone.

    What this means for teams and tools

    Small wording choices still matter. This study suggests that newer LLMs can be nudged by tone, and that strong, clear instructions help. The gains are modest, but they are real and cheap. They stack with other prompt tactics, like chain-of-thought, few-shot examples, and structured output. Consider a layered approach:
  • First, make the task and output format explicit.
  • Second, remove filler and chit-chat.
  • Third, use a firm tone that signals urgency and focus.
  • Fourth, measure and iterate with a stable test set.
    These steps improve reliability without adding cost or latency. They also keep your UX respectful.

    Bottom line

    The study’s main result is simple: tone shifts accuracy by a few points, and ruder phrasing outperformed polite phrasing on a controlled multiple-choice test with ChatGPT-4o. You do not need to insult a model to get better answers. You do need to be short, clear, and direct. Treat these results as a nudge to tighten instructions and reduce noise. If you plan your own prompt politeness LLM accuracy study, test firm, respectful wording, and you may see the same gains.

    (Source: https://www.arxiv.org/pdf/2510.04950)


    FAQ

    Q: What did the prompt politeness LLM accuracy study test?
    A: The prompt politeness LLM accuracy study tested whether changing the politeness or tone of prompts affects large language model accuracy on multiple-choice questions. The researchers rewrote 50 base questions into five tone variants and evaluated 250 prompts with ChatGPT-4o.

    Q: What tones were included and how were they applied?
    A: The study used five tone levels—Very Polite, Polite, Neutral, Rude, and Very Rude—implemented as short prefix phrases added to the same core question. Examples ranged from “Would you be so kind as to…” for Very Polite to “Hey gofer, figure this out.” for Very Rude, while the question stem and answer choices remained identical across variants.

    Q: What were the main accuracy results across tones?
    A: Across ten runs per tone with ChatGPT-4o, average accuracies were 80.8% (Very Polite, range 80–82), 81.4% (Polite, range 80–82), 82.2% (Neutral, range 82–84), 82.8% (Rude, range 82–84), and 84.8% (Very Rude, range 82–86). The prompt politeness LLM accuracy study reported this near-stepwise rise from polite to rude and found several comparisons to be statistically significant.

    Q: How did the researchers test statistical significance?
    A: They used paired sample t-tests because the same 50 questions were presented under each tone and each tone was run ten times to generate per-run accuracy scores. The paper reports eight tone pairs with p-values below 0.05, indicating significant differences in those comparisons.

    Q: Why might rude prompts have produced slightly better accuracy?
    A: The authors offer hypotheses rather than causal claims, including that shorter, more direct prefixes reduce filler and confusion and that command-like phrasing may have different token likelihoods or perplexity that cue task-focused behavior. They also note that model alignment and training differences could make newer models respond differently to tone, but these explanations require further testing.

    Q: What limitations should readers keep in mind?
    A: Key limitations are the modest dataset of 50 base questions (250 variants), the single-task focus on multiple-choice accuracy, and the primary reliance on ChatGPT-4o with only preliminary checks on other models. The paper also emphasizes that the operationalization of “politeness” here is narrow and culturally specific, limiting generalizability.

    Q: What practical takeaways should prompt writers use from this study?
    A: The prompt politeness LLM accuracy study suggests being short, explicit, and firm—specifying the required output format (for example, “reply with A/B/C/D only”) and avoiding filler phrases—to modestly improve multiple-choice accuracy. It also recommends running repeated trials, using the same items for fair comparisons, and testing across models before adopting prompt changes in production.

    Q: Should developers use rude prompts in production?
    A: No; the authors explicitly caution against deploying hostile or demeaning language in user-facing systems despite the lab finding that rude phrasing gave a small accuracy edge. They recommend achieving clarity and firmness without insults and applying guardrails such as style guides and blocking toxic tokens for production prompts.
