How to design a coding assessment that holds up against AI-assisted candidates

Summary

💡 Key takeaways

No single layer is sufficient. Question design alone fails because some AI-resistant questions are operationally expensive to generate. Integrity infrastructure alone fails because candidates find ways to bypass it. Behavioural analysis alone fails because false positives create an unfair candidate experience. Human review alone fails because reviewers can't reliably distinguish AI-assisted code from human code without supporting signals.
The four layers reinforce each other. Each layer catches failure modes the other layers cannot address. Question design reduces the number of cases that need integrity infrastructure to catch. Integrity infrastructure reduces the cases that need behavioural analysis. Behavioural analysis reduces the cases that need human review. Human review provides final discipline for cases where the technical layers produce ambiguous signal.
Standard algorithmic questions have lost discriminative power. AI models solve these in seconds. Question libraries should shift toward context-dependent, interactive, judgment-heavy, debugging-format, and domain-specific question types - categories where AI capability is meaningfully lower.
The system requires continuous maintenance: quarterly question library audits, quarterly integrity infrastructure penetration testing, monthly behavioural signal tuning, monthly review consistency audits, quarterly fairness audits, and annual outcome measurement. Coding assessment integrity is an operational discipline, not a one-time setup.

The short answer

A coding assessment that holds up against AI-assisted candidates is not a coding assessment with a proctoring tool layered on top. It's an assessment designed across four reinforcing layers: question design that resists AI-assisted solving even when AI is available, integrity infrastructure that prevents AI usage during the assessment itself, behavioural and pattern analysis that flags suspicious patterns post-submission, and structured human review of flagged sessions before consequential decisions are made.

Most hiring teams under-invest in three of these four layers and over-rely on the fourth. The over-relied-on layer varies - some teams trust proctoring alone, some trust harder questions alone, some trust manual review alone - but the failure mode is the same: a single layer of defence in an environment where AI candidates have multiple attack vectors.

This guide walks through the operational sequence for designing a coding assessment that genuinely holds up. The order matters: question design first, integrity infrastructure second, behavioural analysis third, human review fourth. Skipping or weakening any single layer compromises the others.

Why this is harder than it looks

Three forces have made coding assessment integrity meaningfully more difficult than it was three years ago:

The first is AI tool ubiquity. ChatGPT, Claude, Copilot, Gemini, Perplexity, and other steadily more capable AI assistants are now available on every device a candidate might own, often through multiple integration patterns: browser extensions, desktop apps, system-tray utilities, keyboard shortcuts, mobile apps, IDE integrations, and voice input. The candidate doesn't need to deliberately seek out AI tools; the tools are present by default in the candidate's working environment.

The second is AI capability on coding tasks specifically. Most standard coding interview questions - string manipulation, array algorithms, data structure implementation, basic system design - are well within the capability range of current AI models. The model produces correct, idiomatic code with clear comments, sometimes in cleaner style than experienced human developers would write. The gap between AI-assisted candidate output and strong human candidate output has narrowed substantially.

The third is the cost of false negatives. A candidate who used AI assistance and passes the assessment becomes a hire who was evaluated on capabilities they don't actually have. The cost is measured in mis-hire consequences - performance gaps that surface in the first 90 days, ramp time that extends because foundational skills are weaker than assessed, team capacity decisions made on inaccurate signal. False positives (rejecting strong candidates wrongly flagged) are also costly, but the operational pattern most teams should be optimising against is false negatives, not false positives.

The implication: designing for AI-resistance is no longer optional for hiring teams whose coding assessments produce consequential decisions. It's the baseline requirement.

Layer 1 - Question design that resists AI even when AI is available

The first layer is question design that makes AI-assisted solving harder, regardless of whether the integrity infrastructure successfully prevents AI access during the assessment. The principle: questions whose answers depend on context, judgment, or interactive reasoning are meaningfully harder for AI to solve well than questions that test pattern-matching against known algorithms.

Categories of questions that resist AI well:

Context-dependent questions. Questions that require the candidate to evaluate or modify code in the context of specific business constraints, performance requirements, or system characteristics. AI models can produce generic correct answers; they struggle to produce contextually appropriate answers when the context shifts the optimal solution. Example category: given this codebase architecture and these specific operational constraints, identify and fix the performance bottleneck.

Interactive reasoning questions. Questions that unfold across multiple steps, with the candidate's decisions at each step shaping the subsequent questions. AI models can solve the first step competently; multi-step interactive sequences expose whether the candidate genuinely understands the underlying principles or is producing AI-generated code without comprehension. Example category: here's a partial implementation; explain why the approach is flawed, propose three alternatives, and implement the one you'd choose given specific constraints.

Judgment-heavy questions. Questions that don't have a single correct answer but require the candidate to defend their approach against specific tradeoff scenarios. AI models can produce defensible answers; they struggle when the candidate is asked to explain why their approach is better than alternatives the interviewer specifies. Example category: implement this in two different ways, then defend which approach is appropriate for a system handling X requests per second with Y latency constraint.

Debugging-format questions. Questions that present existing buggy code and ask the candidate to identify, explain, and fix the bug. AI models can find bugs competently when given the entire codebase; they struggle when the bug requires understanding non-obvious system interactions or when the candidate is asked to walk through their debugging reasoning. Example category: this code passes its unit tests but fails in production with this specific symptom; walk through your debugging approach.

Domain-specific questions calibrated to the actual role. Questions drawn from the specific technical domain the candidate would work in - payments systems for a fintech role, distributed systems for a backend infrastructure role, real-time data for a streaming-platform role. AI models perform well on generic problems and meaningfully worse on domain-specific problems where the right answer depends on understanding the specific operational realities of that domain.

Categories of questions that resist AI poorly:

Standard algorithmic puzzles. Reverse a linked list, find the second largest element, implement merge sort. These are exactly what AI models are trained on extensively. A candidate with AI access produces a correct, well-commented solution in seconds. The discriminative power of these questions has dropped substantially.

Standard data structure implementations. Implement a hash map, implement a binary search tree, implement a queue using two stacks. Same problem - AI models produce textbook-quality implementations on demand.

Standard system design questions. Design Twitter, design Uber, design a URL shortener. These have been written about extensively; AI models produce competent design walkthroughs that are difficult to distinguish from strong human candidates.

The operational discipline: audit your question library. For each existing question, ask: could an AI-assisted candidate produce a strong answer in 30 seconds? If yes, the question's discriminative power has dropped substantially. The question library needs to shift toward context-dependent, interactive, judgment-heavy formats.
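The audit question above can be recorded per question and used to sort the library. What follows is a minimal sketch, not a prescribed tool: the `AuditResult` record, its field names, and the three outcome labels are hypothetical, and it assumes someone has already run each question past a current AI model and timed the result.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """One audited question (hypothetical record; all field names are illustrative)."""
    question_id: str
    category: str                # e.g. "algorithmic", "debugging", "context-dependent"
    ai_strong_answer: bool       # did the AI-assisted trial produce a strong answer?
    ai_seconds_to_answer: float  # how long the AI-assisted trial took


def triage(result: AuditResult, fast_seconds: float = 30.0) -> str:
    """Sort one question into 'keep', 'review' or 'reposition'.

    The 30-second default mirrors the audit question in the text;
    the labels themselves are an assumption of this sketch.
    """
    if result.ai_strong_answer and result.ai_seconds_to_answer <= fast_seconds:
        # AI produces a strong answer almost immediately:
        # the question no longer discriminates for consequential decisions.
        return "reposition"
    if result.ai_strong_answer:
        # AI eventually gets there: a senior engineer should look again.
        return "review"
    return "keep"
```

Running `triage` over the whole library gives a rough count of how much of it still carries signal, which is usually the number that makes the rebuild effort concrete.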

This shift is operationally expensive. Generating high-quality context-dependent questions requires senior engineering judgment, not just question-bank purchasing. Most assessment vendors' default question libraries are heavily weighted toward the categories AI solves well. Building a question library that resists AI is genuinely a competitive moat - and it's the foundation of everything else.

Layer 2 - Integrity infrastructure that prevents AI usage during the assessment

The second layer is the integrity infrastructure that prevents the candidate from accessing AI tools during the assessment itself. The architectural choice between browser-only proctoring and OS-level proctoring determines what's actually possible at this layer.

Browser-only proctoring covers the application layer - what's happening in the browser tab where the assessment runs. It can detect tab switches, copy-paste activity within the browser, focus changes, and some browser-extension AI tools. What it cannot detect, structurally: AI assistants running outside the browser (system-tray apps, OS-integrated services, desktop apps, keyboard utilities), AI usage on secondary devices (phone, tablet, secondary laptop), screen-share activity at the OS level, or virtual machines running the proctoring in a contained window while the host runs free.

OS-level proctoring operates at the operating-system layer - process detection, network-level enforcement, virtual machine detection, remote desktop detection, multi-monitor detection. It closes the architectural gap that browser-only proctoring cannot address. The 95% AI tool block rate Skolarli published from analysis of 50,000+ proctored assessments reflects the structural difference: OS-level integrity catches what browser-level integrity cannot see.

The operational decisions within this layer:

Decide the integrity tier required for your specific assessment context. Internal training quizzes don't need OS-level proctoring; competitive technical hiring assessments do. The integrity tier should match the consequence of an integrity failure. For consequential hiring decisions, OS-level integrity is the baseline. For lower-stakes screening, browser-level may be acceptable.

Decide the candidate experience tradeoffs explicitly. OS-level proctoring requires installation, has hardware compatibility constraints, and creates more candidate-side friction than browser-only. For high-stakes assessments where integrity matters substantially, the friction is justified. For lower-stakes assessments, browser-only delivery with weaker integrity may be the right operational choice.

Verify the proctoring covers the specific AI tools your candidate population is likely to use. Ask the proctoring vendor to name the specific AI assistants their system detects and blocks: ChatGPT desktop, ChatGPT browser extension, ChatGPT iOS keyboard, Copilot in Windows, Gemini in Chrome, Claude desktop, Perplexity, others. The specificity of the answer reveals the depth of the integrity infrastructure. Vendors who answer with vague "comprehensive AI detection" claims rarely have substantive coverage.

Verify the detection signature update cadence. New AI integration patterns emerge regularly. The detection signatures need to be updated continuously to keep pace. Ask the vendor how frequently signatures are updated and what the update process is. Vendors with weekly or biweekly updates are taking the arms race seriously; vendors with quarterly or ad-hoc updates are falling behind.

Plan for the failure cases. OS-level proctoring fails on some candidate devices - older hardware, locked-down corporate machines, candidates without administrative access. The operational sequence needs alternative paths for these candidates: scheduled testing on organisation-provided hardware, in-person assessment centres, alternative assessment formats. The exception path matters as much as the default path.

Layer 3 - Behavioural and pattern analysis post-submission

The third layer is the post-submission analysis that flags patterns suggesting AI assistance was used despite the integrity infrastructure. This layer catches the candidates who bypassed Layer 2 through methods the integrity infrastructure didn't cover, plus the cases where the integrity infrastructure was deliberately disabled or worked around.

The signals worth monitoring at this layer:

Typing pattern analysis. Human candidates type at variable speeds, with characteristic pauses for thinking, with typos and corrections. AI-assisted candidates often paste large code blocks at uncharacteristic speeds, with no typos, with no characteristic pauses. The pattern is detectable when monitored across the assessment session.

Time-on-question patterns. Strong human candidates spend characteristic time on each problem - initial reading, approach formulation, implementation, debugging, verification. AI-assisted candidates often spend uncharacteristically short time on hard problems (because AI solved them quickly) and uncharacteristically long time on easy problems (because they were waiting for AI output or trying to disguise the pattern). Time-on-question variance from typical patterns is a flag.

Code style consistency. Human candidates have characteristic coding styles - naming conventions, comment patterns, structural preferences. AI-assisted candidates often produce code with style inconsistencies across questions, because the AI's style and the candidate's style alternate as the candidate switches between AI assistance and manual implementation.

Explanation-implementation gap. Strong human candidates can explain their code at the level of detail they implemented it. AI-assisted candidates often produce code that's more sophisticated than their ability to explain it. Follow-up questions or live discussion exposes this gap.

Solution sophistication relative to question difficulty. Strong human candidates produce solutions calibrated to their experience level. AI-assisted candidates sometimes produce solutions that are uncharacteristically sophisticated relative to their stated experience - using language features, optimisation patterns, or design choices that don't match a candidate's typical level.

This layer should produce flags that surface to human reviewers, not auto-rejections. False positives at this layer are real - some strong human candidates have patterns that statistically resemble AI assistance - and auto-rejection on behavioural signals alone creates an unfair candidate experience.
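Two of the signals above, typing patterns and time-on-question, can be sketched in a few lines. This is an illustration under stated assumptions rather than how any particular proctoring product computes them: it assumes the editor logs one `EditEvent` per insertion action (so a paste arrives as a single large event), and the 80-character cut-off and the median-based deviation score are illustrative choices.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EditEvent:
    """One insertion action in the code editor (hypothetical log format)."""
    t: float          # seconds since the session started
    chars_added: int  # characters inserted by this single action


def large_insert_share(events, min_chars=80):
    """Fraction of all inserted characters that arrived in large single inserts.

    Manual typing produces many small events; pasted blocks produce a few
    large ones. min_chars is an illustrative cut-off, not a validated one.
    """
    total = sum(e.chars_added for e in events)
    if total == 0:
        return 0.0
    pasted = sum(e.chars_added for e in events if e.chars_added >= min_chars)
    return pasted / total


def robust_time_z(candidate_seconds, cohort_seconds):
    """Deviation of one candidate's time on a question from the cohort.

    Uses the median and the median absolute deviation (MAD) so a few
    extreme cohort times do not distort the baseline. The 0.6745 factor
    rescales MAD to be comparable with a standard deviation.
    """
    times = list(cohort_seconds)
    med = median(times)
    mad = median([abs(s - med) for s in times])
    if mad == 0:
        return 0.0  # no spread in the cohort: no meaningful deviation
    return 0.6745 * (candidate_seconds - med) / mad
```

Both functions return numbers, not verdicts. A high share or a large deviation is a reason to put the session in front of a reviewer, in line with the flags-not-rejections principle above.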

Layer 4 - Structured human review of flagged sessions

The fourth layer is the structured human review process that examines flagged sessions before consequential hiring decisions are made. This is the layer most teams under-invest in, even when the previous three layers are working.

The operational sequence for human review:

Define the flag threshold. Not every behavioural signal warrants human review. Establish a threshold - for example, flag for review when 3 or more behavioural signals exceed defined thresholds, OR when integrity infrastructure detected a blocked AI tool attempt. The threshold should produce a manageable review queue, not flag every candidate.
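The example rule in the paragraph above translates directly into code. A minimal sketch, assuming a hypothetical `SessionSignals` record; the signal names and threshold values a team passes in are their own to define and tune.

```python
from dataclasses import dataclass


@dataclass
class SessionSignals:
    """Behavioural signal values for one session (hypothetical structure)."""
    signal_scores: dict                   # signal name -> observed value
    blocked_ai_tool_attempt: bool = False  # reported by the integrity layer


def needs_review(session, thresholds, min_signals=3):
    """Apply the example rule: review if at least `min_signals` behavioural
    signals exceed their thresholds, OR a blocked AI tool attempt was logged.

    Returns (flagged, exceeded) so the reviewer can see which signals fired.
    """
    exceeded = [
        name
        for name, value in session.signal_scores.items()
        if name in thresholds and value > thresholds[name]
    ]
    flagged = session.blocked_ai_tool_attempt or len(exceeded) >= min_signals
    return flagged, exceeded
```

Returning the list of exceeded signals alongside the decision keeps the flag explainable, which the review protocol depends on.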

Establish a structured review protocol. Reviewers should follow a consistent protocol - examine the flagged signals, review the recorded session video where available, examine the code submission for the style and sophistication patterns, evaluate the time-on-question variances. The review should produce a documented decision with reasoning, not a gut-feel judgment.

Use multiple reviewers for consequential decisions. Where the integrity question affects whether a candidate advances to interview or receives an offer, two independent reviewers should evaluate the flagged session before the decision is made. Inter-reviewer agreement (or disagreement) becomes signal for the decision.

Surface the candidate's voice. For high-stakes decisions, offer the candidate the opportunity to discuss the flagged session before the decision is made: "We noticed some patterns in your assessment that we'd like to understand better - can we schedule a brief conversation?" Some candidates have legitimate explanations (accessibility tools that affect typing patterns, focus-mode applications that created tab-switch flags, slow internet connections that affected timing patterns). The conversation surfaces these cases and prevents false-positive rejections.

Document the review for audit. Every flagged session that becomes a hiring decision should have a documented review trail - the flags, the reviewer evaluations, the decision, the reasoning. For regulated hiring contexts, this audit trail is contractually or legally required. For unregulated contexts, it's defensive infrastructure against discrimination claims and supports continuous improvement of the system.

Track the false-positive and false-negative rates. Over time, the review process produces data on how often flagged sessions resulted in confirmed AI assistance vs how often they were legitimate human work. This data should feed back into Layer 3 (refining the behavioural signals and thresholds) and Layer 1 (refining the question design to reduce the cases that produce ambiguous signals).
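The tracking described above amounts to two ratios computed from resolved sessions. A minimal sketch, assuming a hypothetical `ReviewOutcome` record. Note that the miss rate can only be estimated from unflagged sessions that were later resolved some other way (for example in a follow-up interview), so it covers only the sessions that got such a check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewOutcome:
    """Final outcome for one session (hypothetical record)."""
    flagged: bool
    confirmed_ai: Optional[bool]  # None = never resolved either way


def review_rates(outcomes):
    """Summarise how flags compared with what reviewers eventually concluded.

    Unresolved sessions (confirmed_ai is None) are excluded from both ratios.
    """
    flagged = [o for o in outcomes if o.flagged and o.confirmed_ai is not None]
    unflagged = [o for o in outcomes if not o.flagged and o.confirmed_ai is not None]

    false_flags = sum(1 for o in flagged if not o.confirmed_ai)
    misses = sum(1 for o in unflagged if o.confirmed_ai)

    return {
        # share of resolved flags that turned out to be legitimate human work
        "false_positive_share_of_flags": false_flags / len(flagged) if flagged else None,
        # share of resolved unflagged sessions that turned out to be AI-assisted
        "miss_share_of_unflagged": misses / len(unflagged) if unflagged else None,
    }
```

A rising false-positive share suggests the Layer 3 thresholds are too loose; a rising miss share suggests the signals or the questions are no longer separating the two groups.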

How the four layers actually work together

A coding assessment that holds up against AI-assisted candidates is one where all four layers are operationally functional. The failure modes when layers are missing or weak:

Layer 1 weak, Layers 2-4 strong: The assessment relies on integrity infrastructure and human review to catch AI-assisted candidates whose question-level performance is indistinguishable from strong human performance. Some candidates inevitably bypass the integrity infrastructure undetected, and post-submission analysis on AI-solvable questions doesn't produce reliable signal. The hiring team experiences mis-hires that the system couldn't have caught.

Layer 2 weak, Layers 1, 3, 4 strong: The assessment relies on question design and behavioural analysis to catch AI assistance. Some candidates produce uncharacteristically sophisticated answers that question design alone can't filter; the behavioural analysis catches some but not all of these. The hiring team experiences mis-hires from sophisticated AI-assisted candidates.

Layer 3 weak, Layers 1, 2, 4 strong: The assessment relies on integrity infrastructure and human review of submissions. Some candidates bypass the integrity infrastructure successfully and produce submissions that look superficially human; without behavioural signal analysis, these get past the review layer. The hiring team experiences mis-hires from candidates who successfully circumvented Layer 2.

Layer 4 weak, Layers 1, 2, 3 strong: The assessment generates good signals but they don't translate into good hiring decisions. Flagged sessions either auto-reject (producing unfair candidate experience and false-positive errors) or get rubber-stamped without serious review (producing false-negative errors). The hiring team experiences both bad hires and frustrated candidates whose legitimate assessments were treated unfairly.

The pattern: each layer addresses failure modes the other layers cannot catch. The discipline of operating all four layers consistently is what makes coding assessment integrity actually work - and it's substantially harder than operating any single layer well.

What to verify when operating this system

Six verification disciplines worth establishing:

1. Question library audit, quarterly. Review the question library against current AI capability. Questions that AI now solves well should be retired or repositioned (used for screening only, not for consequential decisions). New questions should be tested against current AI models before being added to the library.

2. Integrity infrastructure penetration testing, quarterly. Test the integrity infrastructure against current AI integration patterns. Run a controlled test: have a trusted person attempt to use AI during a proctored assessment using various methods. Verify which methods the system catches and which it doesn't. Update detection signatures and processes accordingly.

3. Behavioural signal tuning, monthly. Review the false-positive and false-negative rates from the behavioural signal layer. Adjust thresholds and signals based on the data. New AI usage patterns produce new behavioural signatures; the analysis layer needs continuous refinement.

4. Human review consistency, monthly. Audit the human review decisions for consistency across reviewers. Inter-reviewer disagreement on flagged sessions should be investigated - it usually signals that the review protocol needs sharpening or the flag thresholds need calibration.

5. Candidate-side fairness audit, quarterly. Review the demographic patterns of flagged candidates and the demographic patterns of confirmed AI assistance. Pattern divergence between who gets flagged and who actually used AI indicates fairness issues that need addressing - either in the behavioural signals (some signals may have demographic bias) or in the review process (some reviewers may have inconsistent judgment).

6. Outcome measurement, annually. Track whether candidates who passed the assessment perform as expected in the role. Performance gaps signal that the assessment is producing false negatives - candidates passing the assessment without the capability the assessment is supposed to measure. The data informs whether the system is actually working or producing aesthetic compliance without operational effect.
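The fairness audit in item 5 compares who gets flagged with who is confirmed. One simple way to look for divergence is the confirmation rate of flags per group. The sketch below assumes hypothetical `(group, flagged, confirmed_ai)` records and is a starting point for investigation, not a statistical test.

```python
from collections import Counter


def confirmation_rate_by_group(records):
    """records: iterable of (group, flagged, confirmed_ai) tuples.

    Returns {group: (flags, confirmed, confirmation_rate)} over flagged
    sessions only. A group whose flags are confirmed far less often than
    others is being flagged without matching evidence, which points at a
    biased signal or inconsistent review.
    """
    flags = Counter()
    confirmed = Counter()
    for group, flagged, confirmed_ai in records:
        if not flagged:
            continue
        flags[group] += 1
        if confirmed_ai:
            confirmed[group] += 1
    return {
        group: (count, confirmed[group], confirmed[group] / count)
        for group, count in flags.items()
    }
```

Small groups produce noisy rates, so the counts are returned alongside each rate; a gap based on a handful of flags deserves a closer look before any conclusion is drawn.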

Where Skolarli's infrastructure fits this operational sequence

Skolarli's coding assessment platform and Skolarli Secure Browser are built around the four-layer model described above. Specifically:

Layer 1 (question design): Skolarli's question library is built around context-dependent, interactive, and judgment-heavy formats. Standard algorithmic puzzles are deprioritised in favour of debugging-format questions, multi-step reasoning sequences, and domain-calibrated problems. The library is reviewed quarterly against current AI capability.
Layer 2 (integrity infrastructure): Skolarli Secure Browser provides OS-level AI tool detection and blocking - ChatGPT, Claude, Copilot, Gemini, Perplexity, and other tools covered by the continuously updated AI tool signature library. It also provides network-level enforcement, virtual machine detection, remote desktop detection, and multi-monitor detection.
Layer 3 (behavioural analysis): SkoAI Proctor analyses typing patterns, time-on-question patterns, code style consistency, and other behavioural signals to produce a 0-100 trust score with severity-weighted violation logging.
Layer 4 (human review): Skolarli does not auto-reject candidates based on integrity signals. Flagged sessions surface to human reviewers with explainable evidence, supporting the human-in-the-loop discipline that distinguishes credible integrity infrastructure from rubber-stamp automation.

For hiring teams designing their own coding assessment integrity programmes, Skolarli's infrastructure handles the technical layers (2 and 3) and supports the operational layers (1 and 4) with the right tools and workflows. The operational discipline - the question library curation, the review protocol design, the threshold tuning, the outcome measurement - remains the customer's responsibility, because it depends on the customer's specific role context, hiring volume, and quality expectations.

Frequently asked questions

Can a coding assessment really hold up against current AI capability?

Yes, when operated across the four-layer model. No single layer is sufficient on its own. The combination of question design that resists AI, integrity infrastructure that prevents AI access, behavioural analysis that catches what infrastructure missed, and human review that examines flagged sessions produces an assessment that genuinely measures candidate capability.

Do we need to retire all standard algorithmic questions?

Not retire - reposition. Standard algorithmic questions still have value for screening, for testing foundational capability, and for assessment scenarios where consequential decisions don't depend on the result. For consequential hiring decisions, the question library should weight toward AI-resistant categories.

How do we know if our integrity infrastructure is actually working?

Run penetration tests quarterly. Have a trusted person attempt AI-assisted solving using various integration patterns and measure what the system catches. Marketing claims aren't a substitute for verification. Vendors who resist penetration testing are signalling that the infrastructure isn't as robust as marketed.

What's the right false-positive rate for behavioural analysis?

Lower than most teams assume. False positives on the behavioural layer create candidate-experience issues and waste reviewer time. The threshold should be calibrated so that flagged sessions warrant the review investment - typically flagging 5-10% of submissions, not 30-40%.
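One way to calibrate toward a target flag rate is to pick the cut-off from historical scores. A minimal sketch, assuming each past submission already has a single composite risk score where higher means more suspicious; how that score is built is outside this example.

```python
def cutoff_for_flag_rate(scores, target_rate=0.08):
    """Return the score cut-off that flags roughly `target_rate` of submissions.

    Sessions with score >= cutoff are flagged. Tied scores at the cut-off
    can push the realised rate above the target, so check it with flag_rate().
    """
    if not scores:
        raise ValueError("need at least one historical score")
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * target_rate))  # number of sessions to flag
    return ranked[k - 1]


def flag_rate(scores, cutoff):
    """Share of submissions that the given cut-off would flag."""
    return sum(1 for s in scores if s >= cutoff) / len(scores)
```

Recomputing the cut-off during the monthly tuning cycle keeps the review queue near the intended size as candidate behaviour and AI tools change.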

Can we use AI ourselves to detect AI-assisted candidates?

Yes, with discipline. AI-based detection of AI-generated code is technically feasible and operationally useful as one of several behavioural signals. The discipline: don't auto-reject based on AI detection alone; combine with other signals; verify the detection model's false-positive rate continuously; surface flags to human reviewers rather than acting on them automatically.

How long does it take to build this system?

For a hiring team starting from a standard coding assessment with browser-level proctoring: roughly 8-12 weeks of operational work to migrate to the four-layer model - question library audit and rebuild, integrity infrastructure migration, behavioural analysis configuration, human review protocol design. For teams starting from no coding assessment at all, building this from scratch is roughly 16-20 weeks.

About this piece

This post opens the Skolarli Operator's Compass, an analytical series from Skolarli Akademy Research covering the operational disciplines for hiring and L&D practitioners running programmes in the AI era. The series follows the Skolarli Buyer's Compass - where Buyer's Compass covers buying decisions, Operator's Compass covers operational execution.

Skolarli Akademy Research is the editorial arm of Skolarli Edulabs Pvt. Ltd., publishing analysis on learning, hiring, and assessment infrastructure. Findings are reviewed by Skolarli's founders and product leaders before publication.

Reviewed by Jayalekshmy Nair, Co-founder & CTO, Skolarli.

Tags: #operators-compass #hiring-execution #ai-resistant-assessment #coding-assessments