The Web Application Penetration Testing Methodology We Use

By cyberxploreAugust 5, 202410 min read

A senior tester walks through the web application penetration testing methodology we run: recon, auth, access control, injection, logic, and retest.

The Web Application Penetration Testing Methodology We Use

The best bug I found last quarter was not on the scanner report. It surfaced in the dull hour after the scan finished, on the fourth pass through a checkout flow, when I noticed the order_id in a confirmation request was a plain incrementing integer. Subtract one. Send it again. Someone else’s shipping address came back. That gap – between what a tool flags and what an attacker actually does with it – is the whole reason our web application penetration testing methodology is built the way it is.

What follows is the process we run on real engagements, not a slide from a sales deck. The order flexes depending on the app. The shape holds. Every stage feeds the next.

Key takeaways

A real web application penetration testing methodology is manual-led. Scanners handle coverage and noise; humans find the access control and business logic flaws that tools structurally cannot see.
We work in stages: scope and recon, mapping, authentication and session, access control, injection, business logic, then reporting and a retest to confirm the fixes actually landed.
The highest-impact findings are usually broken access control (IDOR) and logic abuse, not exotic memory corruption. These map to OWASP Top 10 A01 and stay the most common criticals we write up.
A finding without a reproducible proof of concept and a specific fix is half a finding. The report is the product.
Retesting matters as much as testing. A ticket marked “fixed” that nobody re-checked is a common way a live vulnerability ships to production.

What web application penetration testing actually is

Web application penetration testing is authorized, goal-driven attack simulation against a specific application, run by a human using the same tools and mindset a real attacker would. The point is not a list of theoretical weaknesses. The point is to prove which ones can be exploited, how far they reach, and what a determined outsider or a malicious logged-in user could actually get to.

That distinction drives everything else. A scan asks, does this pattern exist? A pentest asks, can I turn this into access to data or money? The second question is harder. It does not automate well. And that is exactly where the value sits.

Stage 1: Scope, rules of engagement, and recon

Before a single request goes out, we pin down scope in writing. Which hostnames and subdomains are in play. Which environment – we push hard for a staging mirror over live customer records. Which roles we get, whether we test as an unauthenticated outsider, a low-privilege user, or both, and which endpoints are off limits. Rules of engagement are not paperwork. Getting test accounts at two different privilege levels is often the single factor that decides whether we can prove an access control bug at all.

Recon then maps the attack surface. We enumerate subdomains, pull the JavaScript bundles and parse them for hidden API routes and hardcoded keys, read any Swagger or OpenAPI definition, and fingerprint the stack. A modern single-page app will happily leak its entire backend contract in a source-mapped JS file. Reading that file carefully beats guessing every time.

Stage 2: Mapping the application

Now we drive the app by hand with Burp Suite proxying everything. Every role, every workflow, every state change. Sign up, log in, reset a password, upload a file, buy something, invite a teammate, change a role, delete an object. The proxy history from a thorough walk-through becomes the map we attack from.

For the paths the UI never links to, we run ffuf and Burp’s crawler for content and parameter discovery. Old admin panels. Debug routes. A forgotten /api/v1 that never got the authorization checks someone bolted onto v2. That last one comes up constantly.

ffuf -w wordlist.txt -u https://app.example.com/api/FUZZ \
  -H "Authorization: Bearer <low-priv-token>" \
  -mc 200,201,401,403 -o results.json

Stage 3: Authentication and session management

Auth gets its own workstream, because getting it wrong is catastrophic and getting it subtly wrong is everywhere. We test password reset flows for token predictability and host-header poisoning. We check whether sessions actually die on logout and password change, or just pretend to. We look hard at JWT handling: algorithm confusion, missing signature verification, alg: none, tokens that never expire. And we probe multi-factor flows for the classic gap where the second factor is enforced in the UI but not on the API call underneath it.

Something we see on a depressing number of apps: rate limiting on the login form, none on the JSON endpoint behind it. If the front end throttles and the API shrugs, credential stuffing strolls right in.

Stage 4: Access control – where the criticals live

Broken access control is OWASP Top 10 A01, and it is the category that produces most of our critical findings. The classic is IDOR (insecure direct object reference, CWE-639): an object identifier in a request that the server trusts without ever checking whether the current user owns that object.

This is why two test accounts matter. We capture a request as user A, replay it as user B by swapping only the object ID or the session token, and watch whether user B walks off with user A’s data.

GET /api/v2/invoices/48213 HTTP/1.1
Host: app.example.com
Authorization: Bearer <user-B-token>

# Invoice 48213 belongs to user A.
# A 200 with A's data means IDOR.

We also test vertical escalation (can a standard user hit an admin-only function by calling the endpoint directly?), mass assignment (does adding "role":"admin" to a profile-update body actually stick?), and forced browsing to functions the menu hides but the server still serves. Scanners miss nearly all of this. They have no concept of who is supposed to own what. That context lives in a tester’s head.

Stage 5: Injection and server-side flaws

Injection is well understood, and it has not gone anywhere. It moved. We still probe for SQL injection with manual and automated payloads, but the hours now go into the variants frameworks did not solve for you: server-side template injection, SSRF where a URL parameter makes the server fetch internal resources (a straight line to cloud metadata endpoints), XML external entity processing on anything that still ingests XML, and command injection hiding in file processing or export features.

Cross-site scripting is still everywhere, especially DOM-based XSS in SPAs where a value flows from the URL into innerHTML or a framework sink with no sanitization. We confirm every candidate by hand. A reflected value in a response proves nothing. Execution in a browser proves it.

Stage 6: Business logic – the part tools cannot do

Business logic testing is where experience separates a real assessment from a scan-and-dump PDF. There is no signature for “the coupon applies twice,” or “you can skip the payment step and still land on the paid tier,” or “the quantity field accepts a negative number and credits your account.” Those only become findings when a human understands what the app is meant to do, then deliberately breaks the assumption underneath it.

On engagements we routinely find race conditions in redemption and withdrawal flows – fire the same request twenty times in parallel and see if a one-time action fires twice. Workflow steps that can be reordered or skipped. Limits enforced only on the client. This category rarely shows up in automated output, and it is often where the biggest real-world damage sits.

Stage 7: Reporting and the retest

The report is the deliverable, so we write it for two readers at once. Executives get a risk-ranked summary in business terms. Engineers get a per-finding writeup: severity (we score with CVSS but always temper it against real exploitability in your context), step-by-step reproduction, the exact request or payload, and a concrete fix. Not “sanitize input.” Which check to add, and where.

Then we retest. Once your team ships fixes, we re-run the exact attacks that worked and confirm they are dead, and we check we did not open a hole next door. You cannot word your way out of a finding; the payload either fires or it does not. A remediation nobody verified is how a “closed” vulnerability quietly reaches production.

How CyberXplore helps

This methodology is exactly what we run under our web application penetration testing service: senior testers, manual-led work, a report your engineers can act on, and a retest included so you are paying for fixes that hold, not a PDF that gathers dust. Want scoping help or a straight answer on effort and timeline for your app? Get a quote and we will walk your surface with you before you commit to anything.

FAQ

How long does a web application penetration test take?

For a typical mid-sized application, plan on roughly one to two weeks of active testing plus a few days for reporting. It scales with the number of roles, endpoints, and workflows in scope. Apps with heavy business logic or many privilege tiers take longer, because those are the parts you cannot rush. Scoping the app up front is what keeps the estimate honest.

What is the difference between a scan and a penetration test?

A vulnerability scan is automated pattern matching that reports known signatures and generates plenty of false positives. A penetration test is a human attacking the app to prove real, exploitable impact, including the access control and business logic flaws scanners cannot detect. We use scanners inside a pentest for coverage. They are the start of the work, not the whole of it.

Do you need production access or a staging environment?

We prefer a staging or pre-production environment that mirrors production behavior, ideally seeded with test data rather than real customer records. That lets us test aggressively without risking live data or uptime. If only production is available, we adjust the rules of engagement to test safely and avoid destructive actions.

Why do you ask for multiple user accounts and roles?

Because most critical findings are access control failures, and you cannot prove one without two perspectives. Testing as user A and user B confirms whether the server actually checks ownership. Testing a low-privilege account against admin functions exposes vertical escalation. Without those accounts we can suspect these bugs but not demonstrate them.

Is a retest included?

Yes. After your team remediates, we re-run the attacks that succeeded, confirm the fixes hold, and update the report status. A finding is not truly closed until someone has verified the fix works, and that verification is part of the engagement rather than an add-on.

Does this methodology cover APIs and single-page apps?

Yes. A modern web app is usually an SPA talking to a JSON or GraphQL API, so the API is where much of our time goes. We parse the front-end bundles to map the backend contract, then test the API endpoints directly for authorization, injection, and logic flaws instead of trusting whatever the UI happens to expose.

← Back to all articles