Competitor's Site Crawler


Turn a domain into a library of clean, AI-ready source pages — in about 15 to 30 minutes.

TL;DR

  • What it is — the pipeline that turns a domain into reusable source pages ready to feed into AI generation. Content Studio scrapes your pages or a competitor's, extracts the main content, and cleans it into structured form ready for blog generation.
  • Who it's for — marketers who need to seed generation with their own content or a competitor's; ops leads who need to tune rate limits for specific sites.
  • Top outcome — from "here's a domain" to "you have 50 clean source pages ready for the generator" in 15 to 30 minutes, running in the background.

At a glance

Plan tiers — Bundled with Lodgestory CRM on Growth and above.
Who can use it — Owners and Admins create or edit sites. Editors can trigger scrapes. Viewers are read-only.
What you produce — Clean source pages (title, body, summary, keywords, category, related URLs), stored in your organisation and ready to feed into generation.
Limits you'll see — Up to 100 pages per scrape run by default; per-site concurrency and delay settings for polite scraping.
API — Not partner-facing. Triggered from the Sites page.

How to find it

Sites setup: left nav → Sites → Add Site. Enter a domain, plus an optional sitemap URL and crawl seed URLs.

Run a scrape: click Scrape on any site row.

Check progress: the Sites page polls for live updates — discovered pages, scraped pages, cleaned pages, and errors.

Inspect sources: click the domain row to see every cleaned source page. Click any row for the source detail view with the clean body and metadata.

Direct URL: /sites for configuration, /sources/<domain> for the library of clean pages.

Screenshot [SCREENSHOT: scraping-nav.png — Sites page with Add Site and Scrape buttons outlined]

What is Sources & Scraping?

The problem it solves

AI generators are only as good as their inputs. Feed them a blank prompt and you get generic copy. Feed them a raw HTML dump and you waste tokens on nav bars, cookie banners, and ads. Scraping without discipline — no rate limits, no stealth, no handling for bot-challenge sites — gets your IP blocked, your account banned, or fills your library with half-downloaded pages.

Content Studio solves the full problem. It discovers URLs from your sitemap, fetches the pages politely, removes navigation and ads, extracts the main content, and produces a structured clean page with title, body, summary, keywords, category, and related URLs. That clean page is a first-class citizen the rest of Content Studio can generate from.

What you get

  • Sitemap-first discovery. Respects your site's own URL inventory. Falls back to crawling only when no sitemap is available.
  • Per-site rate limits. Concurrency and delay are site-level settings, so one picky site doesn't constrain another.
  • Non-blocking runs. Trigger a scrape and close the tab. The run continues in the background. The Sites page shows live progress when you come back.
  • Structured clean output. Every scraped page becomes a clean source with title, body (in structured form), summary, keywords, category, and related URLs — not just a blob of text.
  • Audit trail. Every scrape run is recorded with counts, timing, and errors.

How it's different

  • Content-first output. Structural extraction plus an AI cleaning pass handles both layout and editorial cleanup in one step. Single-stage regex produces dirty output; single-stage AI is expensive.
  • Per-site bot-challenge flag. Sites that use Cloudflare or similar challenge pages are configured with a toggle, so the scraper starts up in the right mode for that site.
  • Include/exclude URL patterns. For large sites, scope the scrape to /blog/* or /properties/* without wasting fetches on other paths.
  • Per-site model override. Clean cheaper blogs with a fast model; use a premium model for sources that need denser extraction.

Customer scenarios

  1. "Seed generation from our own pages." Add example.com. Paste your sitemap URL. Click Scrape. In about 30 minutes you have 50 clean source pages, each ready to feed /generate/recreate or /generate/suggest-new in the editor.
  2. "Analyse a competitor listing." Add competitor.com with one page in the crawl seed URLs. Scrape. One clean source appears. Generate angles from it.
  3. "Incremental refresh." Re-run the scrape weekly. Already-scraped pages get skipped unless they changed; new pages get picked up.
  4. "Handle a bot-challenge site." Edit the site config and flip the Uses bot challenge toggle on. The scraper adjusts its behaviour to work through the challenge.

How it fits with the rest of Lodgestory

  • Downstream: every source in AI Content Generation is a clean page produced here.
  • Sideways: the Editor launches generation from any clean source.

Screenshot [SCREENSHOT: scraping-landing.png — Sites page with multiple domains, status chips, last-run timestamps]

Core concepts

Term — What it means
Site — A domain configuration: your domain, a competitor's, anything you've added. Holds sitemap URLs, crawl seeds, URL patterns, and rate-limit settings.
Sitemap URL — The address of the site's machine-readable URL list. The scraper fetches this first to discover what pages exist.
Crawl seed URL — A starting point for crawling when no sitemap is available (or to supplement the sitemap). The scraper walks from here and follows same-domain links.
URL pattern — Regex include/exclude filters applied during discovery. Essential for large sites where you want only /blog/*.
Source — A clean page produced by the scraper. The input for AI generation.
Clean page — The scraper's structured output: title, body, summary, keywords, category, and related URLs.
Run — One scrape operation over a site. Has counts (discovered, scraped, cleaned) and a start/end time.
Concurrency — How many pages are fetched in parallel for this site. Default is 2.
Delay — Time between requests in milliseconds, with a small random jitter. Default is 3,000 ms.
Uses bot challenge — Per-site toggle. Flip on for sites that use bot-challenge pages.
Active — Whether the site is enabled for scraping. Toggling off retires a site without deleting its sources.
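The clean-page record described above can be pictured as a small data structure. A minimal Python sketch, with hypothetical field names that mirror this table (the actual storage schema is internal to Content Studio):

```python
from dataclasses import dataclass, field

@dataclass
class CleanPage:
    """Illustrative shape of one clean source page; field names follow
    the concepts table, not a real Content Studio API."""
    url: str
    title: str
    body: str                 # cleaned body in structured form, serialised
    summary: str
    keywords: list[str] = field(default_factory=list)
    category: str = ""
    related_urls: list[str] = field(default_factory=list)

# Example record for a single scraped blog post (values are made up).
page = CleanPage(
    url="https://example.com/blog/spring-weddings",
    title="Spring Weddings at Example Manor",
    body="## Why spring?\n...",
    summary="A guide to spring wedding packages.",
    keywords=["weddings", "spring"],
    category="weddings",
    related_urls=["https://example.com/blog/summer-weddings"],
)
```

Everything downstream (generation, suggestions) consumes records shaped roughly like this, one per scraped URL.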

Quick Start — scrape a domain end-to-end

Step 1 — Add the site

Sites → Add Site.

  • Domain: example.com
  • Sitemap URL: https://example.com/sitemap.xml
  • Include patterns: for example, ^https://example\.com/blog/
  • Exclude patterns: anything you want to skip.
  • Uses bot challenge: off (flip on if the site uses one).
  • Active: on.

Save.

Screenshot [SCREENSHOT: scraping-qs-1-add.png]

Step 2 — Trigger the scrape

Click Scrape on the row. The run starts in the background; you're free to leave the page.

Screenshot [SCREENSHOT: scraping-qs-2-start.png]

Step 3 — Watch progress

The Sites page polls for live updates — discovered, scraped, cleaned, errors. Progress is typically visible within seconds of starting.

Step 4 — Inspect sources

After about 15 to 30 minutes, click the domain row. You see the list of every clean source page produced by this and previous runs.

Step 5 — Open a source

Click a row. The source detail view shows the cleaned body, summary, keywords, category, and related URLs.

Screenshot [SCREENSHOT: scraping-qs-5-source.png — source detail with body, keywords, Recreate and Suggest new buttons]

Step 6 — Generate a blog from it

Click Recreate from this source or Suggest new blogs. See AI Content Generation for the full flow.

How it works

The scraper runs in three phases.

  1. Discovery. It fetches your sitemap (or crawls from your seed URLs), collects candidate pages, and applies your include/exclude patterns to filter down to what you want.
  2. Fetching. For each URL, the scraper loads the page politely — respecting your per-site concurrency and delay. It handles modern JavaScript rendering, lazy-loaded content, and bot-challenge pages.
  3. Cleaning. The main content is extracted from the rendered HTML. An AI cleaning pass strips any remaining chrome, converts the body into structured form, and extracts keywords, category, and related URLs.

At the end, you have a library of clean source pages — one per scraped URL — ready to feed into generation. Already-cleaned pages are skipped on future runs unless they've changed, so re-running the scrape is cheap and incremental.

If the server restarts mid-run, the run tracker resets, but the audit record is preserved. Re-trigger the scrape to resume.

flowchart LR
    A[Add Site] --> B[Configure sitemap / patterns]
    B --> C[Trigger Scrape]
    C --> D[Discover → Fetch → Clean]
    D --> E[Clean source pages]
    E --> F[Available in Editor for generation]

Features in depth

Site configuration

Create a site, then edit anytime. Key fields:

  • Sitemap URLs — the recommended discovery path. Respects your site's URL inventory.
  • Crawl seed URLs — starting points when you don't have a sitemap, or to cover pages outside it.
  • URL patterns — regex include/exclude. Essential for large sites. Example include: ^https://example\.com/blog/.
  • Concurrency and delay — tune for polite scraping. Conservative defaults of 2 concurrent requests and 3,000 ms between them work for most sites.
  • Uses bot challenge — flip on for sites that present bot-challenge pages.
  • Model — the AI model used for cleaning. A fast default is fine for most sites; premium models help for dense or unusually formatted content.
  • Active — off retires a site without losing its sources.
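Taken together, a site configuration looks roughly like this. The keys are illustrative and simply mirror the fields listed above; the persisted shape is internal to Content Studio.

```python
# Hypothetical site configuration, mirroring the fields in this section.
site_config = {
    "domain": "example.com",
    "sitemap_urls": ["https://example.com/sitemap.xml"],
    "crawl_seed_urls": [],                    # empty: sitemap covers discovery
    "include_patterns": [r"^https://example\.com/blog/"],
    "exclude_patterns": [r"/tag/", r"/page/\d+"],
    "concurrency": 2,                         # default per the doc
    "delay_ms": 3000,                         # default per the doc
    "uses_bot_challenge": False,
    "model": "fast-default",                  # placeholder model name
    "active": True,
}
```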

On-demand scrape

Click Scrape on any site row. The run starts immediately in the background. You can close the page; the run keeps going.

Tip: to force a full re-scrape of a changed domain, ask support to reset processed flags on the URLs before running. Otherwise, already-cleaned pages are skipped.

Live status

The Sites page polls for live updates. For each site you see:

  • Discovered — how many URLs were found.
  • Scraped — how many pages were fetched.
  • Cleaned — how many have been converted into clean sources.
  • Errors — if any fetches failed, the count and the reasons.

Per-domain stats

Each site shows aggregate counts — "50 of 63 pages processed" — so you can see completeness at a glance without diving into run history.

Source library

Sources → pick a domain gives you the list of every clean source page for that domain. Click any row for the source detail view.

The detail view shows:

  • The cleaned body in structured form.
  • Summary, keywords, category.
  • Related URLs — the AI's suggestions for further pages that might be worth scraping next. Paste the good ones into your crawl seed URLs.

Suggestions per source

Each source remembers every suggestion ever generated from it. Useful to avoid re-running Suggest new blogs on the same source and duplicating work.

Tuning a site for best results

Pick the right discovery strategy

  • Use a sitemap whenever one exists. It's faster, cleaner, and respects your site's published URL inventory.
  • Use crawl seeds when there's no sitemap or when you want to include pages the sitemap doesn't list (for example, a competitor's homepage that then links out to their listings).
  • Combine both when the sitemap is partial. The scraper deduplicates so there's no risk of double-counting.
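The deduplication mentioned above comes down to normalising URLs before comparing them. A hypothetical Python sketch (the scraper's actual normalisation rules may differ):

```python
from urllib.parse import urlsplit, urlunsplit

def normalise(url: str) -> str:
    """Drop fragments and trailing slashes so the same page counts once."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def merge_discovered(sitemap_urls: list[str], crawled_urls: list[str]) -> list[str]:
    """Combine sitemap and crawl results, keeping the first form seen."""
    seen: set[str] = set()
    merged: list[str] = []
    for url in [*sitemap_urls, *crawled_urls]:
        key = normalise(url)
        if key not in seen:
            seen.add(key)
            merged.append(url)
    return merged
```

So a page listed in the sitemap as `/blog/a/` and found by the crawler as `/blog/a` is fetched only once.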

Narrow the scope with patterns

Large sites have hundreds or thousands of URLs. Most of them aren't the content you want to feed into generation. URL patterns let you include only the paths that matter:

  • Blog posts: ^https://example\.com/blog/
  • Listings: ^https://example\.com/properties/
  • Specific categories: ^https://example\.com/blog/(weddings|corporate-events)/

And exclude the noise:

  • Tag and author archives: /tag/, /author/
  • Paginated indexes: /page/\d+
  • Utility pages: /privacy, /terms
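Include/exclude filtering amounts to two regex passes: a URL must match at least one include pattern and no exclude pattern. A minimal Python sketch using the example patterns above (`filter_urls` is a hypothetical helper, not a Content Studio API):

```python
import re

def filter_urls(urls, include=(), exclude=()):
    """Keep URLs matching any include pattern, then drop any that match
    an exclude pattern. Empty include means 'include everything'."""
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]
    kept = []
    for url in urls:
        if inc and not any(p.search(url) for p in inc):
            continue
        if any(p.search(url) for p in exc):
            continue
        kept.append(url)
    return kept

urls = [
    "https://example.com/blog/spring-weddings",
    "https://example.com/blog/tag/news",
    "https://example.com/privacy",
]
kept = filter_urls(urls,
                   include=[r"^https://example\.com/blog/"],
                   exclude=[r"/tag/"])
```

Only the blog post survives: the tag archive hits an exclude pattern, and the utility page never matches an include pattern.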

Tune rate limits for the target

The default is conservative — 2 concurrent requests with a 3-second delay. For your own sites you can usually go faster. For sites that respond slowly or that may have rate limits, stay conservative.

  • Your own low-traffic domain: 4 concurrent, 1,500 ms delay.
  • Your own production marketing site: stay at defaults — don't risk affecting your visitors.
  • A competitor site: 1 concurrent, 5,000 ms delay. Be a polite guest.
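The delay-plus-jitter behaviour can be sketched as a sequential loop. This ignores concurrency for clarity, and both `polite_fetch_all` and `fetch` are hypothetical stand-ins for the scraper's internals:

```python
import random
import time

def polite_fetch_all(urls, fetch, delay_ms=3000, jitter_ms=500):
    """Fetch each URL in turn, pausing delay_ms plus a small random
    jitter between requests. `fetch` is whatever downloads one page."""
    pages = {}
    for i, url in enumerate(urls):
        pages[url] = fetch(url)
        if i < len(urls) - 1:  # no need to wait after the last page
            time.sleep((delay_ms + random.uniform(0, jitter_ms)) / 1000)
    return pages
```

At the defaults, that is roughly 3 to 3.5 seconds between requests, which is why a 50-page run lands in the 15-to-30-minute range once rendering time is included.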

Refreshing source pages

Sources aren't static — the world changes and so do the underlying pages. Re-running a scrape is incremental: already-cleaned pages are skipped, so you're only fetching and cleaning new pages. If you specifically need to refresh pages that have changed, contact support to mark them for re-processing.

Weekly or monthly re-scrapes work well for most sites. For fast-moving content (news, listings), you may want tighter cycles.
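One common way to implement this kind of skip logic is a content hash per URL. A sketch of the idea, assuming a hypothetical `processed` map standing in for the scraper's internal processed flags:

```python
import hashlib

def needs_cleaning(url: str, html: str, processed: dict[str, str]) -> bool:
    """Return True if this page is new or its content changed since the
    last run; record the new hash so future runs can skip it."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if processed.get(url) == digest:
        return False         # unchanged since last run: skip
    processed[url] = digest  # new or changed: (re)clean and remember
    return True
```

Resetting the processed flags (the support-side operation mentioned above) is equivalent to clearing entries from that map, which forces a full re-clean on the next run.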

Handling bot-challenge sites

Some sites present a bot-challenge page before serving real content. Flip the Uses bot challenge toggle on for those sites, and the scraper adjusts its behaviour to work through the challenge. Not every bot-challenge setup is handleable; aggressive ones may still block. If so, the error log for the run shows you what happened.

Choosing a cleaning model

The default cleaning model is a fast general-purpose AI that works well for most content. Override to a premium model for:

  • Sites with unusual HTML where the default struggles to extract the main content.
  • Dense technical pages where you want thorough keyword and category extraction.
  • Multilingual sites where the default may miss nuances.

For most sites, the default is fine. Save the premium model for the ones that need it.

Roles & permissions

Action — Owner / Admin / Editor / Viewer
Create or edit site — Yes / Yes / No / No
Trigger scrape — Yes / Yes / Yes / No
View sources — Yes / Yes / Yes / Yes
View scrape status — Yes / Yes / Yes / Yes
Retire site (set inactive) — Yes / Yes / No / No

Connections — cross-module workflows

flowchart LR
    A[Add Site] --> B[Configure sitemap / patterns]
    B --> C[Trigger Scrape]
    C --> D[Discover → Fetch → Clean]
    D --> E[Clean sources]
    E --> F[Sources list]
    F --> G[Editor: Recreate or Suggest new]
    G --> H[Generated blogs]

What this module reads

  • Your sitemap and pages from the target domains.
  • Your organisation's per-site config and patterns.

What this module produces

  • Clean source pages consumable by AI Content Generation.
  • Audit records of every scrape run with counts and timing.

Limits you'll see

Limit — Default
Concurrency per site — 2
Delay between requests — 3,000 ms with a small random jitter
Max pages per run — 100
Max URLs discovered (crawl) — 1,000
Max crawl depth — 5
Page fetch timeout — 30 seconds
Retry attempts per URL — 3
AI cleaning model (default) — Fast general-purpose; override per site if you need denser cleaning
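The retry and timeout limits combine roughly as follows. A hedged Python sketch: `fetch` stands in for whatever performs one bounded download, and the backoff value is an assumption (the doc does not specify one).

```python
import time

def fetch_with_retries(url, fetch, attempts=3, timeout_s=30, backoff_s=2.0):
    """Try a URL up to `attempts` times, each try bounded by `timeout_s`
    (enforced inside `fetch`), pausing between tries. Re-raises the last
    error if every attempt fails."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url, timeout=timeout_s)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(backoff_s)
    raise last_error
```

A URL that fails all three attempts lands in the run's error log with its final error as the reason.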

Errors & FAQ

Common situations

Symptom — What to do
"Sitemap fetch failed." — The site may block sitemap fetches without specific headers. Add the sitemap URL to the crawl seed URLs instead, so the scraper walks from there.
Scrape repeatedly times out on a site. — The target is slow or behind heavy JavaScript. Lower concurrency; flip Uses bot challenge on if applicable.
Scrape returns 0 URLs. — Your include pattern is too strict. Loosen the regex; verify by opening the sitemap URL in a browser first.
Pages are scraped but never cleaned. — Your AI credits may be exhausted, or the cleaning model is throttled. Contact support.
The clean body is empty or mangled. — The site has unusual DOM structure. Try a different cleaning model for that site.
Every request is blocked. — A bot-challenge system is aggressive. Flip Uses bot challenge on. If still blocked, scraping may not be possible without explicit permission from the site owner.
URL pattern isn't matching. — Likely escaping issues in the regex. Test it in a regex sandbox first.
Scrape looks stuck. — Check the Sites page for the last update timestamp. If no movement for more than 10 minutes, contact support — the run tracker may need to be reset.

FAQ

  • How do I force a re-scrape of already-processed pages? There's no UI toggle today. Contact support to reset the processed flags on specific URLs.
  • Can I scrape a single page without crawling? Yes. Add a site with only that URL in the crawl seed URLs, an empty sitemap, and a tight include pattern.
  • Does the scraper respect robots.txt? The scraper does not fetch robots.txt explicitly. Respectful behaviour comes from the per-site delay and your own judgement about which sites to point it at.
  • Are there anti-crawl ethics I should know about? Treat the scraper as "I own this site or have explicit permission." Don't point it at sites whose terms forbid scraping.
  • Can I export all sources as a ZIP? Not in the UI today.
  • Why are some URLs discovered but never fetched? They may violate your exclude patterns, or they may already exist as cleaned sources from a prior run.

Changelog

  • Apr 2026 — Per-site model override for AI cleaning.
  • Mar 2026 — Uses bot challenge per-site toggle.
  • Feb 2026 — Structured clean output — every source has title, body, summary, keywords, category, and related URLs.
  • Jan 2026 — Non-blocking runs with live progress.

Related modules & next steps