
Website Content Crawler

By asermnasr

SCRAPE, AI

Description

This tool strips navigation, headers, footers, ads, sidebars, and tool overlays from a webpage and returns the remaining main-content text as a clean, newline-separated stream suited to training LLMs and powering AI applications.

Variables

url (string)

https://doppelgangerdev.com

Behavior & Action Config

Wait Time (ms): 0
Rotate UA: false
Rotate Proxies: false
Rotate Viewport: false
Human Typing: false
Shadow DOM: true
Disable Recording: false
Stateless Exec: false

Stealth Features

fatigue
allowTypos
deadClicks
overscroll
idleMovements
naturalTyping

Automation Steps

No explicit actions defined in configuration.

Expected Output

{
  "text": "Browser automation fordevelopersenterprisesfreelancersteamshobbyistsdevelopers\nDoppelganger runs on your hardware while giving you the power of a visual builder with block-based actions, optional JavaScript, and a secure API.\nGet Startedarrow_forwardView Documentation\nCapabilities\nBuild automation tasks locally, step by step.\nA block-based editor, multiple run modes, and exportable task definitions you can reuse.\nview_quilt\nBlock-Based Builder\nCreate tasks by stacking action blocks inside a visual editor.\nlayers\nModes: Scrape, Agent, Headful\nChoose the right execution mode per task from the dashboard.\nplay_circle\nAction Blocks\nClick, type, hover, wait, scroll, press, and run JavaScript.\ncode\nJavaScript Blocks\nAdd custom extraction and page logic where you need it.\noutput\nJSON Export\nCopy task definitions from the editor for reuse.\nlock\nSecure API Access\nTrigger and manage tasks through a secure API.\nUse Cases\nAutomation you can actually ship to production.\nFrom monitoring pages to QA checks, every flow runs on your hardware and can be accessed via the secure API.\nExplore all use cases arrow_forward\nPrice Monitoring\nTrack public pricing pages and collect the latest values.\nLead Enrichment\nCollect public signals and structure them for your own records.\nQA Regression\nRun scripted QA flows across environments and compare outputs.\nSecurity Testing\nSimulate real flows in a controlled, local environment.\nModes\nChoose the execution mode per task.\nScrape, Agent, and Headful modes are available from the task editor and the secure API.\nScrape\nFast extraction mode for straightforward data pulls.\ndownload\nAgent\nMulti-step flows with block sequencing and variables.\nsmart_toy\nHeadful\nManual, interactive browser session with no automation blocks.\nvisibility\nAPI\nSecure API access to run and manage tasks.\nTrigger tasks, monitor runs, and fetch results from a secure API alongside the dashboard.\nbolt\nTrigger Runs\nStart 
tasks programmatically through the secure API.\nsync\nCheck Status\nRead run progress and statuses without leaving your stack.\ndata_object\nFetch Results\nPull structured outputs directly from the API.\nView API Documentationarrow_forward\nterminal.sh\ncurl -X POST \"http://localhost:11345/tasks/:id/api\" \\\n  -H \"x-api-key: dpl_9d2f...\" \\\n  -d \"{\"target_url\": \"https://example.com\"}\""
}
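
The result is a JSON string with a single `text` field whose lines are separated by `\n`. As a rough sketch of how a consumer might read it, the snippet below parses such a string and splits it into non-empty lines. The `outputLines` helper is hypothetical, and `rawOutput` is a shortened stand-in built from a few lines of the result above, not the full output.

```javascript
// Shortened stand-in for the JSON string shown above; the real "text" value is much longer.
const rawOutput = JSON.stringify({
    text: 'Capabilities\nBuild automation tasks locally, step by step.\nview_quilt\nBlock-Based Builder'
});

// Hypothetical helper: parse the result and split the text into non-empty, trimmed lines.
function outputLines(json) {
    const { text } = JSON.parse(json);
    if (typeof text !== 'string') {
        throw new TypeError('expected a string "text" field');
    }
    return text
        .split('\n')
        .map(line => line.trim())
        .filter(line => line.length > 0);
}

const lines = outputLines(rawOutput);
console.log(lines.length);
console.log(lines[0]);
```

Lines such as `view_quilt` appear to be icon-font names that survive extraction, so a downstream consumer may want to filter them separately.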

Extraction Script

const html = $$data.html();
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');

// 1. Remove "Noise" (Nav, Footer, Ads, Sidebars, AND Tool UI)
const junkSelectors = [
    'script', 'style', 'noscript', 'iframe', 'svg', 'meta', 'link',
    'nav', 'footer', 'header', 'aside', 
    '[role="navigation"]', '[role="banner"]', '[role="contentinfo"]',
    '.sidebar', '.menu', '.navigation', '.nav', '.footer', '.ad', '.ads',
    '#dg-cursor-overlay', '#dg-click-dot', '#SFXInjectableContainer' // Exclude known tool overlays
];
junkSelectors.forEach(selector => {
    doc.querySelectorAll(selector).forEach(el => el.remove());
});

// 2. Focus on Main Content
const contentRoot = doc.querySelector('main, article, #content, .content, .main, #main-content') || doc.body;

// 3. Improve Text Readability
contentRoot.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, div, br').forEach(el => {
    el.after(doc.createTextNode('\n'));
});

// 4. Extract Text
let text = contentRoot.textContent
    .replace(/[\r\t]+/g, ' ')
    .replace(/[ ]+/g, ' ')
    .replace(/\n\s*\n/g, '\n')
    .trim();

// 5. FILTER OUT LOGS & UI ARTIFACTS
// This removes lines that match the tool's log format
const logPatterns = [
    /^›/,                   // "› Scrolling page..."
    /^Clicking:/,           // "Clicking: a"
    /^Pressing key:/,       // "Pressing key: Enter"
    /^boltRun/,             // UI Button text
    /^play_arrow/,          // UI Icon text
    /^localhost/            // UI Header text
];
text = text.split('\n')
    .filter(line => {
        const trimmed = line.trim();
        // Keep the line if it DOESN'T match any of the log patterns
        return !logPatterns.some(pattern => pattern.test(trimmed));
    })
    .join('\n');

return JSON.stringify({
    text: text
}, null, 2);
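
Steps 4 and 5 of the script operate on plain strings, so they can be tried outside the browser. The following is a minimal sketch rather than part of the preset: `cleanText` is a hypothetical helper that applies the same whitespace normalization and log-line filter, and the sample input is invented for illustration.

```javascript
// Same patterns as the script's logPatterns array.
const LOG_PATTERNS = [
    /^›/,
    /^Clicking:/,
    /^Pressing key:/,
    /^boltRun/,
    /^play_arrow/,
    /^localhost/
];

// Hypothetical helper: mirrors steps 4 and 5 on a plain string.
function cleanText(raw) {
    const normalized = raw
        .replace(/[\r\t]+/g, ' ')   // carriage returns and tabs become a space
        .replace(/[ ]+/g, ' ')      // collapse runs of spaces
        .replace(/\n\s*\n/g, '\n')  // collapse blank lines
        .trim();

    return normalized
        .split('\n')
        .filter(line => !LOG_PATTERNS.some(pattern => pattern.test(line.trim())))
        .join('\n');
}

// Sample input is made up for illustration.
const sample = 'Pricing\t\tplans\n\n\n› Scrolling page...\nClicking: a\nStarter   tier\n';
console.log(cleanText(sample));
```

Because every pattern is anchored at the start of the line, a genuine content line that happens to begin with `localhost` or `Clicking:` would be dropped as well.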

Preset Details

Target URL

{$url}
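
The `{$url}` placeholder refers to the `url` variable listed earlier. How the tool resolves placeholders internally is not shown on this page; the sketch below only illustrates the idea with a hypothetical `resolvePlaceholders` function that replaces `{$name}` tokens from a variables object.

```javascript
// Hypothetical illustration of `{$name}` substitution; not the tool's actual resolver.
function resolvePlaceholders(template, variables) {
    return template.replace(/\{\$(\w+)\}/g, (match, name) =>
        // Leave unknown placeholders untouched rather than inserting "undefined".
        Object.prototype.hasOwnProperty.call(variables, name)
            ? String(variables[name])
            : match
    );
}

const variables = { url: 'https://doppelgangerdev.com' };
console.log(resolvePlaceholders('{$url}', variables));
```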

Time Estimate

Highly variable

Created

2/21/2026
