Website Content Crawler
By asermnasr
SCRAPE AI
Description
This tool extracts high-signal data from webpages to provide a clean, noise-free text stream optimized for training LLMs and powering AI applications.
Variables
url (string): https://doppelgangerdev.com
Behavior & Action Config
Wait Time (ms): 0
Rotate UA: false
Rotate Proxies: false
Rotate Viewport: false
Human Typing: false
Shadow DOM: true
Disable Recording: false
Stateless Exec: false
Stealth Features
fatigue
allowTypos
deadClicks
overscroll
idleMovements
naturalTyping
Automation Steps
No explicit actions defined in configuration.
Expected Output
{
"text": "Browser automation fordevelopersenterprisesfreelancersteamshobbyistsdevelopers\nDoppelganger runs on your hardware while giving you the power of a visual builder with block-based actions, optional JavaScript, and a secure API.\nGet Startedarrow_forwardView Documentation\nCapabilities\nBuild automation tasks locally, step by step.\nA block-based editor, multiple run modes, and exportable task definitions you can reuse.\nview_quilt\nBlock-Based Builder\nCreate tasks by stacking action blocks inside a visual editor.\nlayers\nModes: Scrape, Agent, Headful\nChoose the right execution mode per task from the dashboard.\nplay_circle\nAction Blocks\nClick, type, hover, wait, scroll, press, and run JavaScript.\ncode\nJavaScript Blocks\nAdd custom extraction and page logic where you need it.\noutput\nJSON Export\nCopy task definitions from the editor for reuse.\nlock\nSecure API Access\nTrigger and manage tasks through a secure API.\nUse Cases\nAutomation you can actually ship to production.\nFrom monitoring pages to QA checks, every flow runs on your hardware and can be accessed via the secure API.\nExplore all use cases arrow_forward\nPrice Monitoring\nTrack public pricing pages and collect the latest values.\nLead Enrichment\nCollect public signals and structure them for your own records.\nQA Regression\nRun scripted QA flows across environments and compare outputs.\nSecurity Testing\nSimulate real flows in a controlled, local environment.\nModes\nChoose the execution mode per task.\nScrape, Agent, and Headful modes are available from the task editor and the secure API.\nScrape\nFast extraction mode for straightforward data pulls.\ndownload\nAgent\nMulti-step flows with block sequencing and variables.\nsmart_toy\nHeadful\nManual, interactive browser session with no automation blocks.\nvisibility\nAPI\nSecure API access to run and manage tasks.\nTrigger tasks, monitor runs, and fetch results from a secure API alongside the dashboard.\nbolt\nTrigger Runs\nStart 
tasks programmatically through the secure API.\nsync\nCheck Status\nRead run progress and statuses without leaving your stack.\ndata_object\nFetch Results\nPull structured outputs directly from the API.\nView API Documentationarrow_forward\nterminal.sh\ncurl -X POST \"http://localhost:11345/tasks/:id/api\" \\\n -H \"x-api-key: dpl_9d2f...\" \\\n -d \"{\"target_url\": \"https://example.com\"}\""
}
Extraction Script
const html = $$data.html();
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
// 1. Remove noise (nav, footer, ads, sidebars, and the tool's own UI)
const junkSelectors = [
  'script', 'style', 'noscript', 'iframe', 'svg', 'meta', 'link',
  'nav', 'footer', 'header', 'aside',
  '[role="navigation"]', '[role="banner"]', '[role="contentinfo"]',
  '.sidebar', '.menu', '.navigation', '.nav', '.footer', '.ad', '.ads',
  '#dg-cursor-overlay', '#dg-click-dot', '#SFXInjectableContainer' // Exclude known tool overlays
];
junkSelectors.forEach(selector => {
  doc.querySelectorAll(selector).forEach(el => el.remove());
});
// 2. Focus on Main Content
const contentRoot = doc.querySelector('main, article, #content, .content, .main, #main-content') || doc.body;
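// 3. Improve text readability: add a newline after each block-level element
// so that adjacent blocks do not run together in textContent
contentRoot.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, div, br').forEach(el => {
  el.after(doc.createTextNode('\n'));
});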
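// 4. Extract text and normalize whitespace
let text = contentRoot.textContent
  .replace(/[\r\t]+/g, ' ')   // carriage returns and tabs become spaces
  .replace(/[ ]+/g, ' ')      // collapse runs of spaces
  .replace(/\n\s*\n/g, '\n')  // drop blank lines
  .trim();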
// 5. Filter out logs and UI artifacts
// This removes lines that match the tool's log format
const logPatterns = [
  /^›/,             // "› Scrolling page..."
  /^Clicking:/,     // "Clicking: a"
  /^Pressing key:/, // "Pressing key: Enter"
  /^boltRun/,       // UI button text
  /^play_arrow/,    // UI icon text
  /^localhost/      // UI header text
];
text = text.split('\n')
  .filter(line => {
    const trimmed = line.trim();
    // Keep the line if it DOESN'T match any of the log patterns
    return !logPatterns.some(pattern => pattern.test(trimmed));
  })
  .join('\n');
return JSON.stringify({
  text: text
}, null, 2);
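
The whitespace cleanup (step 4) and the log filter (step 5) do not touch the DOM, so they can be tried outside the tool. The following is a minimal sketch under that assumption, runnable in plain Node.js. The helper names `normalizeText` and `dropLogLines` and the sample string are hypothetical and exist only for illustration; they are not part of the tool or its API.

```javascript
// Hypothetical helpers mirroring steps 4 and 5 of the script above.

// Step 4: normalize whitespace in raw extracted text.
function normalizeText(raw) {
  return raw
    .replace(/[\r\t]+/g, ' ')   // carriage returns and tabs become spaces
    .replace(/[ ]+/g, ' ')      // collapse runs of spaces
    .replace(/\n\s*\n/g, '\n')  // drop blank lines
    .trim();
}

// Same patterns as the script's logPatterns array.
const LOG_PATTERNS = [
  /^›/,
  /^Clicking:/,
  /^Pressing key:/,
  /^boltRun/,
  /^play_arrow/,
  /^localhost/
];

// Step 5: drop every line whose trimmed form matches a log pattern.
function dropLogLines(text, patterns = LOG_PATTERNS) {
  return text
    .split('\n')
    .filter(line => !patterns.some(pattern => pattern.test(line.trim())))
    .join('\n');
}

// Made-up sample input: two content lines mixed with tool log lines.
const sample =
  'Block-Based   Builder\n\n\n' +
  '› Scrolling page...\n' +
  'Clicking: a\n' +
  'Create tasks\tby stacking blocks.\n' +
  'localhost:11345\n';

const cleaned = dropLogLines(normalizeText(sample));
console.log(JSON.stringify({ text: cleaned }, null, 2));
```

With this sample, only the two content lines survive. One thing the sketch makes easy to check: the patterns are anchored at the start of a line, so a content line that merely contains "Clicking:" further in is kept.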