Skip to content

da-troll/DOM-Daddy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

██████╗░░█████╗░███╗░░░███╗░░██████╗░░█████╗░██████╗░██████╗░██╗░░░██╗
██╔══██╗██╔══██╗████╗░████║░░██╔══██╗██╔══██╗██╔══██╗██╔══██╗╚██╗░██╔╝
██║░░██║██║░░██║██╔████╔██║░░██║░░██║███████║██║░░██║██║░░██║░╚████╔╝░
██║░░██║██║░░██║██║╚██╔╝██║░░██║░░██║██╔══██║██║░░██║██║░░██║░░╚██╔╝░░
██████╔╝╚█████╔╝██║░╚═╝░██║░░██████╔╝██║░░██║██████╔╝██████╔╝░░░██║░░░
╚═════╝░░╚════╝░╚═╝░░░░░╚═╝░░╚═════╝░╚═╝░░╚═╝╚═════╝░╚═════╝░░░░╚═╝░░░

                   E X T R A C T   A N Y T H I N G
                          by Trollefsen Labs

A Manifest V3 Chrome extension that extracts structured data from sites that fight scraping. Currently:

  • ChatGPT, Claude, Gemini, AI Studio, Perplexity chats → Markdown / Text / JSON / CSV
  • LinkedIn experience pages (/in/{you}/details/experience/) → Markdown / JSON / CSV (one row per role)

Pure client-side. No server, no build step, no analytics, no dependencies.

Install (unpacked)

  1. Open chrome://extensions.
  2. Toggle Developer mode on (top-right).
  3. Click Load unpacked and select the extension/ directory in this repo.
  4. Pin the extension to the toolbar.
  5. Open a supported page, click the icon, choose a format.

For LinkedIn specifically: open your (or another user's) profile, click Show all experience, and run DOM Daddy on the resulting …/details/experience/ page. The popup will offer a "Take me there" shortcut if you click it on the wrong sub-page.

Works on any Chromium browser (Chrome, Edge, Brave, Arc, etc.).

Architecture

extension/
  manifest.json
  src/
    background/            Thin service worker (lifecycle hooks only)
    content/               One extractor per supported host
      chatgpt.js  claude.js  gemini.js  aistudio.js  perplexity.js
      linkedin.js          /in/{slug}/details/experience/
    lib/
      schema.js            Conversation + Profile types (kind discriminator)
      markdown.js          HTML -> Markdown converter (no deps)
    exporters/
      exporters.js         export{Markdown,Text,JSON,CSV} for conversations
                           export{ProfileMarkdown,ProfileJSON,ProfileCSV} for profiles
    popup/
      popup.html / .css / .js   User-facing UI; branches on result kind
  icons/

Data flow

Popup opened
  -> popup.js sends { type: 'EXTRACT' } to the active tab
  -> matching content script returns { kind, ...data }
  -> popup branches on kind, runs the right exporter
  -> chrome.downloads delivers the file

The shared schema (extension/src/lib/schema.js) is the contract between extractors and exporters. Two shapes today: Conversation (kind: 'conversation') and Profile (kind: 'profile').

Why dynamic import() in content scripts?

MV3 doesn't allow content scripts to be declared as ES modules. To still share markdown.js and schema.js across extractors, each content script does await import(chrome.runtime.getURL('src/lib/...js')). The shared files are listed under web_accessible_resources so they're loadable from the page context.

Known limitations

  • Selectors drift. When a site reorganizes, only that site's content script needs to change. Stable anchors:
    • ChatGPT: [data-message-author-role], [data-message-id]
    • Claude: [data-testid="user-message"], .font-claude-message
    • Gemini: user-query, model-response (Angular component tags)
    • AI Studio: ms-chat-turn (role from data-turn-role); content in ms-cmark-node
    • Perplexity: [class~="group/query"] for queries, .prose for answers
    • LinkedIn: [componentkey^="entity-collection-item-"] per company entry; we parse innerText line-by-line and ignore hashed CSS classes entirely.
  • Virtualized / collapsed UIs lose data. Chat sites unmount off-screen messages; "Show thinking" details and LinkedIn's …see more may collapse content. Scroll/expand before extracting.
  • LinkedIn +N skills. The "+N skills" overflow on roles can't be pulled without clicking the chip — we capture the visible skills and store the hidden count as hiddenSkillCount.
  • Canvas / Artifacts (ChatGPT side panel, Claude artifacts) aren't currently captured.
  • No real chat timestamps. None of the chat hosts expose creation date or per-message timestamps in the DOM, so the date in the filename is the export date.

Filenames and download-manager extensions

Conversations: {source}-YYYYMMDD-{sessionId}.{ext}. Profiles: {source}-{slug}-YYYYMMDD.{ext}.

If the Save As dialog shows a different filename than what we suggested, another installed extension is hooking chrome.downloads.onDeterminingFilename and overwriting our suggestion (Chrome only honors the most recently installed listener — there's no override). The popup's green "Saved …" status surfaces the real on-disk filename, so you can compare.

Known offender: Suno Tracks Exporter. Disable it (or any other download-manager extension) while exporting if you need the suggested filename to land.

Extending

Add a new conversation host

  1. Add the host to manifest.json (host_permissions, content_scripts, web_accessible_resources).
  2. Add a SITES entry in extension/src/popup/popup.js with kind: 'conversation'.
  3. Copy an existing chat extractor and rewrite its selector block.

Add a new structured-data host (profile-style)

  1. Same manifest updates, plus a pageReady/pageHint if the data lives on a sub-page.
  2. New extractor returning makeProfile(...) (or a new schema shape if the data isn't profile-like).
  3. New exporters in exporters.js if a new kind needs different output formats.

Add a new format

Add an exportXxx(data, opts) function in exporters.js returning { filename, blob }, then wire a button in popup.html and a case in runExport() in popup.js.

License

Code is licensed under the Apache License 2.0 (see LICENSE) — you can use, modify, and redistribute it commercially, subject to the usual Apache obligations.

The name DOM Daddy, the mascot, and all files under extension/icons/ plus favicon.ico are © Trollefsen Labs and are not covered by the code license. Forks must rename and re-skin before publishing to the Chrome Web Store or any other distribution channel. Full carve-out in NOTICE.

Why this exists

A from-scratch alternative to closed-source export extensions, with the goal of (1) keeping all extracted data inside the browser and (2) being trivially auditable — no minified bundles, no server, no analytics.

About

Extract structured data from sites that fight scraping. ChatGPT, Claude, Gemini, AI Studio, Perplexity, LinkedIn → Markdown/JSON/CSV. Client-side MV3.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors