Extract text from PDF

Pull the embedded text out of a PDF. Copy it to the clipboard or download it as a plain .txt file. Runs 100% in your browser and works offline; the file never leaves your device.

How to extract text from a PDF

  1. Drop your PDF or click to choose it.
  2. Pick "All pages" or specify a range like 1, 3, 5-7.
  3. Tune the formatting options if needed — page markers, line breaks, whitespace.
  4. Click "Extract text" — the result appears below.
  5. Copy it to your clipboard or download a .txt file.

What does "Extract text" do?

Extracting text means pulling the readable characters out of a PDF into plain text — strings you can paste into Word, Notes, an email, a spreadsheet, or a code editor. PDFtez walks every page (or just the pages you choose) and reads the embedded text positions, character by character, in the order a human would read them — left to right, top to bottom, with reasonable handling of columns.

Three options let you tune the output:

  • Include page markers. Insert --- Page N --- between pages so you can navigate a long extraction quickly. Useful for long documents; turn off when you just want clean prose.
  • Preserve line breaks. Keep the PDF's visual line structure. Turn off to get continuous paragraphs — better for pasting into a word processor that will re-wrap text.
  • Collapse extra whitespace. Replace runs of spaces with a single space. Helpful for tables and aligned data where PDFs often pad with many spaces to position text visually.
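The three options above can be sketched as a small post-processing step. This is a minimal Python illustration, not PDFtez's actual code (the tool itself runs as JavaScript in the browser). The function name `format_pages` and its parameters are hypothetical, it assumes each page has already been extracted as a list of visual lines, and it assumes a marker is placed at the start of every page. Paragraph detection is left out: with line breaks turned off, this sketch simply joins every line on a page.

```python
import re


def format_pages(pages, page_markers=True, preserve_line_breaks=True,
                 collapse_whitespace=False):
    """Join extracted pages into one string (hypothetical helper).

    `pages` is a list of (page_number, lines) tuples, where `lines` is the
    list of visual lines found on that page.
    """
    chunks = []
    for number, lines in pages:
        # Keep the visual line structure, or join lines into running text.
        separator = "\n" if preserve_line_breaks else " "
        text = separator.join(lines)
        if collapse_whitespace:
            # Replace runs of spaces or tabs with a single space.
            text = re.sub(r"[ \t]+", " ", text)
        if page_markers:
            # Assumption: the marker sits on its own line before the page text.
            text = f"--- Page {number} ---\n{text}"
        chunks.append(text)
    # Separate pages with one blank line.
    return "\n\n".join(chunks)
```

Calling `format_pages(pages, collapse_whitespace=True)` on a page containing the padded line `Name    Qty` would turn that line into `Name Qty`.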

There's an important limitation, which we surface in the tool itself: this extracts embedded text. PDFs from scanners or photos are typically images of pages with no embedded text, in which case there is nothing to extract — see the OCR section below.

When to use Extract Text

  • Quoting from a research paper. Pull selected pages from a PDF article so you can copy quotes accurately into notes or a literature review.
  • Repurposing report content. An old report needs to become a blog post, a slide deck, or a Word draft — extract the text first, restructure in your editor.
  • Feeding LLMs. Most large-language-model tools work better with plain text than with a PDF binary. Extract first, then paste into ChatGPT, Claude, or your in-house tool.
  • Indexing or searching long documents. A 500-page legal PDF is hard to search in a viewer; extracted to .txt, you can grep it from the command line or load it into any text-search tool.
  • Translation workflows. Most translation tools accept plain text more reliably than PDFs. Extract, translate, then re-pour into a layout if needed.
  • Reading on a small screen. A reflowable text file is much easier to read on a phone than a fixed-layout PDF — extract the text, send it to your reading app of choice.
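For the indexing and searching use case in the list above, here is a minimal Python sketch of what you can do with the extracted .txt once page markers are turned on. The function name `find_term` is hypothetical, and the sketch assumes each marker appears on its own line in the form `--- Page N ---`.

```python
import re

# Matches a page marker line such as "--- Page 12 ---".
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$")


def find_term(text, term):
    """Return (page, line) pairs for lines containing `term`, ignoring case.

    Lines that appear before any page marker are reported with page None.
    """
    hits = []
    page = None
    needle = term.lower()
    for line in text.splitlines():
        marker = PAGE_MARKER.match(line.strip())
        if marker:
            # Remember the current page and skip the marker line itself.
            page = int(marker.group(1))
            continue
        if needle in line.lower():
            hits.append((page, line.strip()))
    return hits
```

Reading the downloaded file with `open(path, encoding="utf-8").read()` and passing the result to `find_term` gives a quick page-level index for any term.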

How PDFtez extracts text (under the hood)

PDFtez uses PDF.js — the same engine that powers Firefox's built-in PDF viewer — to parse the PDF and walk its text content streams. Every glyph in a PDF carries position information (where on the page it sits) and the underlying character it represents. PDFtez reads each page's stream, sorts the glyphs in reading order, joins them into lines based on vertical position, and assembles the lines into pages.
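The line-building step described above can be sketched as follows. This is a simplified Python illustration, not the PDF.js or PDFtez implementation. The `(x, y, text)` tuple shape, the `tolerance` value, and the function name `group_into_lines` are assumptions made for the example, and y is assumed to grow upward as in PDF page coordinates. Column handling is left out.

```python
def group_into_lines(items, tolerance=2.0):
    """Group positioned text items into lines in reading order.

    `items` is a list of (x, y, text) tuples (hypothetical shape). An item
    joins the current line when its y is within `tolerance` of the y of the
    first item on that line.
    """
    # Sort top to bottom (larger y first), then left to right.
    ordered = sorted(items, key=lambda item: (-item[1], item[0]))
    lines = []  # each entry: [y of the line's first item, [(x, text), ...]]
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order the pieces by x and join them with a space.
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]
```

The tolerance matters because items on the same visual line often have slightly different baselines; too small a value splits lines apart, and too large a value merges neighbouring lines.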

The "Preserve line breaks" toggle controls whether the visual line structure (as it appears in the PDF) is kept or whether line breaks are removed inside paragraphs so the text re-flows. "Collapse whitespace" applies a simple replacement after the extraction step.

The output is plain UTF-8 text. You can copy to the clipboard with one click, or download as a .txt file. No upload, no server, no logging — the entire extraction runs in your browser.

What about scanned PDFs? (OCR)

A scanned PDF is a series of images of pages. There is no embedded text, only pixels. To turn those pixels back into readable text you need Optical Character Recognition (OCR) — a separate process that looks at the image of each character and predicts which letter it represents.

PDFtez's Extract Text tool deliberately does not run OCR client-side because high-quality OCR requires a large language-aware model (typically 30–100 MB to download) and significant CPU. Running it in your browser would mean a long initial load and slow processing on every file. OCR is on the PDFtez roadmap as a separate tool with a clear "this will load a 50 MB model" disclosure.

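In the meantime, two OCR options work well: (a) upload the scanned PDF to Google Drive and open it with Google Docs (right-click → Open with → Google Docs), which performs OCR automatically on the way in; (b) use the text-recognition (OCR) feature in Adobe Acrobat, which is part of the paid Acrobat Pro product. Google Docs processes the file on Google's servers, and Acrobat's online tools do the same on Adobe's, so consider whether the scanned document is sensitive before uploading it.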

How is PDFtez Extract Text different?

Most online extract-text tools upload your PDF to a server, do the extraction there, and let you copy or download the result. PDFtez does the operation locally — no upload, no daily limit, no watermark, and the extraction is instant for documents up to a few hundred pages. The formatting controls (page markers, line breaks, whitespace) are more granular than most free tools provide.

Compared to desktop tools like Adobe Acrobat's "Save as Text" or pdftotext on the command line, PDFtez is a friction-free option when you need this once or twice — no software install, no licence.

Frequently asked questions

Why is my extraction empty?

Almost certainly because the PDF is a scan: an image of each page, with no embedded text. PDFtez's Extract Text only pulls embedded text; for scans you need an OCR step first (see the OCR section above). To check, open the PDF in your browser and try to select a word with your cursor. If you can't select anything, it's a scan.
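The same check can be expressed in code once the text has been extracted. This is a rough heuristic sketched in Python, not something PDFtez exposes; the function name `looks_like_scan` and the `min_chars` threshold are hypothetical.

```python
def looks_like_scan(page_texts, min_chars=1):
    """Return True when no page yields at least `min_chars` visible characters.

    `page_texts` is a list with one extracted string per page. Whitespace is
    ignored when counting. An empty list also returns True.
    """
    return all(len("".join(text.split())) < min_chars for text in page_texts)
```

A document that mixes scanned and digital pages would return False here, so a per-page check may be more useful in practice.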

Will tables come out correctly?

PDF tables don't carry table structure — they're just text positioned in columns. Extraction will produce the text but the column alignment depends on how the original PDF was constructed. Turning on Collapse whitespace usually helps; for spreadsheet-shaped tables, PDF to Excel is the better tool because it explicitly reconstructs the table.

Does it handle non-English characters?

Yes. The extraction handles any character the PDF encodes — including extended Latin, Greek, Cyrillic, Arabic, CJK, mathematical symbols, and emoji. The output is UTF-8 so it survives copy-paste into any modern editor or web form.

Can I extract from specific pages only?

Yes. Switch to "Specific pages" and use the same syntax as the Split tool: 1, 3, 5-7. The output includes only those pages, in the order you listed them. Useful when you need a specific chapter or section.
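The page syntax can be illustrated with a short parser. This is a minimal Python sketch under stated assumptions, not PDFtez's own code: the function name `parse_page_ranges` is hypothetical, and rejecting reversed or out-of-bounds ranges is a choice made for the example, since the tool's exact behavior in those cases is not documented here.

```python
def parse_page_ranges(spec, page_count):
    """Expand a spec like "1, 3, 5-7" into 1-based page numbers, in listed order."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Assumption: reversed or out-of-bounds ranges are treated as errors.
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start, end + 1))
    return pages
```

Because the result keeps the order in which the parts were written, a spec such as `5-6, 2` yields pages 5 and 6 followed by page 2.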

Will the extracted text preserve bold, italics, or fonts?

No — plain text by definition has no styling. The extraction discards font, size, bold, and italic information. If you need formatted output, use PDF to Word instead, which preserves a much richer subset of the original formatting.

Where do my files go? Are they uploaded?

Your files stay on your device. The extraction runs entirely in your browser. No file data is uploaded, no copy is stored, and nothing is logged. You can verify this in DevTools → Network while you extract.
