
How to Extract URLs from Text — The Complete Guide

Everything you need to know about extracting URLs from plain text, HTML source code, log files, emails, and JSON — using Python, JavaScript, and no-code browser tools.

12 min read
Updated 2025
Beginner & Advanced

TL;DR — Quick Summary

  • URL extraction uses regex to find all http/https links in any text input
  • In Python: re.findall(r'https?://[^\s]+', text)
  • In JavaScript: text.match(/https?:\/\/\S+/g)
  • For HTML: use BeautifulSoup (Python) or querySelectorAll('a') (JS) for more reliable link scraping
  • No-code: paste text into the QuickTextTools URL Extractor for instant results

URLs are everywhere — buried inside HTML source code, scattered through server log files, embedded in email bodies, nested inside JSON API responses, and referenced in research documents. Manually finding and copying each one is not only tedious but error-prone at any significant scale.

URL extraction is the practice of automatically identifying and collecting all web addresses from a text source using pattern matching. Whether you are an SEO professional auditing outbound links, a developer debugging an API, a security analyst reviewing a suspicious email, or a researcher cataloging references, knowing how to extract URLs efficiently is an essential skill.

This guide covers everything from the fundamental concepts behind URL extraction to production-ready code in Python and JavaScript, best practices for handling edge cases, and the fastest no-code approach for everyday use.

What Is URL Extraction?

URL extraction is the process of scanning a body of text using a regular expression (regex) pattern to identify and collect all strings that match the structure of a web address. The result is a clean, machine-readable list of URLs that can be used for further processing — validation, deduplication, crawling, or simply human review.

Unlike web crawling (which actively fetches pages and follows links) or web scraping (which parses HTML DOM structure), URL extraction is purely text-based. It works on whatever raw text you provide — it does not need to fetch anything from the network and works offline just as well as online.

URL Extraction vs. Web Scraping vs. Web Crawling

URL Extraction: Regex-based pattern matching on static text input. No network requests. Works on any text format.
Web Scraping: Parses HTML DOM to extract structured data from a live or downloaded page. Requires HTML parsing library.
Web Crawling: Automatically fetches pages, extracts links, and follows them recursively. Requires HTTP client and scheduling.

Why You Need to Extract URLs

The need to extract URLs from text arises across a surprisingly wide range of disciplines. Here are the most common professional scenarios where URL extraction saves significant time.

SEO Link Auditing

Extract all outbound links from a page's HTML to identify external references, broken links, and anchor text patterns.

Security Analysis

Pull all URLs from suspicious emails, phishing messages, or malicious documents for threat intelligence analysis without clicking any links.

Log File Processing

Parse thousands of Apache, Nginx, or application log lines to extract all requested endpoints and external redirects.

Research & Citations

Collect all cited URLs from academic papers, bibliography sections, or reference lists for batch validation.

Content Migration

Extract all media, internal, and external URLs from legacy CMS exports before migrating to a new platform.

API Response Inspection

Pull all URLs from JSON API responses to understand what external resources a service references.

How URL Extraction Works

At its core, URL extraction is a regex search problem. A regular expression is a sequence of characters that defines a search pattern — in this case, a pattern that matches the structure of a URL.

The regex engine scans the input text character by character, attempting to match the URL pattern at each position. When a match is found, the full matched string is captured and added to the results list. The scan then continues from the end of the match, allowing multiple non-overlapping URLs to be found in a single pass.
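The left-to-right, non-overlapping scan described above can be observed with re.finditer, which reports where each match starts and ends. This is a minimal sketch using the simple pattern from this guide; the function name find_url_spans and the sample sentence are invented for illustration.

```python
import re

# Simple pattern: scheme, then everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def find_url_spans(text):
    """Return (start, end, url) for each non-overlapping match, left to right."""
    return [(m.start(), m.end(), m.group()) for m in URL_PATTERN.finditer(text)]

# Hypothetical sample text.
sample = "See https://a.example/x then http://b.example/y now"
spans = find_url_spans(sample)

for start, end, url in spans:
    # Each match begins at or after the end of the previous one.
    print(start, end, url)
```

Each reported span satisfies sample[start:end] == url, and the second span begins after the first one ends, which is what lets a single pass collect every URL.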

HOW A URL IS STRUCTURED
https://www.example.com/path/to/page?query=value&id=42#section
├── Protocol:  https://
├── Subdomain: www.
├── Domain:    example.com
├── Path:      /path/to/page
├── Query:     ?query=value&id=42
└── Fragment:  #section

A good URL regex must correctly handle all of these optional components while being conservative enough not to accidentally match non-URL text like file paths, sentence endings, or parenthetical content adjacent to a URL.
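To see these components programmatically, Python's standard urllib.parse module can split a URL into the same parts. The sketch below uses the example URL from the breakdown above; the dictionary keys are labels chosen for this illustration, not standard names.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.example.com/path/to/page?query=value&id=42#section"
parts = urlsplit(url)

# Keys are illustrative labels that mirror the breakdown above.
components = {
    "protocol": parts.scheme,         # "https"
    "host": parts.hostname,           # subdomain + domain
    "path": parts.path,
    "query": parse_qs(parts.query),   # query string parsed into a dict of lists
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:>9}: {value}")
```

Note that urlsplit does not separate the subdomain from the registered domain; it returns the whole host as one string.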

URL Regex Patterns Explained

There is no single universally "correct" URL regex — the right pattern depends on your input format and how conservative you need to be. Here are patterns from simple to comprehensive, with trade-offs explained.

Simple Pattern — Good for plain text
https?://[^\s]+
Matches anything starting with http:// or https:// up to the next whitespace. Fast and simple but may include trailing punctuation like commas or periods.
Standard Pattern — Good for most use cases
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&/=]*)
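Handles protocol, optional www, domain, TLD, path, query, and fragment, and matches most real-world URLs. Note that the {1,6} quantifier limits the final domain label to six characters, so hosts with longer TLDs (for example .photography) are missed. This is what QuickTextTools uses.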
Conservative Pattern — For HTML attributes
https?:\/\/[^\s<>'"\)\]]+
Stops at common HTML delimiters — angle brackets, quotes, parentheses, and square brackets. Better for HTML source where URLs are surrounded by attribute syntax.
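The trade-off between the simple and conservative patterns is easiest to see side by side. The following sketch runs both on one invented HTML snippet; the variable names and the snippet itself are assumptions made for illustration.

```python
import re

# Simple pattern: runs until the next whitespace character.
SIMPLE = re.compile(r"https?://[^\s]+")
# Conservative pattern: also stops at < > ' " ) ]
CONSERVATIVE = re.compile(r"https?://[^\s<>'\"\)\]]+")

# Hypothetical input mixing an HTML attribute and a URL inside parentheses.
html = '<a href="https://example.com/page">docs</a> (see https://example.com/faq).'

simple_hits = SIMPLE.findall(html)
conservative_hits = CONSERVATIVE.findall(html)

print(simple_hits)        # includes attribute syntax and trailing ")."
print(conservative_hits)  # stops at the closing quote and parenthesis
```

On this input the simple pattern drags the closing quote, tag text, and trailing punctuation into each match, while the conservative pattern returns the two bare URLs.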

Extract URLs in Python

Python is a popular choice for URL extraction tasks, thanks to the regex support in its standard library (the re module) and third-party libraries such as BeautifulSoup and requests for HTML-specific extraction.

Basic Regex Extraction

import re

text = """
Visit https://quicktexttools.in for tools.
Documentation: https://docs.example.com/api/v2
Legacy: http://old.example.com/endpoint?id=42
"""

pattern = r'https?://[^\s]+'
urls = re.findall(pattern, text)

print(urls)
# ['https://quicktexttools.in', 'https://docs.example.com/api/v2', 'http://old.example.com/endpoint?id=42']

With Deduplication and Sorting

import re

def extract_urls(text, unique=True, sort=False):
    # Stop at whitespace and at common delimiters: < > " ' ) ]
    pattern = r'https?://[^\s<>"\')\]]+'
    urls = re.findall(pattern, text)
    if unique:
        # Preserve order while removing duplicates
        urls = list(dict.fromkeys(urls))
    if sort:
        urls = sorted(urls)
    return urls

# Usage
with open('input.txt', 'r') as f:
    content = f.read()

results = extract_urls(content, unique=True, sort=True)

with open('urls.txt', 'w') as f:
    f.write('\n'.join(results))

print(f"Extracted {len(results)} unique URLs")

Extract URLs from HTML with BeautifulSoup

from bs4 import BeautifulSoup
import requests

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all href links
links = []
for tag in soup.find_all('a', href=True):
    href = tag['href']
    if href.startswith('http'):
        links.append(href)

# Also extract src attributes (images, scripts)
for tag in soup.find_all(src=True):
    src = tag['src']
    if src.startswith('http'):
        links.append(src)

unique_links = list(dict.fromkeys(links))
print(f"Found {len(unique_links)} unique links")

Extract URLs from a Log File

import re
from collections import Counter

# Stop at whitespace, quotes, and closing parentheses
pattern = r'https?://[^\s"\')]+'

with open('access.log', 'r') as f:
    log_content = f.read()

urls = re.findall(pattern, log_content)
url_counts = Counter(urls)

# Most frequently accessed URLs
for url, count in url_counts.most_common(10):
    print(f"{count:>6} {url}")

Extract URLs in JavaScript

JavaScript URL extraction works in both browser and Node.js environments. The browser offers additional DOM-based methods for HTML link extraction that are more reliable than regex for structured HTML.

Basic Extraction — Browser & Node.js

const text = `
Visit https://quicktexttools.in for tools.
API docs: https://docs.example.com/api/v2
Legacy: http://old.example.com/endpoint?id=42
`;

// Standard pattern; the underscore is included so URLs containing "_" are not cut short
const pattern = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&/=]*)/gi;

const urls = text.match(pattern) || [];

console.log(urls);
// ['https://quicktexttools.in', 'https://docs.example.com/api/v2', 'http://old.example.com/endpoint?id=42']

With Deduplication and Sorting

function extractUrls(
  text,
  { unique = true, sort = false, httpsOnly = false } = {}
) {
  // Standard pattern; the underscore is included so URLs containing "_" are not cut short
  const pattern =
    /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&/=]*)/gi;

  let urls = text.match(pattern) || [];

  if (httpsOnly) {
    urls = urls.filter(url => url.startsWith('https://'));
  }

  if (unique) {
    // A Set keeps the first occurrence of each URL in insertion order
    urls = [...new Set(urls)];
  }

  if (sort) {
    urls = urls.sort((a, b) => a.localeCompare(b));
  }

  return urls;
}

// Usage (text is any input string, such as the sample from the previous example)
const result = extractUrls(text, {
  unique: true,
  sort: true,
  httpsOnly: false,
});

console.log(`Found ${result.length} URLs`);

DOM-Based Link Extraction (Browser Only)

// Extract all links from the current page DOM
function extractPageLinks() {
  const anchors = document.querySelectorAll('a[href]');
  const imgSrcs = document.querySelectorAll('img[src], script[src], link[href]');
  const links = new Set();

  anchors.forEach(a => {
    if (a.href && a.href.startsWith('http')) {
      links.add(a.href);
    }
  });

  imgSrcs.forEach(el => {
    const src = el.src || el.href;
    if (src && src.startsWith('http')) {
      links.add(src);
    }
  });

  return [...links];
}

// Run in browser console on any page
console.log(extractPageLinks());

Real-World Use Cases

SEO Internal Link Audit

Extract all internal and external links from your site's exported HTML sitemaps or CMS page exports. Filter to your own domain to map internal link structure, then check for orphaned pages or pages with too few incoming links.
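Filtering to your own domain can be sketched as a hostname comparison on the extracted list. In this minimal example the function name split_by_site and the sample URLs are hypothetical; subdomains of the site are counted as internal, which is an assumption you may want to change.

```python
from urllib.parse import urlsplit

def split_by_site(urls, site_host):
    """Partition URLs into (internal, external) relative to site_host."""
    internal, external = [], []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        # Assumption: subdomains such as blog.example.com count as internal.
        if host == site_host or host.endswith("." + site_host):
            internal.append(url)
        else:
            external.append(url)
    return internal, external

# Hypothetical extracted list.
urls = [
    "https://example.com/a",
    "https://blog.example.com/b",
    "https://notexample.com/c",
    "https://other.org/",
]
internal, external = split_by_site(urls, "example.com")
print(len(internal), "internal,", len(external), "external")
```

Comparing parsed hostnames rather than using a substring test avoids misclassifying lookalike hosts such as notexample.com.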

Phishing Email Analysis

Security teams routinely receive suspicious emails for analysis. Extracting all URLs before clicking any of them allows analysts to submit links to VirusTotal, check domain reputation, and identify redirects — safely and systematically.

API Response Validation

When building integrations, API responses often contain URLs to resources, webhooks, or callback endpoints. Extracting all URLs from a JSON response body helps you verify that all expected endpoints are present and correctly formatted.
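One way to do this in Python is to parse the JSON first and then scan every string value, rather than running the regex over the raw response text. Parsing first matters because some JSON encoders escape forward slashes as \/, which a regex on the raw text would not match. The response body and the function name urls_in_json below are invented for illustration.

```python
import json
import re

# Conservative pattern: stops at whitespace and common delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def urls_in_json(value):
    """Recursively yield URLs found in any string inside parsed JSON."""
    if isinstance(value, str):
        yield from URL_PATTERN.findall(value)
    elif isinstance(value, dict):
        for item in value.values():
            yield from urls_in_json(item)
    elif isinstance(value, list):
        for item in value:
            yield from urls_in_json(item)

# Hypothetical API response body.
body = (
    '{"user": {"avatar": "https://cdn.example.com/a.png"}, '
    '"links": ["https://api.example.com/v1/orders", "n/a"], "count": 2}'
)
found = list(urls_in_json(json.loads(body)))
print(found)
```

Numbers, booleans, and null values are skipped because they cannot contain URLs.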

Academic Reference Collection

Researchers compiling literature reviews paste entire bibliography sections into the extractor to collect all cited URLs at once. The extracted list can be bulk-validated for link rot using a link checker tool.

Extracting URLs from HTML

HTML presents special challenges for URL extraction. URLs appear not just in visible text but in href attributes, src attributes, data-* attributes, CSS background-image declarations, JavaScript string literals, and meta refresh tags. A comprehensive HTML URL extraction strategy should account for all of these.

Where URLs Hide in HTML

<a href="https://...">    Anchor links (most common)
<img src="https://...">    Image sources
<script src="https://...">    External JavaScript
<link href="https://...">    CSS stylesheets, favicons
<iframe src="https://...">    Embedded frames
<video src="https://...">    Media elements
background-image: url("https:/...    Inline CSS
<meta http-equiv="refresh" con...    Meta refresh redirects

For simple extraction from raw HTML text, the regex approach works well. For structured HTML parsing where you need attribute-specific extraction or need to handle relative URLs, use BeautifulSoup (Python) or the browser DOM API (JavaScript) for more reliable results.
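If installing BeautifulSoup is not an option, attribute-specific extraction can also be sketched with Python's built-in html.parser module. The class name AttributeUrlCollector and the sample markup are hypothetical, and this version only looks at href and src attributes, not inline CSS or meta refresh tags.

```python
from html.parser import HTMLParser

class AttributeUrlCollector(HTMLParser):
    """Record (tag, attribute, value) for every href or src attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.found.append((tag, name, value))

# Hypothetical markup.
page = (
    '<link href="https://cdn.example.com/site.css">'
    '<a href="https://example.com/docs">Docs</a>'
    '<img src="https://cdn.example.com/logo.png">'
)
collector = AttributeUrlCollector()
collector.feed(page)

for tag, attr, value in collector.found:
    print(f"<{tag} {attr}> {value}")
```

Keeping the tag and attribute alongside each URL makes it possible to separate anchors from assets later.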

Extracting URLs from Log Files

Server access logs are one of the richest sources of URL data in any production environment. Apache and Nginx access logs record every request URL, referrer URL, and sometimes redirect target URL — thousands of entries per minute on a busy server.

Example Apache Combined Log Format
192.168.1.1 - - [01/Jun/2025:12:34:01 +0000] "GET /api/users HTTP/1.1" 200 1423 "https://app.example.com/dashboard" "Mozilla/5.0"
192.168.1.2 - - [01/Jun/2025:12:34:02 +0000] "POST /api/orders HTTP/1.1" 201 892 "https://app.example.com/checkout" "Mozilla/5.0"

In log files, URLs typically appear in three positions: the request path (a relative URL), the referrer field (a full absolute URL), and occasionally query string parameters. The regex extractor captures only absolute URLs, so request paths like /api/users are not extracted; only the full referrer URLs are captured.

For large log files, use the Python script approach with file streaming rather than loading the entire file into memory. Process the file line by line for memory efficiency on multi-gigabyte log archives.
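A line-by-line version might look like the sketch below. Iterating over a file object reads one line at a time, so memory use does not grow with file size. The function names count_urls and count_urls_in_file are hypothetical, and the sample lines reuse the log format shown above.

```python
import re
from collections import Counter

# Stop at whitespace, quotes, and closing parentheses.
URL_PATTERN = re.compile(r"https?://[^\s\"')]+")

def count_urls(lines):
    """Count URLs from any iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        counts.update(URL_PATTERN.findall(line))
    return counts

def count_urls_in_file(path):
    # A file object yields lines lazily; the whole file is never held in memory.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return count_urls(f)

# In-memory demonstration using the log format from this section.
sample_lines = [
    '192.168.1.1 - - [01/Jun/2025:12:34:01 +0000] "GET /api/users HTTP/1.1" 200 1423 "https://app.example.com/dashboard" "Mozilla/5.0"',
    '192.168.1.2 - - [01/Jun/2025:12:34:02 +0000] "POST /api/orders HTTP/1.1" 201 892 "https://app.example.com/checkout" "Mozilla/5.0"',
    '192.168.1.3 - - [01/Jun/2025:12:34:03 +0000] "GET /api/users HTTP/1.1" 200 1423 "https://app.example.com/dashboard" "Mozilla/5.0"',
]
counts = count_urls(sample_lines)
for url, n in counts.most_common():
    print(f"{n:>6} {url}")
```

Only the Counter of distinct URLs stays in memory, which is usually far smaller than the log itself.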

Best Practices for URL Extraction

Always deduplicate before using extracted lists

The same URL often appears multiple times in navigation, body, and footer. Deduplication significantly reduces the size of your working list and prevents redundant processing.

Validate extracted URLs before fetching them

Use urllib.parse.urlparse() in Python or the URL constructor in JavaScript to validate that extracted strings are well-formed URLs before making any HTTP requests to them.
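A minimal validation check in Python could look like this. The function name is_well_formed is hypothetical, and "well-formed" here only means an http or https scheme plus a hostname, which is a deliberately loose assumption.

```python
from urllib.parse import urlparse

def is_well_formed(url):
    """Return True if url parses with an http(s) scheme and a hostname."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        # urlparse raises ValueError for some malformed inputs, e.g. bad IPv6 brackets.
        return False
    return parsed.scheme in ("http", "https") and bool(host)

candidates = ["https://example.com/a", "http://", "ftp://example.com", "https://[bad"]
valid = [u for u in candidates if is_well_formed(u)]
print(valid)
```

This check says nothing about whether the URL actually responds; it only filters out strings that cannot be requested at all.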

Handle URL encoding carefully

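URLs in some contexts may be percent-encoded (%20 for space, %2F for /, etc.). Decode them with urllib.parse.unquote() before comparison to avoid treating the same URL as multiple distinct entries. Keep the original string for fetching, because decoding reserved characters such as %2F or %3F changes how the URL is parsed.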
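As a sketch of this idea, the decoded form can be used purely as a comparison key while the original string is what gets kept. The function name dedupe_by_decoded_form and the sample URLs are invented for illustration.

```python
from urllib.parse import unquote

def dedupe_by_decoded_form(urls):
    """Keep the first URL seen for each percent-decoded form."""
    seen = {}
    for url in urls:
        # The decoded string is only the dictionary key; the original is the value.
        seen.setdefault(unquote(url), url)
    return list(seen.values())

# Hypothetical list: %7E is the percent-encoded form of "~".
urls = [
    "https://example.com/a%7Eb",
    "https://example.com/a~b",
    "https://example.com/c",
]
deduped = dedupe_by_decoded_form(urls)
print(deduped)
```

Returning the original strings avoids altering URLs in which an encoded character such as %2F is significant.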

Normalize trailing slashes for deduplication

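https://example.com and https://example.com/ refer to the same resource but are treated as distinct strings by simple deduplication. Normalize by stripping trailing slashes before deduplicating. On deeper paths, /docs and /docs/ usually resolve to the same page, but a server may treat them differently, so check your target site before relying on this.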
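One possible way to apply this in Python is to strip the slash from the path component only, so that query strings and fragments are left untouched. The function name strip_trailing_slash is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_trailing_slash(url):
    """Remove trailing slashes from the path, leaving query and fragment intact."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

urls = [
    "https://example.com/",
    "https://example.com",
    "https://example.com/docs/?page=2",
]
# dict.fromkeys removes duplicates while preserving first-seen order.
normalized = list(dict.fromkeys(strip_trailing_slash(u) for u in urls))
print(normalized)
```

Working on the parsed path rather than the raw string avoids stripping a slash that belongs to a query value.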

Be conservative with regex boundaries

URLs in prose text are often followed by punctuation — periods, commas, closing parentheses. A conservative regex that excludes these trailing characters prevents garbage from being included in extracted URLs.
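An alternative to tightening the regex itself is to match loosely and then trim the result. The sketch below strips sentence punctuation and removes a closing parenthesis only when it has no opening partner inside the URL; the function name extract_clean, the punctuation set, and the sample text are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
TRAILING = ".,;:!?"  # punctuation assumed to belong to the sentence, not the URL

def extract_clean(text):
    """Extract URLs, then trim trailing punctuation and unbalanced ')'."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(TRAILING)
        # Drop ")" only while the URL has more closing than opening parentheses.
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1].rstrip(TRAILING)
        urls.append(url)
    return urls

# Hypothetical prose with URLs followed by punctuation.
text = (
    "Docs at https://example.com/guide. "
    "(Also https://en.wikipedia.org/wiki/URL_(disambiguation)), "
    "and https://example.com/a?b=1!"
)
cleaned = extract_clean(text)
print(cleaned)
```

The parenthesis check keeps URLs that legitimately end in a balanced ")" while still dropping the one that closes the surrounding sentence.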

Use browser-side tools for sensitive data

When extracting URLs from internal documents, emails, or private data, prefer browser-side tools that never transmit your data to external servers. Verify network activity if you are unsure.

Common Mistakes to Avoid

Using regex on structured HTML when a DOM parser is available

Regex is the right tool for arbitrary text. For HTML, BeautifulSoup or the browser DOM API handle edge cases — malformed tags, nested attributes, entity encoding — far more reliably than regex.

Extracting URLs and immediately fetching them without rate limiting

A large extracted URL list can contain thousands of URLs from the same domain. Fetching them all at once can overload the server or get your client throttled or banned. Always throttle requests when batch-checking extracted URLs.
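A simple per-host throttle can be sketched as a generator that pauses before yielding a URL whose host was seen too recently. The name throttled, the one-second default, and the injectable clock and sleep parameters (included so the logic can be tested without waiting) are all assumptions of this example, not a standard API.

```python
import time
from urllib.parse import urlsplit

def throttled(urls, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Yield URLs so that each host is yielded at most once per min_interval seconds."""
    last_seen = {}  # host -> time the host was last yielded
    for url in urls:
        host = urlsplit(url).hostname
        previous = last_seen.get(host)
        if previous is not None:
            wait = previous + min_interval - clock()
            if wait > 0:
                sleep(wait)
        last_seen[host] = clock()
        yield url

# Usage sketch (the fetch step is left to your HTTP client of choice):
# for url in throttled(extracted_urls, min_interval=2.0):
#     check_link(url)  # hypothetical function that performs the request
```

Because the delay is tracked per host, URLs on different domains are not slowed down by each other.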

Treating URL extraction as equivalent to link validation

A successfully extracted URL is not necessarily a working URL. URL extraction confirms the text pattern exists; link validation requires making an HTTP request. These are separate steps.

Forgetting to handle relative URLs when processing HTML

HTML pages often contain relative links like /api/v1/users or ../images/logo.png. Regex extraction misses these entirely. For comprehensive HTML link extraction, combine regex with DOM parsing and resolve relative URLs against the base URL.
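Resolving relative links against a base URL can be done with urllib.parse.urljoin from the standard library. In this sketch the base URL and the list of href values are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page address and the href values found on it.
base = "https://example.com/docs/guide/index.html"
hrefs = [
    "/api/v1/users",                   # root-relative
    "../images/logo.png",              # parent-directory relative
    "setup.html",                      # same-directory relative
    "https://cdn.example.com/app.js",  # already absolute
]

resolved = [urljoin(base, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print(f"{href:32} -> {absolute}")
```

Absolute URLs pass through unchanged, so the same call can be applied to every extracted attribute value.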

Not accounting for URL-encoded characters

If your source text comes from HTML source or a URL-encoded form body, ampersands and other URL characters may be encoded as &amp; (an HTML entity) or %26 (percent-encoding). Decode the text before extraction for accurate results.
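The two encodings call for different decoders. The sketch below uses html.unescape for HTML entities and parse_qs for a form body; both inputs are invented examples. Parsing the form body into fields first, rather than decoding the whole string, keeps the next parameter from being merged into the URL.

```python
import html
import re
from urllib.parse import parse_qs

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Hypothetical inputs.
html_source = '<a href="https://example.com/search?q=urls&amp;page=2">next</a>'
form_body = "redirect=https%3A%2F%2Fexample.com%2Fwelcome&lang=en"

# HTML: turn entities such as &amp; back into characters, then extract.
from_html = URL_PATTERN.findall(html.unescape(html_source))

# Form body: parse_qs splits the fields and percent-decodes each value.
values = [v for vs in parse_qs(form_body).values() for v in vs]
from_form = [url for v in values for url in URL_PATTERN.findall(v)]

print(from_html)
print(from_form)
```

Without the unescape step, the first URL would be captured with a literal "&amp;" in its query string.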

URL Extraction Methods Compared

Method | Best For | Skill Level | Speed
QuickTextTools URL Extractor | One-off tasks, any text format | None | Instant
Python re.findall() | Scripts, log files, automation | Beginner | Very Fast
Python BeautifulSoup | Structured HTML extraction | Intermediate | Fast
JavaScript match() | Browser extensions, Node.js scripts | Beginner | Very Fast
Browser DevTools Console | Quick DOM link extraction on live pages | Intermediate | Instant
grep (command line) | Large log files on Linux/macOS | Intermediate | Very Fast

Try the Free URL Extractor Now

No code required. Paste any text and extract every URL instantly — with deduplication, sorting, and download built in.



Conclusion

URL extraction is a fundamental skill for anyone who works with web data. Whether you are auditing a site's link profile, analyzing server logs, reviewing a suspicious email, or collecting references from a research paper, the ability to quickly pull all URLs from any text source saves significant time and eliminates manual error.

For quick one-off tasks, the QuickTextTools URL Extractor gives you instant results without any code. For automated workflows and production scripts, the Python and JavaScript patterns in this guide provide a solid foundation that you can extend with validation, normalization, and rate-limited fetching.

Remember the key principles: always deduplicate, validate before fetching, be conservative with regex boundaries to avoid capturing trailing punctuation, and choose browser-side tools when working with sensitive content.