How do I extract meta tags from HTML?
Extracting meta tags involves parsing the HTML <head> section and identifying all <meta> tags along with their attributes.
Tools to use:
Use an HTML parser appropriate for your language, or inspect the page manually:
- Cheerio (Node.js)
- Beautiful Soup (Python)
- Browser DevTools (for manual inspection)
Basic extraction:
- Select all <meta> tags using CSS selectors:
  - meta[name] for standard meta tags
  - meta[property] for Open Graph tags
- Extract attributes:
  - name or property (the meta tag identifier)
  - content (the meta tag value)
- Also extract <title> tags and relevant <link> tags (canonical, alternate languages, RSS feeds)
Organizing extracted data:
- Group by category (SEO, Open Graph, Twitter Cards, technical tags)
- Create key-value pairs where the name/property is the key and content is the value
- Handle multiple tags with the same name (like multiple keywords tags)
- Preserve order when it matters
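The organizing steps above can be sketched in plain JavaScript, independent of any parser. This is a minimal sketch under stated assumptions: the `pairs` input, the `categorize` rules, and the category names (`seo`, `openGraph`, `twitter`, `technical`) are hypothetical and chosen only for illustration. Each key maps to an array of values, so repeated tags are kept rather than overwritten, and values stay in document order.

```javascript
// Hypothetical input: [name-or-property, content] pairs in document order,
// as an HTML parser might yield them.
const pairs = [
  ['description', 'A sample page'],
  ['og:title', 'Sample'],
  ['og:image', 'https://example.com/a.png'],
  ['og:image', 'https://example.com/b.png'],
  ['twitter:card', 'summary'],
  ['viewport', 'width=device-width'],
];

// Assumed category rules; adjust the prefixes and names to your needs.
function categorize(key) {
  if (key.startsWith('og:')) return 'openGraph';
  if (key.startsWith('twitter:')) return 'twitter';
  if (['description', 'keywords', 'robots', 'author'].includes(key)) return 'seo';
  return 'technical';
}

function groupMetaTags(entries) {
  const groups = {};
  for (const [key, content] of entries) {
    const category = categorize(key);
    if (!groups[category]) groups[category] = {};
    // Store values in an array so a repeated key appends instead of
    // overwriting; array order follows the order of the input.
    if (!groups[category][key]) groups[category][key] = [];
    groups[category][key].push(content);
  }
  return groups;
}

const grouped = groupMetaTags(pairs);
console.log(JSON.stringify(grouped, null, 2));
```

Storing arrays everywhere costs a little convenience for single-valued tags, but it means callers never have to check whether a value is a string or a list.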
Example code (Node.js/Cheerio):
import * as cheerio from 'cheerio';
// `html` is assumed to hold the page's markup as a string
const $ = cheerio.load(html);
const metaTags = {};
// Extract standard meta tags (a repeated name overwrites the earlier value)
$('meta[name]').each((i, el) => {
  const name = $(el).attr('name');
  const content = $(el).attr('content');
  metaTags[name] = content;
});
// Extract Open Graph tags (a repeated property overwrites the earlier value)
$('meta[property]').each((i, el) => {
  const property = $(el).attr('property');
  const content = $(el).attr('content');
  metaTags[property] = content;
});
// Extract title
metaTags.title = $('title').text();
Handling edge cases:
Some sites:
- Use non-standard meta tag attributes
- Dynamically generate meta tags via JavaScript (requiring a headless browser to render the page before extraction)
- Have malformed HTML with unclosed or nested meta tags
- Use custom meta tags for internal purposes (like fb:app_id for Facebook)
Common patterns:
- Check for <link rel="canonical"> for the preferred URL
- Look for <meta name="robots"> to understand indexing preferences
- Find <meta name="author"> for content attribution
- Extract structured data from JSON-LD script tags (though technically not meta tags)
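As an illustration of reading indexing preferences, the content of a robots meta tag can be split into individual directives. This is a minimal sketch: `parseRobots` is a hypothetical helper, it takes the already-extracted content string, and it only interprets the common `noindex`, `nofollow`, and `none` directives. An absent tag is treated as allowing both indexing and link following, which matches the usual default.

```javascript
// Hypothetical helper: interpret the content of <meta name="robots">.
function parseRobots(content) {
  // Directives are comma-separated and case-insensitive.
  const directives = new Set(
    (content || '')
      .split(',')
      .map((d) => d.trim().toLowerCase())
      .filter(Boolean)
  );
  return {
    directives,
    // "none" is shorthand for "noindex, nofollow".
    index: !directives.has('noindex') && !directives.has('none'),
    follow: !directives.has('nofollow') && !directives.has('none'),
  };
}

const robots = parseRobots('NOINDEX, follow');
console.log(robots.index, robots.follow);
```

Other directives (for example `noarchive`) remain available in the returned `directives` set for callers that need them.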
Our Meta Tag Extractor automatically handles all these cases and presents meta tags in a structured, easy-to-read format.