When CSS selectors fail, XPath is your secret weapon.
CSS selectors are great for styling and basic scraping, but they have limits. They can't select elements based on text content (e.g., "find the button that says 'Next'"). They can't traverse up the DOM tree to find a parent element. They struggle with complex logical conditions.
XPath (XML Path Language) has none of these limitations. It allows you to navigate the entire document structure with surgical precision.
This cheat sheet is designed for no-code users, data analysts, and scraping enthusiasts who need to handle complex extraction scenarios without becoming developers.
Why Learn XPath?
If you're using tools like Lection, standard CSS selectors usually work fine. But for roughly 10% of tough scraping jobs, XPath is the only solution.
1. Text-Based Selection: XPath can find elements containing specific text.
- CSS: Impossible.
- XPath:
//button[text()='Submit']
2. Reverse Navigation: XPath can find a child element and then select its parent.
- CSS: Impossible (until :has() is fully supported, but even then limited).
- XPath:
//span[@class='price']/parent::div
3. Flexible Logic: XPath supports and, or, and not logic.
- CSS: Limited to chaining.
- XPath:
//div[contains(@class, 'product') and not(contains(@class, 'out-of-stock'))]
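Here is a minimal sketch of those three capabilities running against a tiny page. It assumes Python with the third-party lxml library (neither is part of the original cheat sheet), and the sample markup is invented for illustration. The third query matches the class with contains() so that a multi-class value such as "product out-of-stock" is actually evaluated.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body>
  <div class="product"><span class="price">$10</span></div>
  <div class="product out-of-stock"><span class="price">$20</span></div>
  <button>Cancel</button>
  <button>Submit</button>
</body></html>
"""

doc = html.fromstring(PAGE)

# 1. Text-based selection: the button whose text is exactly "Submit".
submit_buttons = doc.xpath("//button[text()='Submit']")

# 2. Reverse navigation: start at each price span, then step up to its parent div.
price_parents = doc.xpath("//span[@class='price']/parent::div")

# 3. Logic: product divs whose class value does not mention "out-of-stock".
in_stock = doc.xpath(
    "//div[contains(@class, 'product') and not(contains(@class, 'out-of-stock'))]"
)

print([b.text for b in submit_buttons])
print([d.get("class") for d in price_parents])
print([d.findtext("span") for d in in_stock])
```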
XPath Basics
An XPath is just a path to an element, like a file path on your computer.
| Symbol | Name | Description |
|---|---|---|
| / | Root | Starts selection from the document root (absolute path). |
| // | Descendant | Selects elements anywhere in the document (relative path). Use this 99% of the time. |
| . | Current | Selects the current node. |
| .. | Parent | Selects the parent of the current node. |
| @ | Attribute | Selects an attribute (e.g., @href, @class). |
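The symbols above are easiest to see when you loop over repeated blocks. The sketch below assumes Python with the third-party lxml library and uses made-up markup. Note how a path that starts with . stays inside the current card, while a path that starts with // searches the whole document even when you call it on a single element.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body>
  <div class="card"><a href="/a">Alpha</a><span class="price">$1</span></div>
  <div class="card"><a href="/b">Beta</a><span class="price">$2</span></div>
</body></html>
"""

doc = html.fromstring(PAGE)

rows = []
for card in doc.xpath("//div[@class='card']"):
    rows.append({
        "name": card.xpath("./a/text()")[0],    # . = the current card
        "url": card.xpath("./a/@href")[0],      # @ = an attribute value
        "price": card.xpath(".//span[@class='price']/text()")[0],
    })

first_card = doc.xpath("//div[@class='card']")[0]
scoped = first_card.xpath(".//span/text()")    # spans inside this card only
unscoped = first_card.xpath("//span/text()")   # spans anywhere in the document
parent_tag = first_card.xpath("..")[0].tag     # .. = the card's parent element

print(rows)
print(scoped, unscoped, parent_tag)
```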
Absolute vs. Relative Paths
Avoid absolute paths. They are brittle and break if the website changes a single <div>.
- ❌ Bad (Absolute):
/html/body/div[2]/div[4]/table/tbody/tr/td[3]
- ✅ Good (Relative):
//td[@class='price']
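To see the brittleness in action, the sketch below runs one absolute and one relative path against two versions of the same page, where the second version only adds a banner div at the top. It assumes Python with the third-party lxml library. The markup and both paths are invented for illustration.

```python
from lxml import html

# Hypothetical paths and markup, invented for this illustration.
ABSOLUTE = "/html/body/div[1]/span[2]"
RELATIVE = "//span[@class='price']"

BEFORE = (
    "<html><body>"
    "<div><span>Widget</span><span class='price'>$5</span></div>"
    "</body></html>"
)
# Same page after the site adds one banner div above the product.
AFTER = (
    "<html><body>"
    "<div class='banner'>Sale!</div>"
    "<div><span>Widget</span><span class='price'>$5</span></div>"
    "</body></html>"
)


def texts(page, xpath):
    """Return the text of every element the XPath matches in the page."""
    return [el.text for el in html.fromstring(page).xpath(xpath)]


for name, page in (("before", BEFORE), ("after", AFTER)):
    print(name, "absolute:", texts(page, ABSOLUTE), "relative:", texts(page, RELATIVE))
```

The absolute path stops matching after the change because div[1] is now the banner, while the relative path keeps working.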
1. Selecting by Attributes
Target elements based on their ID, class, or other attributes.
| Goal | XPath | Explanation |
|---|---|---|
| By ID | //*[@id='main-content'] | finds any element with id="main-content" |
| By Class | //div[@class='post'] | finds divs with class="post" |
| By Multiple Classes | //div[contains(@class, 'product')] | finds divs where class contains "product" (handles "product featured") |
| By Title | //a[@title='Home'] | finds links with tooltip title "Home" |
| By Link URL | //a[@href='/login'] | finds links pointing to "/login" |
Pro Tip: Use contains(@class, 'name') instead of @class='name'. CSS classes often have multiple values (e.g., class="btn btn-primary"). An exact match for 'btn' would fail, but contains works.
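One caveat worth knowing: contains() is a plain substring test, so 'btn' also matches a class like "btn-group". The sketch below compares the exact match, the contains() match, and a common XPath 1.0 idiom that matches a whole class token. It assumes Python with the third-party lxml library, and the markup is made up.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body>
  <a class="btn btn-primary">Buy</a>
  <a class="btn-group">Menu</a>
  <a class="btn">Ok</a>
</body></html>
"""

doc = html.fromstring(PAGE)

# Exact match: only elements whose class attribute is exactly "btn".
exact = doc.xpath("//a[@class='btn']/text()")

# Substring match: also catches "btn-group", which may not be what you want.
substring = doc.xpath("//a[contains(@class, 'btn')]/text()")

# Whole-token match: pad the class value with spaces and look for " btn ".
token = doc.xpath(
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]/text()"
)

print(exact, substring, token)
```

The token version is longer, but it behaves like the CSS selector a.btn.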
2. Selecting by Text Content (The Superpower)
This is why you use XPath: finding elements based on what they say.
| Goal | XPath | Explanation |
|---|---|---|
| Exact Text | //button[text()='Add to Cart'] | finds buttons with exact text "Add to Cart" |
| Exact Text (Heading) | //h2[text()='Features'] | finds headings saying "Features" |
| Contains Text | //div[contains(text(), 'Error')] | finds divs containing the word "Error" (e.g., "Error: 404") |
| Starts With | //p[starts-with(text(), 'Note: ')] | finds paragraphs starting with "Note: " |
Real-World Example: Finding specific buttons
You want to click the "Next Page" button, but it has no ID or clear class.
//button[contains(text(), 'Next')]
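Text matching has a few subtleties once buttons contain extra whitespace or nested tags. In XPath 1.0, contains(text(), ...) only looks at the element's first direct text node, while contains(., ...) looks at all the text inside the element. The sketch below assumes Python with the third-party lxml library. The buttons and their ids are invented so the results are easy to read.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body>
  <button id="b1">Next</button>
  <button id="b2">  Next Page </button>
  <button id="b3"><b>Next</b> results</button>
  <button id="b4">Previous</button>
</body></html>
"""

doc = html.fromstring(PAGE)


def ids(xpath):
    """Return the id of every button the XPath matches."""
    return [el.get("id") for el in doc.xpath(xpath)]


exact = ids("//button[text()='Next']")                   # exact text only
first_text = ids("//button[contains(text(), 'Next')]")   # first direct text node
any_text = ids("//button[contains(., 'Next')]")          # all text, nested tags too
trimmed = ids("//button[normalize-space()='Next Page']") # ignores padding whitespace

print(exact, first_text, any_text, trimmed)
```

If a text match unexpectedly returns nothing, switching text() to . or wrapping the comparison in normalize-space() is usually the first thing to try.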
3. Navigating the Hierarchy (Axes)
Move up, down, or sideways in the HTML tree.
| Goal | XPath | Explanation |
|---|---|---|
| Parent | //span[@class='price']/.. | finds the parent element of the price span |
| Ancestor | //a/ancestor::div[@class='card'] | finds the div.card containing the link |
| Following Sibling | //h2[text()='Description']/following-sibling::p | finds paragraphs after the Description heading |
| Preceding Sibling | //td[text()='Total']/preceding-sibling::td | finds all cells before the Total cell (add [1] to get only the nearest one) |
Real-World Example: Extracting data relative to a label
A table shows "Email: user@example.com" but "Email" is in one cell and the address in the next.
//td[text()='Email:']/following-sibling::td[1]
This finds the "Email:" cell and grabs the immediately following cell.
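The label-then-value pattern can be wrapped in a small helper. This is a sketch assuming Python with the third-party lxml library. The function name value_for and the table contents are hypothetical. It passes the label as an XPath variable ($label), which lxml supports through keyword arguments, so you don't have to worry about quotes inside the label.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body><table>
  <tr><td>Name:</td><td>Ada</td></tr>
  <tr><td>Email:</td><td>user@example.com</td><td>verified</td></tr>
</table></body></html>
"""

doc = html.fromstring(PAGE)


def value_for(label):
    """Return the text of the cell right after the cell whose text equals label."""
    cells = doc.xpath("//td[text()=$label]/following-sibling::td[1]", label=label)
    return cells[0].text if cells else None


email = value_for("Email:")
missing = value_for("Phone:")  # no such label on this page

# Without [1], a sibling axis returns every matching sibling, not just the nearest.
all_before = doc.xpath("//td[text()='verified']/preceding-sibling::td/text()")
nearest_before = doc.xpath("//td[text()='verified']/preceding-sibling::td[1]/text()")

print(email, missing)
print(all_before, nearest_before)
```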
4. Logic & Filtering
Combine conditions to make your selectors bulletproof.
| Operator | XPath | Explanation |
|---|---|---|
| AND | //input[@type='text' and @name='user'] | matches both conditions |
| OR | //button[text()='Login' or text()='Sign In'] | matches either text |
| NOT | //div[contains(@class, 'product') and not(contains(@class, 'sold'))] | matches products that are NOT sold out |
| Position | //li[1] | first item in each list (use (//li)[1] for the first item on the whole page) |
| Last | //li[last()] | last item in each list |
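The position predicates are the ones that most often surprise people, because //li[1] is evaluated per parent rather than once per page. Here is a quick sketch of the operators from the table, assuming Python with the third-party lxml library and a made-up page with two lists.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li><li>b3</li></ul>
  <input type="text" name="user">
  <input type="text" name="email">
  <input type="password" name="pass">
  <button>Login</button><button>Sign In</button><button>Help</button>
</body></html>
"""

doc = html.fromstring(PAGE)

first_each = doc.xpath("//li[1]/text()")        # first li inside every list
first_overall = doc.xpath("(//li)[1]/text()")   # first li in the whole document
last_each = doc.xpath("//li[last()]/text()")    # last li inside every list

both = doc.xpath("//input[@type='text' and @name='user']/@name")
either = doc.xpath("//button[text()='Login' or text()='Sign In']/text()")

print(first_each, first_overall, last_each)
print(both, either)
```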
Common Scraping Scenarios
Scenario A: The "Load More" Button
The button usually says "Load More" or "Show More".
//button[contains(text(), 'More')]
Scenario B: Data Hidden in Attributes
Product rating is visually "4 stars" but the code is <span aria-label="4.5 out of 5 stars"></span>.
//span[contains(@aria-label, 'stars')]
You would then extract the aria-label attribute content.
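In code, you can pull the attribute value directly by ending the path with /@aria-label. The sketch below assumes Python with the third-party lxml library. It reuses the span from the scenario, and the step that turns the label into a number assumes the value always comes first in the label, which may not hold on a real site.

```python
from lxml import html

# The span from the scenario, wrapped in a minimal page.
PAGE = '<html><body><span aria-label="4.5 out of 5 stars"></span></body></html>'

doc = html.fromstring(PAGE)

# Ending the path with /@aria-label returns the attribute values as strings.
labels = doc.xpath("//span[contains(@aria-label, 'stars')]/@aria-label")

# Assumption: the numeric rating is the first word of the label.
rating = float(labels[0].split()[0])

print(labels, rating)
```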
Scenario C: Tables without Classes
A generic table. You want the second column of every row.
//table/tbody/tr/td[2]
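One thing to watch with this path: browsers add a tbody element to the DOM even when the page's raw HTML has none, so a path copied from DevTools can come back empty in a tool that parses the raw source. The sketch below assumes Python with the third-party lxml library, which does not insert tbody. The table contents are made up, and //table//tr/td[2] is offered as a more tolerant variant, not as the only fix.

```python
from lxml import html

# Hypothetical table rows, invented for this illustration.
ROWS = "<tr><td>Widget</td><td>$5</td></tr><tr><td>Gadget</td><td>$9</td></tr>"

RAW_SOURCE = f"<html><body><table>{ROWS}</table></body></html>"
BROWSER_DOM = f"<html><body><table><tbody>{ROWS}</tbody></table></body></html>"

STRICT = "//table/tbody/tr/td[2]"   # requires a tbody element
TOLERANT = "//table//tr/td[2]"      # works with or without tbody


def column(page, xpath):
    """Return the text of every cell the XPath matches in the page."""
    return [td.text for td in html.fromstring(page).xpath(xpath)]


for name, page in (("raw source", RAW_SOURCE), ("browser DOM", BROWSER_DOM)):
    print(name, "strict:", column(page, STRICT), "tolerant:", column(page, TOLERANT))
```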
Testing Your XPath
You don't need a special tool to test XPath. Your browser does it.
- Open Chrome.
- Right-click the page > Inspect.
- Press Cmd+F (Mac) or Ctrl+F (Windows) within the Elements panel.
- Paste your XPath.
- Chrome will highlight the matching elements.
XPath vs No-Code
Writing XPath is powerful, but it's still manual work. It breaks if the text changes (e.g., "Log In" becomes "Login").
Modern no-code tools like Lection abstract this away. Lection's AI analyzes the page visually. It understands that a button in the top right is likely a "Login" button regardless of its text or ID.
However, having XPath skills in your back pocket gives you superpowers. When the AI is confused by a messy page, or you need extremely specific data logic (like "only products with 'Pro' in the title that cost less than $50"), an XPath snippet is often the cleanest solution.
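As a rough illustration of that "Pro under $50" rule, the sketch below expresses it as a single XPath. It assumes Python with the third-party lxml library and an invented product layout (an h3 title and a span with class "price" inside each product div). A real site will need its own element names. The translate() call strips the dollar sign and any thousands separators so number() can parse the price.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
PAGE = """
<html><body>
  <div class="product"><h3>Widget Pro</h3><span class="price">$49.99</span></div>
  <div class="product"><h3>Widget Pro Max</h3><span class="price">$89.00</span></div>
  <div class="product"><h3>Widget Lite</h3><span class="price">$19.00</span></div>
</body></html>
"""

doc = html.fromstring(PAGE)

# Each predicate narrows the previous result: product, then title, then price.
QUERY = (
    "//div[contains(@class, 'product')]"
    "[contains(h3, 'Pro')]"
    "[number(translate(span[@class='price'], '$,', '')) < 50]"
    "/h3/text()"
)

cheap_pro = doc.xpath(QUERY)
print(cheap_pro)
```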
Summary Checklist
- Always use relative paths (//).
- Prefer ID or specific Class names first.
- Use contains() for partial matches (more robust).
- Use text() when classes are generic.
- Test in Chrome DevTools using Cmd+F.
Ready to test your skills? Install Lection and try building a custom definition for a site you use every day.