Extract hyperlinks from HTML using regular expressions in Java


Two years ago, I worked on a crawler that fetched web pages from the internet, parsed the links on each page, and then visited all the pages it linked to. At that time I had no idea about regular expressions, so I had to write around 500 lines of code to parse links and meta tags from HTML.

Yesterday, I had to do the same job again. This time I used a regular expression to parse the HTML <a tags and their href attributes to extract the links.

Regular expressions can be difficult to understand when written all at once, so I am going to write it the easy way first, then make it more complex to support variations in page links.



1. Parsing page title using regular expression

First I parsed HTML titles from pages with this regex:

<title>(.*?)</title>

This is a very simple regular expression. It says: find the string <title>; then "." stands for any character (except a line break, by default), and * means the preceding item can repeat 0 or more times. The ? after * makes the repetition lazy, so it matches as few characters as possible and stops at the first </title> instead of the last one. </title> means the string </title> must follow. The parentheses define a group that captures whatever was matched between the two tags.
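As a minimal sketch of how this title expression could be used from Java (the class name TitleDemo and the method extractTitle are made up for illustration, and the sample page is an assumption):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleDemo {
    // Lazy group: capture as little as possible before the first </title>.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    // Returns the captured title, or null when the page has no <title> element.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints My page
    }
}
```

Because "." does not match line breaks by default, a title that spans several lines would need the Pattern.DOTALL flag.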


1.1 Match anything between <a> and </a>

<a>(.*?)</a>

Similarly, why not try this for the "a" tag?


1.2 Match anything in a tag

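What we did in the previous step only captures the link text of an a tag.

For <a href="http://www.dscripts.net">DSCRIPTS</a> the most it could give us is "DSCRIPTS" (and because it looks for a literal <a>, it does not even match a tag that carries attributes).

But to extract the real URL we must access the attributes of the a tag. So let's go a little bit deeper.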

<a(.*?)</a>

This would return anything between <a and </a>. But we actually need only what follows <a href.


1.3 Match <a href

<a href(.*?)</a>

This would match ="http://www.dscripts.net">DSCRIPTS


<a href=(.*?)</a>

This would match "http://www.dscripts.net">DSCRIPTS

1.4 Match both href attribute and link text

So far the match ends with the tail >DSCRIPTS. We need to remove it. We can do this by declaring another group: the first one will match the href attribute and the second one will match the link text.

<a href=(.*?)>(.*?)</a>

Here the first group will match the link "http://www.dscripts.net" (quotes included) and the second one will match DSCRIPTS.
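A small Java sketch can show what the two groups hold. LinkGroupsDemo and groups are hypothetical names used only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkGroupsDemo {
    // Group 1: everything after href= up to the first >; group 2: the link text.
    private static final Pattern LINK = Pattern.compile("<a href=(.*?)>(.*?)</a>");

    // Returns {group 1, group 2}, or null when nothing matches.
    static String[] groups(String html) {
        Matcher m = LINK.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("<a href=\"http://www.dscripts.net\">DSCRIPTS</a>");
        System.out.println(g[0]); // prints "http://www.dscripts.net" with the quotes
        System.out.println(g[1]); // prints DSCRIPTS
    }
}
```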


1.5 Match a href between quotes

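In the examples so far we didn't account for the quotes around the href attribute, so the match returned "http://www.dscripts.net" and not http://www.dscripts.net.

Well, this part is a little tricky, so let's first say what we are going to do.

We need to match one quote " after href=, followed by any characters, and then another ", right?

href=\"(.*)\"

Note: the quotes are escaped with a backslash \ only because the expression will end up inside a Java string; the regular expression itself does not require it.

Not exactly. If we say any character, then . will also match ", and a greedy .* will run straight past the closing quote of that tag. Instead it stops at the last quote it can reach: the last one on the line or, if . is allowed to match line breaks, the last quote of the last a tag in the document :( (A lazy .*? would stop at the first closing quote, but saying exactly which characters are allowed is clearer.)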

<a href="link1.html">LINK 1</a>

<a href="link2.html">LINK 2</a>

<a href="link3.html " >LINK 3</a>

If those three links sat on one line, a greedy match would run from link1.html all the way to the quote after link3.html. Do you really want that? :p

Surely the answer is no.

So in our expression we must say that no " may appear between the opening and closing quotes. Right?

Here we will write an expression that matches any character except those in a set we specify:

[^\"]

This defines a set that matches any single character other than ". Adding * lets it repeat any number of times:

[^\"]*

So we rewrite our expression as follows:

<a href=\"([^\"]*)\">(.*?)</a>
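To see the difference between "any character" and "any character except a quote", here is a rough Java comparison. QuotedHrefDemo and firstCapture are illustrative names, and the two links are placed on one line on purpose:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedHrefDemo {
    // Returns group 1 of the first match of regex in input, or null.
    static String firstCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"link1.html\">LINK 1</a> <a href=\"link2.html\">LINK 2</a>";

        // Greedy "any character": runs on to the last quote on the line.
        System.out.println(firstCapture("href=\"(.*)\"", html));

        // Negated set: stops at the first closing quote.
        System.out.println(firstCapture("href=\"([^\"]*)\"", html)); // prints link1.html
    }
}
```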


1.6 What if the a tag has some whitespace around?

So far we have used very neat HTML. The expression will match anything like

<a href="http://www.yoursite1.com">Site 1</a>

<a href="http://www.yoursite2.com">Site 2</a>

or anything similar.

But what about this?

<a href = "http://www.yoursite.com" >Site</a>

or

<a     href    =   "http://www.yoursite.com"    >Site</a>

This is not going to match :(

So we need to modify our expression to allow this whitespace.

A whitespace character is denoted by \s in regular expressions.

So we modify our expression to

<a\s+href\s*=\s*\"([^\"]*)\"\s*>(.*?)</a>

Now it can match these cases as well.
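A quick way to check a whitespace-tolerant expression is to run it against both the neat and the messy tags. This is only a sketch: WhitespaceDemo and extractHref are made-up names, and the pattern is written with \s+ after <a so that any whitespace (not just a single space) is accepted there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDemo {
    // \s+ after <a, optional whitespace around = and before >.
    private static final Pattern LINK =
            Pattern.compile("<a\\s+href\\s*=\\s*\"([^\"]*)\"\\s*>(.*?)</a>");

    // Returns the href value of the first link, or null when nothing matches.
    static String extractHref(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String messy = "<a     href    =   \"http://www.yoursite.com\"    >Site</a>";
        System.out.println(extractHref(messy)); // prints http://www.yoursite.com
    }
}
```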

 

1.7 What if the tag has some more attributes?

Surely you can't assume that an a tag will have only an href attribute! What if it looks like one of these?

<a class="my-class" href="http://www.dscirpt.net">DSCRIPTS</a>

<a id="HOME" class="my-class" href="http://www.dscirpt.net">DSCRIPTS</a>

<a class="my-class" href="http://www.dscirpt.net" id="HOME" >DSCRIPTS</a>

<a href="http://www.dscirpt.net" id="HOME" class="my-class" >DSCRIPTS</a>

 

In most cases you are going to face hyperlinks like these.

So we must also think about the other attributes before and after the href attribute. There can be many or none on either side. :s

So on both sides of href we allow any characters except >, which marks the end of the start tag and the beginning of the link text.

<a\s[^>]*href\s*=\s*\"([^\"]*)\"[^>]*>(.*?)</a>

 

As you can see, I added [^>]* on both sides of the href attribute to say that anything except > may appear there, since > ends the start tag. This way we skip over all other attributes, whether or not they exist, on both sides of the href attribute.

\s stands in for a literal " " (space) and matches any single whitespace character, so there must be one whitespace character immediately after <a.
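The four sample tags above can be tried against this expression with a few lines of Java. AttributesDemo and extractHref are hypothetical names for this sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributesDemo {
    // [^>]* on both sides of href skips any other attributes in the start tag.
    private static final Pattern LINK =
            Pattern.compile("<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>");

    // Returns the href value of the first link, or null when nothing matches.
    static String extractHref(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<a id=\"HOME\" class=\"my-class\" href=\"http://www.dscirpt.net\">DSCRIPTS</a>";
        System.out.println(extractHref(tag)); // prints http://www.dscirpt.net
    }
}
```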

 

Now take a look at the final expression once again…

 

 

<a            Must have <a
\s            Must have one whitespace
[^>]*         Can have anything except > and can happen 0 or more times (Any attribute)
href          Must have href
\s*           May have whitespace for 0 or more times
=             Must have =
\s*           May have whitespace for 0 or more times
\"            Must have one "
  (             Start of first group
    [^\"]*        Can have any character except " and can repeat 0 or more times
  )             End of first group
\"            Must have one "
[^>]*         Can have anything except > and can happen 0 or more times (Any attribute)
>             Must have >
  (             Start of second group
    .*?           Can have any character 0 or more times (as few as possible)
  )             End of second group
</a>          Must have end tag
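The same piece-by-piece breakdown can also live inside the Java source. The sketch below uses the Pattern.COMMENTS flag, which makes the regex engine ignore whitespace and treat everything from # to the end of the line as a comment. CommentedPatternDemo and firstLink are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPatternDemo {
    // Each line carries one piece of the expression plus its explanation.
    private static final Pattern LINK = Pattern.compile(
            "<a            # must have <a\n"
          + "\\s           # one whitespace character\n"
          + "[^>]*         # any other attributes\n"
          + "href          # must have href\n"
          + "\\s*=\\s*     # = with optional whitespace around it\n"
          + "\"([^\"]*)\"  # group 1: the quoted URL\n"
          + "[^>]*         # any other attributes\n"
          + ">(.*?)</a>    # group 2: the link text\n",
            Pattern.COMMENTS);

    // Returns {url, text} of the first link, or null when nothing matches.
    static String[] firstLink(String html) {
        Matcher m = LINK.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] link = firstLink("<a id=\"x\" href=\"http://www.dscripts.net\" class=\"c\">DSCRIPTS</a>");
        System.out.println(link[0] + " -> " + link[1]);
    }
}
```

This only works because the expression contains no literal spaces or # characters; in COMMENTS mode those would have to be escaped.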

 

 

Drawbacks

Although I tried to make it as efficient as possible, it already has several drawbacks. One of the most important is that it does not detect an href in single quotes:

<a href='http://www.dscripts.net'>DSCRIPTS</a>

It is also case sensitive, meaning it only detects lowercase tags and not

<A HREF='http://www.dscripts.net'>DSCRIPTS</a>

I detect the end of the attributes with >, but what about this?

<a class="broker-class" title="this will break > regular expression" href='http://www.dscripts.net'>DSCRIPTS</a>

These cases are still not clear to me. I hope to keep updating this expression as much as possible.
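One possible direction for these three cases, offered as a sketch rather than a finished answer: accept either quote character and require the same one to close the value (a backreference), turn on Pattern.CASE_INSENSITIVE, and skip quoted attribute values as whole units so that a > inside them does not end the tag. RobustLinkDemo, ATTRS and extractHref are hypothetical names, and regular expressions will still miss other HTML oddities such as unquoted values or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RobustLinkDemo {
    // One unit of "other attributes": a plain character that is not > or a quote,
    // or a complete double-quoted or single-quoted value (which may contain >).
    private static final String ATTRS = "(?:[^>\"']|\"[^\"]*\"|'[^']*')";

    // Group 1: the opening quote, group 2: the URL, group 3: the link text.
    // \1 requires the closing quote to be the same character as the opening one.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s" + ATTRS + "*?href\\s*=\\s*([\"'])(.*?)\\1" + ATTRS + "*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href value of the first link, or null when nothing matches.
    static String extractHref(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractHref("<A HREF='http://www.dscripts.net'>DSCRIPTS</a>"));
        System.out.println(extractHref(
                "<a class=\"broker-class\" title=\"this will break > regular expression\" "
              + "href='http://www.dscripts.net'>DSCRIPTS</a>"));
    }
}
```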

 

 

Regular Expressions in Java

Now that we have the expression, we are going to implement it in Java.

To use regular expressions in Java we mainly use two classes, Pattern and Matcher, from the java.util.regex package.

Pattern is responsible for creating a compiled object of the regular expression from a string, and Matcher is responsible for matching it against the input data.

Note: In regular expressions we use the backslash for special purposes. When writing a regular expression as a Java string we need to escape each of these backslashes with another backslash.
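To make the escaping rule concrete, here is a small sketch that writes one fragment of the expression as a Java string and prints it back. EscapingDemo, HREF_REGEX and hasHref are names invented for this example.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // The regex  href\s*=\s*"([^"]*)"  written as a Java string literal:
    // every regex backslash is doubled, and every double quote is escaped for Java.
    static final String HREF_REGEX = "href\\s*=\\s*\"([^\"]*)\"";

    // True when the input contains a double-quoted href attribute.
    static boolean hasHref(String input) {
        return Pattern.compile(HREF_REGEX).matcher(input).find();
    }

    public static void main(String[] args) {
        // Printing the string shows the expression as the regex engine sees it.
        System.out.println(HREF_REGEX); // prints href\s*=\s*"([^"]*)"
        System.out.println(hasHref("<a href = \"page.html\">")); // prints true
    }
}
```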

Here is a simple parser class used to parse links from a web page. I used a HashMap instead of an array because I do not need repeated links from the page. I also used the HashMap to count the occurrences of each link. Note that the expression in the class also accepts single-quoted or unquoted href values.

package crawler;

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Parses the links out of an HTML page and counts how often each one occurs.
 *
 * @author burhan
 */
public class parser {
    // regular expression for parsing links; the href value may be wrapped in
    // double quotes, single quotes, or no quotes at all
    private String regex_links = "<a\\s[^>]*href\\s*=\\s*[\"']?([^\"' ]*)[\"']?[^>]*>(.*?)</a>";
    // hash map to store the links and how many times each one was seen
    private HashMap<String, Integer> link_map = new HashMap<>();

    public void parse(String data) {
        // create pattern object
        Pattern p = Pattern.compile(regex_links);
        // create matcher object
        Matcher m = p.matcher(data);

        String link = null;
        link_map = new HashMap<>();

        // search the input string
        while (m.find()) {
            // the link is in group 1
            link = m.group(1);
            // check if the hash map already contains the link or not
            if (link_map.containsKey(link))
                link_map.put(link, link_map.get(link) + 1); // set count +1
            else
                link_map.put(link, 1); // set count 1
        }
    }

    // returns the links
    public HashMap<String, Integer> getLinks() {
        return link_map;
    }

}
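For trying the idea out without the crawler package, here is a compact, self-contained variant of the same counting logic. It is only a sketch: LinkCounterDemo and countLinks are hypothetical names, and Map.merge is used in place of the containsKey/put branch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkCounterDemo {
    // Same idea as the parser class: href value in double, single, or no quotes.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']?([^\"' ]*)[\"']?[^>]*>(.*?)</a>");

    // Returns each distinct link with the number of times it occurs in the page.
    static Map<String, Integer> countLinks(String html) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // merge() stores 1 for a new link and adds 1 to an existing count.
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <a href='b.html'>B</a> "
                    + "<a class=\"x\" href=\"a.html\">A again</a>";
        Map<String, Integer> counts = countLinks(html);
        System.out.println(counts.get("a.html")); // prints 2
        System.out.println(counts.get("b.html")); // prints 1
    }
}
```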
