DWITE Online Computer Programming Contest

Don’t follow my links

January 2008
Problem 3

There’s a lot of spam on the Internet – blog comments, forum posts, etc. – all posted to plant enough links to make search engines such as Google think that a certain page is more important than it really is. One solution is to mark untrusted links with a rel="nofollow" attribute, which tells spiders to ignore the link. A sample link might look like this:

<a href="http://compsci.ca/" title="Computer Science Canada" rel="nofollow">sample link</a>

The goal is to write a program that finds all the links in a text file and adds the nofollow value to each one. If a link has no rel attribute, insert rel="nofollow" as its last attribute. If a rel attribute already exists, leave it in place and append nofollow as the last value in its string, unless nofollow is already there. A rel attribute can hold multiple space-separated values. Refer to the sample input for examples.
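The rule above can be sketched in Python as follows. This is only one possible approach, not an official solution. The names add_nofollow and ATTR_RE are made up for illustration, and the sketch assumes that every attribute value is double-quoted (as in the samples) and that the function receives just the opening tag, ending in ">".

```python
import re

# Assumption: attributes always look like name="value" with double quotes.
ATTR_RE = re.compile(r'([A-Za-z-]+)="([^"]*)"')


def add_nofollow(open_tag: str) -> str:
    """Return the opening <a ...> tag with nofollow present in rel."""
    found_rel = False

    def fix_attribute(match):
        nonlocal found_rel
        name, value = match.group(1), match.group(2)
        if name.lower() != "rel":
            return match.group(0)  # leave every other attribute untouched
        found_rel = True
        values = value.split()  # rel holds space-separated values
        if "nofollow" not in values:
            values.append("nofollow")  # nofollow goes last
        return name + '="' + " ".join(values) + '"'

    result = ATTR_RE.sub(fix_attribute, open_tag)
    if not found_rel:
        # No rel attribute at all: add one as the last attribute.
        result = result[:-1] + ' rel="nofollow">'
    return result
```

Walking through whole attributes, rather than searching for the text rel= directly, avoids being fooled by a URL or title that happens to contain that text.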

The input file DATA3.txt will contain five lines of text, each containing one link in the form <a*>*</a>. Links might be surrounded by filler text. Each line will be no more than 255 characters long.

The output file OUT3.txt will contain five lines – just the links, modified as needed, without any of the surrounding filler text.
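Since only the link itself is printed, the filler text has to be stripped first. A minimal sketch of that step, assuming that no attribute value contains a ">" character and that each line holds exactly one link (split_link and LINK_RE are hypothetical names):

```python
import re

# "<a", optionally followed by whitespace and attributes, then ">",
# then the shortest run of text that ends in "</a>".
LINK_RE = re.compile(r'(<a(?:\s[^>]*)?>)(.*?</a>)')


def split_link(line: str):
    """Return (opening tag, link text plus closing tag) for the link in line."""
    match = LINK_RE.search(line)
    if match is None:
        raise ValueError("no link found in line")
    return match.group(1), match.group(2)
```

For example, split_link('This is a <a>sample link</a>.') gives ('<a>', 'sample link</a>'). Only the first part needs to be modified; the second part can be written back unchanged right after it.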

Sample Input:
This is a <a>sample link</a>.
<a rel="" href="http://dwite.ca/">link with rel</a>
<a href="http://compsci.ca/" rel="nofollow">link with no follow</a>
<a href="http://compsci.ca/blog" rel="external">more rels</a>
text <a href="http://compsci.ca/v3/viewforum.php?f=131" title="">link</a> more text

Sample Output:
<a rel="nofollow">sample link</a>
<a rel="nofollow" href="http://dwite.ca/">link with rel</a>
<a href="http://compsci.ca/" rel="nofollow">link with no follow</a>
<a href="http://compsci.ca/blog" rel="external nofollow">more rels</a>
<a href="http://compsci.ca/v3/viewforum.php?f=131" title="" rel="nofollow">link</a>