Given a url startUrl and an interface HtmlParser, implement a multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.
Your crawler should:
- Start from the page startUrl.
- Call HtmlParser.getUrls(url) to get all urls from a webpage of given url.
- Do not crawl the same link twice.
- Explore only the links that are under the same hostname as startUrl.
For example, the hostname of the url http://example.org/test is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while the urls http://example.org/test and http://example.com/abc are not under the same hostname.
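Under these assumptions, extracting the hostname only requires stripping the leading "http://" and cutting at the first '/'. A minimal Java sketch of such a helper (the class and method names are my own, not part of the problem's API):

class HostnameUtil {
    // Assumes every url starts with "http://" and carries no port, as the problem guarantees.
    static String getHostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }
}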
The HtmlParser interface is defined as such:

interface HtmlParser {
  // Return a list of all urls from a webpage of given url.
  // This is a blocking call, that means it will do an HTTP request and return when this request is finished.
  public List<String> getUrls(String url);
}
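For illustration, a plain single-threaded BFS over this interface could look like the sketch below; it shows how getUrls is consumed, though, as noted further down, a single-threaded crawl is expected to be too slow. The class name and the getHostname helper are assumptions of this sketch, not part of the given API.

import java.util.*;

class SingleThreadedCrawler {
    // Simple BFS sketch: visit each url once, only following links on the start hostname.
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(startUrl);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : htmlParser.getUrls(url)) {   // blocking call
                if (getHostname(next).equals(host) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    private String getHostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }
}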
Note that getUrls(String url) simulates performing an HTTP request. You can treat it as a blocking function call which waits for an HTTP request to finish. It is guaranteed that getUrls(String url) will return the urls within 15ms. Single-threaded solutions will exceed the time limit, so can your multi-threaded web crawler do better?
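One common way to make the blocking getUrls calls overlap is to run each of them on a thread pool and deduplicate urls with a concurrent set. The sketch below is one possible approach under those assumptions; the pool size, the use of CompletableFuture, and all names other than HtmlParser.getUrls are choices of this sketch, not requirements of the problem.

import java.util.*;
import java.util.concurrent.*;

class MultiThreadedCrawler {
    // One possible multi-threaded approach: every blocking getUrls call runs on a
    // fixed thread pool, and a concurrent set deduplicates urls across threads.
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        seen.add(startUrl);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            crawlFrom(startUrl, host, htmlParser, seen, pool).join();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(seen);
    }

    private CompletableFuture<Void> crawlFrom(String url, String host, HtmlParser htmlParser,
                                              Set<String> seen, ExecutorService pool) {
        return CompletableFuture
            .supplyAsync(() -> htmlParser.getUrls(url), pool)   // blocking HTTP call runs on the pool
            .thenCompose(links -> {
                List<CompletableFuture<Void>> children = new ArrayList<>();
                for (String next : links) {
                    // seen.add returns false if another thread already claimed this url
                    if (getHostname(next).equals(host) && seen.add(next)) {
                        children.add(crawlFrom(next, host, htmlParser, seen, pool));
                    }
                }
                return CompletableFuture.allOf(children.toArray(new CompletableFuture[0]));
            });
    }

    private String getHostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }
}

In this sketch the only blocking join happens on the caller's thread, so pool threads never wait on each other, which avoids deadlocking the fixed-size pool.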
Below are two examples explaining the functionality of the problem. For custom testing purposes you'll have three variables: urls, edges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.
Example 1:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]
Example 2:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.
Constraints:
- 1 <= urls.length <= 1000
- 1 <= urls[i].length <= 300
- startUrl is one of the urls.