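The walkthrough below is based on a self-compiled and self-configured build of Nutch 2.2.1.
After the build, the bin directory contains two scripts, nutch and crawl. Running either one on the command line without arguments prints its usage: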

    $ nutch
    Usage: nutch COMMAND
    where COMMAND is one of:
     inject         inject new urls into the database
     hostinject     creates or updates an existing host table from a text file
     generate       generate new batches to fetch from crawl db
     fetch          fetch URLs marked during generate
     parse          parse URLs marked during fetch
     updatedb       update web table after parsing
     updatehostdb   update host table after parsing
     readdb         read/dump records from page database
     readhostdb     display entries from the hostDB
     elasticindex   run the elasticsearch indexer
     solrindex      run the solr indexer on parsed batches
     solrdedup      remove duplicates from solr
     parsechecker   check the parser for a given url
     indexchecker   check the indexing filters for a given url
     plugin         load a plugin and run one of its classes main()
     nutchserver    run a (local) Nutch server on a user defined port
     junit          runs the given JUnit test
     or
     CLASSNAME      run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.


    $ crawl
    Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

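In Nutch 2.x the commands that make up the crawl workflow have been consolidated into the crawl script, so a single crawl invocation runs the whole cycle and you no longer have to run inject, generate, fetch, parse, and so on one after another as in older versions. Beginners can still run the individual commands in order and watch how the data changes after each step. The rest of this post walks through that process step by step, using my own blog as the site to crawl.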
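As a rough illustration of what the crawl script automates, the following Python sketch chains the five commands used in the rest of this post. It is not the actual crawl script, which is a shell script that also takes a Solr URL and a number of rounds. The function name `crawl_once`, the injectable `run` callable, and the assumption that the generate step prints its "generated batch id" line to the captured output are all illustrative.

```python
import re
from typing import Callable, List

# Matches the line "GeneratorJob: generated batch id: <id>" printed by generate.
BATCH_ID_LINE = re.compile(r"generated batch id:\s*(\S+)")


def crawl_once(run: Callable[[List[str]], str], seed_dir: str, crawl_id: str,
               top_n: int = 5, threads: int = 10) -> str:
    """Run one inject/generate/fetch/parse/updatedb cycle and return the batch id.

    `run` is a hypothetical helper that executes one command line and
    returns its console output as text.
    """
    run(["nutch", "inject", seed_dir, "-crawlId", crawl_id])

    generate_log = run(["nutch", "generate", "-topN", str(top_n), "-crawlId", crawl_id])
    match = BATCH_ID_LINE.search(generate_log)
    if match is None:
        raise RuntimeError("generate did not report a batch id")
    batch_id = match.group(1)

    # fetch and parse operate on the batch that generate just marked.
    run(["nutch", "fetch", batch_id, "-crawlId", crawl_id, "-threads", str(threads)])
    run(["nutch", "parse", batch_id, "-crawlId", crawl_id])
    run(["nutch", "updatedb", "-crawlId", crawl_id])
    return batch_id
```

With a real installation, `run` could be something like `lambda argv: subprocess.run(argv, check=True, capture_output=True, text=True).stdout`, provided nutch is on the PATH and writes its log lines to standard output.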
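[Preparation]: create the URLs to crawl
- First start HBase (this walkthrough runs in standalone mode)
- mkdir -p urls
- cd urls
- touch seed.txt
- echo 'http://micmiu.com' > seed.txt
- cd .. (the nutch commands below refer to the seed directory as urls, so run them from its parent directory)
After each of the steps below you can inspect how the data in HBase has changed.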
[Step 1]: inject

    $ nutch inject urls -crawlId micmiublog
    InjectorJob: starting at 2015-01-12 09:42:46
    InjectorJob: Injecting urlDir: urls
    2015-01-12 09:42:47.096 java[14509:4735452] Unable to load realm info from SCDynamicStore
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 1

Check the data in HBase:

    hbase(main):016:0> scan 'micmiublog_webpage'
    ROW                COLUMN+CELL
     com.micmiu:http/  column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
     com.micmiu:http/  column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
     com.micmiu:http/  column=mk:_injmrk_, timestamp=1421026970740, value=y
     com.micmiu:http/  column=mk:dist, timestamp=1421026970740, value=0
     com.micmiu:http/  column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
     com.micmiu:http/  column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
    1 row(s) in 0.1010 seconds

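Several of the values in this scan are raw bytes rather than text. As a hedged illustration, the Python sketch below converts the shell's `\xNN` notation back into bytes and decodes them, assuming the values are stored as big-endian Java primitives: a 4-byte int for f:fi, an 8-byte long of epoch milliseconds for f:ts, and a 4-byte float for s:s. That layout is an assumption based on how the numbers come out, and the helper name `shell_bytes` is made up for this example.

```python
import re
import struct
from datetime import datetime, timedelta, timezone

# The HBase shell prints non-printable bytes as \xNN and the rest as plain ASCII.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def shell_bytes(text: str) -> bytes:
    """Turn a value copied from the HBase shell back into raw bytes."""
    out = bytearray()
    # re.split with one capture group alternates: literal, hex, literal, hex, ...
    for index, part in enumerate(_HEX_ESCAPE.split(text)):
        if index % 2:
            out.append(int(part, 16))
        else:
            out.extend(part.encode("latin-1"))
    return bytes(out)


fetch_interval = struct.unpack(">i", shell_bytes(r"\x00'\x8D\x00"))[0]
fetch_time_ms = struct.unpack(">q", shell_bytes(r"\x00\x00\x01J\xDB\xCE\xBC\xF2"))[0]
score = struct.unpack(">f", shell_bytes(r"?\x80\x00\x00"))[0]

# Build the datetime from whole seconds plus milliseconds to avoid float rounding.
fetch_time = (datetime.fromtimestamp(fetch_time_ms // 1000, tz=timezone.utc)
              + timedelta(milliseconds=fetch_time_ms % 1000))

print(fetch_interval)  # 2592000 seconds, i.e. 30 days
print(fetch_time)      # 2015-01-12 01:42:46.770000+00:00
print(score)           # 1.0
```

Read this way, f:ts holds the time of the inject run (01:42:46 UTC is 09:42:46 on a machine at UTC+8), f:fi a 30-day fetch interval, and s:s an initial score of 1.0.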
[Step 2]: generate

    $ nutch generate -topN 5 -crawlId micmiublog
    GeneratorJob: starting at 2015-01-12 09:47:09
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: true
    GeneratorJob: normalizing: true
    GeneratorJob: topN: 5
    2015-01-12 09:47:09.822 java[14533:4744993] Unable to load realm info from SCDynamicStore
    GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03
    GeneratorJob: generated batch id: 1421027229-1374349927

Check the data in HBase:

    hbase(main):018:0> scan 'micmiublog_webpage'
    ROW                COLUMN+CELL
     com.micmiu:http/  column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/  column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
     com.micmiu:http/  column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
     com.micmiu:http/  column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/  column=mk:_injmrk_, timestamp=1421026970740, value=y
     com.micmiu:http/  column=mk:dist, timestamp=1421026970740, value=0
     com.micmiu:http/  column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
     com.micmiu:http/  column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
    1 row(s) in 0.0580 seconds

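One way to see exactly what a step changed is to compare the column sets of two scans. The sketch below is a small, hypothetical helper (the name `scan_columns` and the regular expression are mine, not part of Nutch or HBase) that parses scan output laid out one cell per line, as the shell prints it, and reports the columns that generate added. It assumes column names contain no comma.

```python
import re
from typing import Set, Tuple

# One cell per line: "<row> column=<family:qualifier>, timestamp=<ms>, value=<bytes>"
CELL = re.compile(r"^\s*(\S+)\s+column=([^,]+), timestamp=\d+, value=")


def scan_columns(scan_output: str) -> Set[Tuple[str, str]]:
    """Return the (row, column) pairs found in HBase shell scan output."""
    cells = set()
    for line in scan_output.splitlines():
        match = CELL.match(line)
        if match:
            cells.add((match.group(1), match.group(2)))
    return cells


# Abbreviated copies of the scans taken after inject and after generate.
after_inject = r"""
 com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
 com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
"""
after_generate = r"""
 com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
 com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
 com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
 com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y
"""

added = scan_columns(after_generate) - scan_columns(after_inject)
print(sorted(column for _, column in added))  # ['f:bid', 'mk:_gnmrk_']
```

Both new columns carry the batch id, so generate appears to do nothing more to the row than tag it with the batch it belongs to.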
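[Step 3]: fetch
Note: the batch id printed in the GeneratorJob log of the previous step is the value to pass as the batchId argument of the fetch command below.
It can also be looked up in HBase: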

    hbase(main):025:0> get 'micmiublog_webpage','com.micmiu:http/',{COLUMNS => 'f:bid'}
    COLUMN  CELL
     f:bid  timestamp=1421027232815, value=1421027229-1374349927
    1 row(s) in 0.0060 seconds

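The batch id itself appears to have two parts. Judging from this run, the number before the hyphen is the generation time in epoch seconds, and the number after it looks like a random integer. That reading is an inference from the output above, not documented behaviour; the function `split_batch_id` below is a hypothetical helper that makes the check explicit.

```python
from datetime import datetime, timezone
from typing import Tuple


def split_batch_id(batch_id: str) -> Tuple[datetime, int]:
    """Split '<epoch seconds>-<number>' into a UTC datetime and the trailing number."""
    seconds, _, suffix = batch_id.partition("-")
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc), int(suffix)


generated_at, suffix = split_batch_id("1421027229-1374349927")
print(generated_at)  # 2015-01-12 01:47:09+00:00
print(suffix)        # 1374349927
```

01:47:09 UTC lines up with the 09:47:09 start time in the GeneratorJob log if the machine's clock is at UTC+8.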
Now run the fetch command:

    $ nutch fetch 1421027229-1374349927 -crawlId micmiublog -threads 10
    FetcherJob: starting
    FetcherJob: batchId: 1421027229-1374349927
    FetcherJob: threads: 10
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob : timelimit set for : -1
    2015-01-12 09:49:37.095 java[14546:4753667] Unable to load realm info from SCDynamicStore
    Using queue mode : byHost
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records. Hit by time limit :0
    fetching http://micmiu.com/ (queue crawl delay=5000ms)
    -finishing thread FetcherThread1, activeThreads=1
    -finishing thread FetcherThread2, activeThreads=1
    -finishing thread FetcherThread3, activeThreads=1
    -finishing thread FetcherThread4, activeThreads=1
    -finishing thread FetcherThread5, activeThreads=1
    -finishing thread FetcherThread6, activeThreads=1
    -finishing thread FetcherThread7, activeThreads=1
    -finishing thread FetcherThread8, activeThreads=1
    Fetcher: throughput threshold: -1
    Fetcher: throughput threshold sequence: 5
    -finishing thread FetcherThread9, activeThreads=1
    -finishing thread FetcherThread0, activeThreads=0
    0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
    -activeThreads=0
    FetcherJob: done

Check the data in HBase:

    hbase(main):019:0> scan 'micmiublog_webpage'
    ROW                COLUMN+CELL
     com.micmiu:http/  column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
     com.micmiu:http/  column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/  column=f:cnt, timestamp=1421027385487, value=
     com.micmiu:http/  column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
     com.micmiu:http/  column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
     com.micmiu:http/  column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
     com.micmiu:http/  column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
     com.micmiu:http/  column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
     com.micmiu:http/  column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%
     com.micmiu:http/  column=f:typ, timestamp=1421027385487, value=text/html
     com.micmiu:http/  column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
     com.micmiu:http/  column=h:Connection, timestamp=1421027385487, value=close
     com.micmiu:http/  column=h:Content-Encoding, timestamp=1421027385487, value=gzip
     com.micmiu:http/  column=h:Content-Length, timestamp=1421027385487, value=20
     com.micmiu:http/  column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
     com.micmiu:http/  column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
     com.micmiu:http/  column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
     com.micmiu:http/  column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
     com.micmiu:http/  column=h:Pragma, timestamp=1421027385487, value=no-cache
     com.micmiu:http/  column=h:Server, timestamp=1421027385487, value=LiteSpeed
     com.micmiu:http/  column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
     com.micmiu:http/  column=h:Vary, timestamp=1421027385487, value=Cookie
     com.micmiu:http/  column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
     com.micmiu:http/  column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
     com.micmiu:http/  column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927
     com.micmiu:http/  column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/  column=mk:_injmrk_, timestamp=1421026970740, value=y
     com.micmiu:http/  column=mk:dist, timestamp=1421026970740, value=0
     com.micmiu:http/  column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y
     com.micmiu:http/  column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
     com.micmiu:http/  column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
     com.micmiu:http/  column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
    1 row(s) in 0.0980 seconds

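The fetch status in f:st is easier to read once decoded. The sketch below assumes the value is a big-endian 4-byte int and maps it to the status codes that, to my knowledge, Nutch 2.x defines in its CrawlStatus class. Treat the table as an assumption and check it against the source of your own build; the helper name `fetch_status` is made up for this example.

```python
import struct

# Status codes as I understand them from Nutch 2.x CrawlStatus
# (an assumption, verify against your build).
STATUS_NAMES = {
    0x01: "unfetched",
    0x02: "fetched",
    0x03: "gone",
    0x04: "redir_temp",
    0x05: "redir_perm",
    0x22: "retry",
    0x26: "notmodified",
}


def fetch_status(raw: bytes) -> str:
    """Decode a 4-byte big-endian status value into a readable name."""
    (code,) = struct.unpack(">i", raw)
    return STATUS_NAMES.get(code, "unknown(%d)" % code)


# The shell's \xNN notation is also valid inside a Python bytes literal.
print(fetch_status(b"\x00\x00\x00\x05"))  # redir_perm
```

A permanent redirect would be consistent with the h:Location header, the empty f:cnt, and the ol:http://www.micmiu.com/ outlink in the scan: http://micmiu.com/ redirects to http://www.micmiu.com/.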
[Step 4]: parse

    $ nutch parse 1421027229-1374349927 -crawlId micmiublog
    ParserJob: starting
    ParserJob: resuming: false
    ParserJob: forced reparse: false
    ParserJob: batchId: 1421027229-1374349927
    2015-01-12 09:50:03.525 java[14559:4756783] Unable to load realm info from SCDynamicStore
    Parsing http://micmiu.com/
    http://micmiu.com/ skipped. Content of size 20 was truncated to 0
    ParserJob: success

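Check the data in HBase. The row is unchanged from the fetch step, which is consistent with the parser skipping the page: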

    hbase(main):020:0> scan 'micmiublog_webpage'
    ROW                COLUMN+CELL
     com.micmiu:http/  column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
     com.micmiu:http/  column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/  column=f:cnt, timestamp=1421027385487, value=
     com.micmiu:http/  column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
     com.micmiu:http/  column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
     com.micmiu:http/  column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
     com.micmiu:http/  column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
     com.micmiu:http/  column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
     com.micmiu:http/  column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%
     com.micmiu:http/  column=f:typ, timestamp=1421027385487, value=text/html
     com.micmiu:http/  column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
     com.micmiu:http/  column=h:Connection, timestamp=1421027385487, value=close
     com.micmiu:http/  column=h:Content-Encoding, timestamp=1421027385487, value=gzip
     com.micmiu:http/  column=h:Content-Length, timestamp=1421027385487, value=20
     com.micmiu:http/  column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
     com.micmiu:http/  column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
     com.micmiu:http/  column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
     com.micmiu:http/  column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
     com.micmiu:http/  column=h:Pragma, timestamp=1421027385487, value=no-cache
     com.micmiu:http/  column=h:Server, timestamp=1421027385487, value=LiteSpeed
     com.micmiu:http/  column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
     com.micmiu:http/  column=h:Vary, timestamp=1421027385487, value=Cookie
     com.micmiu:http/  column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
     com.micmiu:http/  column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
     com.micmiu:http/  column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927
     com.micmiu:http/  column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/  column=mk:_injmrk_, timestamp=1421026970740, value=y
     com.micmiu:http/  column=mk:dist, timestamp=1421026970740, value=0
     com.micmiu:http/  column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y
     com.micmiu:http/  column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
     com.micmiu:http/  column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
     com.micmiu:http/  column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
    1 row(s) in 0.0690 seconds

[Step 5]: updatedb

    $ nutch updatedb -crawlId micmiublog
    DbUpdaterJob: starting
    2015-01-12 09:50:47.662 java[14572:4762452] Unable to load realm info from SCDynamicStore
    DbUpdaterJob: done

Check the data in HBase:

    hbase(main):021:0> scan 'micmiublog_webpage'
    ROW                    COLUMN+CELL
     com.micmiu.www:http/  column=f:fi, timestamp=1421027452042, value=\x00'\x8D\x00
     com.micmiu.www:http/  column=f:st, timestamp=1421027452042, value=\x00\x00\x00\x01
     com.micmiu.www:http/  column=f:ts, timestamp=1421027452042, value=\x00\x00\x01J\xDB\xD6$f
     com.micmiu.www:http/  column=mk:dist, timestamp=1421027452042, value=1
     com.micmiu.www:http/  column=mtdt:_csh_, timestamp=1421027452042, value=?\x80\x00\x00
     com.micmiu.www:http/  column=s:s, timestamp=1421027452042, value=?\x80\x00\x00
     com.micmiu:http/      column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
     com.micmiu:http/      column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
     com.micmiu:http/      column=f:cnt, timestamp=1421027385487, value=
     com.micmiu:http/      column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00
     com.micmiu:http/      column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00
     com.micmiu:http/      column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2
     com.micmiu:http/      column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/
     com.micmiu:http/      column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05
     com.micmiu:http/      column=f:ts, timestamp=1421027452042, value=\x00\x00\x01KvS\xDF%
     com.micmiu:http/      column=f:typ, timestamp=1421027385487, value=text/html
     com.micmiu:http/      column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
     com.micmiu:http/      column=h:Connection, timestamp=1421027385487, value=close
     com.micmiu:http/      column=h:Content-Encoding, timestamp=1421027385487, value=gzip
     com.micmiu:http/      column=h:Content-Length, timestamp=1421027385487, value=20
     com.micmiu:http/      column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8
     com.micmiu:http/      column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT
     com.micmiu:http/      column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT
     com.micmiu:http/      column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/
     com.micmiu:http/      column=h:Pragma, timestamp=1421027385487, value=no-cache
     com.micmiu:http/      column=h:Server, timestamp=1421027385487, value=LiteSpeed
     com.micmiu:http/      column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/
     com.micmiu:http/      column=h:Vary, timestamp=1421027385487, value=Cookie
     com.micmiu:http/      column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php
     com.micmiu:http/      column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29
     com.micmiu:http/      column=mk:_injmrk_, timestamp=1421026970740, value=y
     com.micmiu:http/      column=mk:dist, timestamp=1421026970740, value=0
     com.micmiu:http/      column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00
     com.micmiu:http/      column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=
     com.micmiu:http/      column=s:s, timestamp=1421026970740, value=?\x80\x00\x00
    2 row(s) in 0.1140 seconds

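Two things stand out in this scan: a new row for the redirect target, keyed by its reversed host name, and a changed f:ts on the original row. The Python sketch below reproduces both under stated assumptions. `row_key` is a hypothetical helper that imitates the key pattern visible in the scans (Nutch's own implementation may treat edge cases differently), and the byte values are assumed to be big-endian Java primitives.

```python
import struct
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key in the reversed-host form seen in the scans."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = reversed_host + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    # An empty path is treated as "/", matching the key of the seed URL.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


def as_long(raw: bytes) -> int:
    """Decode an 8-byte big-endian signed integer."""
    return struct.unpack(">q", raw)[0]


print(row_key("http://www.micmiu.com/"))  # com.micmiu.www:http/

# f:ts of com.micmiu:http/ after fetch and after updatedb (epoch milliseconds).
ts_after_fetch = as_long(b"\x00\x00\x01J\xDB\xD5\x17%")
ts_after_updatedb = as_long(b"\x00\x00\x01KvS\xDF%")
# f:fi of the same row, a 4-byte big-endian int holding seconds.
fetch_interval = struct.unpack(">i", b"\x00'\x8D\x00")[0]

print((ts_after_updatedb - ts_after_fetch) // 1000)  # 2592000
print(fetch_interval)                                # 2592000
```

The difference between the two f:ts values equals the fetch interval in f:fi, which suggests that updatedb moved the next fetch time of the already fetched page 30 days ahead, while the newly discovered www.micmiu.com row waits for the next generate round.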
----------------- EOF @Michael Sun -----------------
Original article. When reposting, please credit: micmiu – 软件开发+生活点滴 [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/