Nutch 2.x: A Demo of Crawling a First Website

$ nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject		inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate 	generate new batches to fetch from crawl db
 fetch 		fetch URLs marked during generate
 parse 		parse URLs marked during fetch
 updatedb 	update web table after parsing
 updatehostdb   update host table after parsing
 readdb 	read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex 	run the solr indexer on parsed batches
 solrdedup 	remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin 	load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit         	runs the given JUnit test
 or
 CLASSNAME 	run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

$ crawl 
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

$ nutch inject urls -crawlId micmiublog
InjectorJob: starting at 2015-01-12 09:42:46
InjectorJob: Injecting urlDir: urls
2015-01-12 09:42:47.096 java[14509:4735452] Unable to load realm info from SCDynamicStore
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1

hbase(main):016:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL 
 com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00 
 com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2 
 com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y 
 com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0 
 com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00 
 com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00 
1 row(s) in 0.1010 seconds
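
The binary cell values in the scan above can be decoded by hand. What follows is a minimal Python sketch, assuming the default Nutch 2.x `gora-hbase-mapping.xml`, in which `f:fi` would be the fetch interval (a 4-byte big-endian int, in seconds), `f:ts` the fetch time (an 8-byte big-endian long, in milliseconds since the epoch), and `s:s` the score (a 4-byte float). The helper name `decode_cells` is hypothetical and is not part of Nutch or HBase.

```python
import struct
from datetime import datetime, timedelta, timezone

# Cell values copied from the scan output as Python bytes literals.
# The HBase shell prints printable bytes as ASCII, so "'" is 0x27,
# "J" is 0x4A and "?" is 0x3F.
RAW_CELLS = {
    "f:fi": b"\x00'\x8d\x00",
    "f:ts": b"\x00\x00\x01J\xdb\xce\xbc\xf2",
    "s:s": b"?\x80\x00\x00",
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_cells(cells):
    """Decode the three binary cells (hypothetical helper; column meanings are assumed)."""
    fetch_interval_s = struct.unpack(">i", cells["f:fi"])[0]  # 4-byte signed int
    fetch_time_ms = struct.unpack(">q", cells["f:ts"])[0]     # 8-byte signed long
    score = struct.unpack(">f", cells["s:s"])[0]              # 4-byte IEEE 754 float
    return {
        "fetch_interval_s": fetch_interval_s,
        "fetch_time_ms": fetch_time_ms,
        "fetch_time_utc": EPOCH + timedelta(milliseconds=fetch_time_ms),
        "score": score,
    }


if __name__ == "__main__":
    for name, value in decode_cells(RAW_CELLS).items():
        print(name, value)
```

Under these assumptions the interval works out to 30 days, the score to 1.0, and the fetch time to the moment the InjectorJob started, expressed in UTC (the log timestamps appear to be in UTC+8).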

$ nutch generate -topN 5 -crawlId micmiublog
GeneratorJob: starting at 2015-01-12 09:47:09
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 5
2015-01-12 09:47:09.822 java[14533:4744993] Unable to load realm info from SCDynamicStore
GeneratorJob: finished at 2015-01-12 09:47:13, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1421027229-1374349927

hbase(main):018:0> scan 'micmiublog_webpage'
ROW COLUMN+CELL 
 com.micmiu:http/ column=f:bid, timestamp=1421027232815, value=1421027229-1374349927 
 com.micmiu:http/ column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00 
 com.micmiu:http/ column=f:ts, timestamp=1421026970740, value=\x00\x00\x01J\xDB\xCE\xBC\xF2 
 com.micmiu:http/ column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927 
 com.micmiu:http/ column=mk:_injmrk_, timestamp=1421026970740, value=y 
 com.micmiu:http/ column=mk:dist, timestamp=1421026970740, value=0 
 com.micmiu:http/ column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00 
 com.micmiu:http/ column=s:s, timestamp=1421026970740, value=?\x80\x00\x00 
1 row(s) in 0.0580 seconds

hbase(main):025:0> get 'micmiublog_webpage','com.micmiu:http/',{COLUMNS => 'f:bid'}
COLUMN  CELL                                                                                                    
 f:bid  timestamp=1421027232815, value=1421027229-1374349927                                                    
1 row(s) in 0.0060 seconds
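
The row key `com.micmiu:http/` used in the `get` above looks like the seed URL `http://micmiu.com/` with its host name reversed and the scheme moved behind it. The Python sketch below reproduces that shape. It is inferred from the key shown here and is not Nutch's own implementation; the function name `reverse_url` and the handling of ports and query strings are assumptions.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a reversed-host row key such as 'com.micmiu:http/' (hypothetical helper)."""
    parts = urlsplit(url)
    # Reverse the dot-separated host labels: micmiu.com -> com.micmiu
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    # Assumption: an explicit port follows the scheme, separated by a colon.
    if parts.port is not None:
        key += f":{parts.port}"
    key += parts.path
    # Assumption: the query string is kept as part of the key.
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    print(reverse_url("http://micmiu.com/"))
```

Keys built this way sort pages of the same domain next to each other, which is a common reason for storing URLs host-reversed in HBase.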

$ nutch fetch 1421027229-1374349927 -crawlId micmiublog -threads 10
FetcherJob: starting
FetcherJob: batchId: 1421027229-1374349927
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
2015-01-12 09:49:37.095 java[14546:4753667] Unable to load realm info from SCDynamicStore
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://micmiu.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done

hbase(main):019:0> scan 'micmiublog_webpage'
ROW                COLUMN+CELL                                                                                             
 com.micmiu:http/  column=f:bas, timestamp=1421027385487, value=http://micmiu.com/                                         
 com.micmiu:http/  column=f:bid, timestamp=1421027232815, value=1421027229-1374349927                                      
 com.micmiu:http/  column=f:cnt, timestamp=1421027385487, value=                                                           
 com.micmiu:http/  column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00                                               
 com.micmiu:http/  column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00                   
 com.micmiu:http/  column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2                              
 com.micmiu:http/  column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/                                         
 com.micmiu:http/  column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05                                            
 com.micmiu:http/  column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%                                  
 com.micmiu:http/  column=f:typ, timestamp=1421027385487, value=text/html                                                  
 com.micmiu:http/  column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0                                                                                           
 com.micmiu:http/  column=h:Connection, timestamp=1421027385487, value=close                                               
 com.micmiu:http/  column=h:Content-Encoding, timestamp=1421027385487, value=gzip                                          
 com.micmiu:http/  column=h:Content-Length, timestamp=1421027385487, value=20                                              
 com.micmiu:http/  column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8                          
 com.micmiu:http/  column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT                             
 com.micmiu:http/  column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT                          
 com.micmiu:http/  column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/                                
 com.micmiu:http/  column=h:Pragma, timestamp=1421027385487, value=no-cache                                                
 com.micmiu:http/  column=h:Server, timestamp=1421027385487, value=LiteSpeed                                               
 com.micmiu:http/  column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/  
 com.micmiu:http/  column=h:Vary, timestamp=1421027385487, value=Cookie                                                    
 com.micmiu:http/  column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php                    
 com.micmiu:http/  column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29                                        
 com.micmiu:http/  column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927                                
 com.micmiu:http/  column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927                                 
 com.micmiu:http/  column=mk:_injmrk_, timestamp=1421026970740, value=y                                                    
 com.micmiu:http/  column=mk:dist, timestamp=1421026970740, value=0                                                        
 com.micmiu:http/  column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y                                               
 com.micmiu:http/  column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00                                         
 com.micmiu:http/  column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=                                       
 com.micmiu:http/  column=s:s, timestamp=1421026970740, value=?\x80\x00\x00                                                
1 row(s) in 0.0980 seconds
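
After the fetch, `f:st` holds `\x00\x00\x00\x05`. The following Python sketch decodes it as a 4-byte big-endian integer and looks it up in a status table. The table is an assumption based on the constants in Nutch 2.x's `CrawlStatus` class as best recalled, so it is worth checking against the Nutch source for the version in use; the names `STATUS_NAMES` and `decode_status` are hypothetical.

```python
import struct

# Assumed status codes, modelled on Nutch 2.x CrawlStatus; verify against your version.
STATUS_NAMES = {
    0x01: "unfetched",
    0x02: "fetched",
    0x03: "gone",
    0x04: "redir_temp",
    0x05: "redir_perm",
    0x22: "retry",
    0x26: "notmodified",
}


def decode_status(raw):
    """Return (code, name) for a 4-byte f:st cell value (hypothetical helper)."""
    code = struct.unpack(">i", raw)[0]
    return code, STATUS_NAMES.get(code, "unknown")


if __name__ == "__main__":
    # Value copied from the f:st cell in the scan output.
    print(decode_status(b"\x00\x00\x00\x05"))
```

If the table is right, the page was recorded as a permanent redirect, which would be consistent with the `h:Location` header and the `ol:http://www.micmiu.com/` outlink in the same row.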

$ nutch parse 1421027229-1374349927 -crawlId micmiublog
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1421027229-1374349927
2015-01-12 09:50:03.525 java[14559:4756783] Unable to load realm info from SCDynamicStore
Parsing http://micmiu.com/
http://micmiu.com/ skipped. Content of size 20 was truncated to 0
ParserJob: success

hbase(main):020:0> scan 'micmiublog_webpage'
ROW                COLUMN+CELL                                                                                             
 com.micmiu:http/  column=f:bas, timestamp=1421027385487, value=http://micmiu.com/                                         
 com.micmiu:http/  column=f:bid, timestamp=1421027232815, value=1421027229-1374349927                                      
 com.micmiu:http/  column=f:cnt, timestamp=1421027385487, value=                                                           
 com.micmiu:http/  column=f:fi, timestamp=1421026970740, value=\x00'\x8D\x00                                               
 com.micmiu:http/  column=f:prot, timestamp=1421027385487, value=\x18\x02,http://www.micmiu.com/\x00\x00                   
 com.micmiu:http/  column=f:pts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xCE\xBC\xF2                              
 com.micmiu:http/  column=f:rpr, timestamp=1421027385487, value=http://micmiu.com/                                         
 com.micmiu:http/  column=f:st, timestamp=1421027385487, value=\x00\x00\x00\x05                                            
 com.micmiu:http/  column=f:ts, timestamp=1421027385487, value=\x00\x00\x01J\xDB\xD5\x17%                                  
 com.micmiu:http/  column=f:typ, timestamp=1421027385487, value=text/html                                                  
 com.micmiu:http/  column=h:Cache-Control, timestamp=1421027385487, value=no-store, no-cache, must-revalidate, post-check=0, pre-check=0                                                                                           
 com.micmiu:http/  column=h:Connection, timestamp=1421027385487, value=close                                               
 com.micmiu:http/  column=h:Content-Encoding, timestamp=1421027385487, value=gzip                                          
 com.micmiu:http/  column=h:Content-Length, timestamp=1421027385487, value=20                                              
 com.micmiu:http/  column=h:Content-Type, timestamp=1421027385487, value=text/html; charset=UTF-8                          
 com.micmiu:http/  column=h:Date, timestamp=1421027385487, value=Mon, 12 Jan 2015 01:49:41 GMT                             
 com.micmiu:http/  column=h:Expires, timestamp=1421027385487, value=Thu, 19 Nov 1981 08:52:00 GMT                          
 com.micmiu:http/  column=h:Location, timestamp=1421027385487, value=http://www.micmiu.com/                                
 com.micmiu:http/  column=h:Pragma, timestamp=1421027385487, value=no-cache                                                
 com.micmiu:http/  column=h:Server, timestamp=1421027385487, value=LiteSpeed                                               
 com.micmiu:http/  column=h:Set-Cookie, timestamp=1421027385487, value=PHPSESSID=5657f9f9da456a7bf6e243f78b7e0182; path=/  
 com.micmiu:http/  column=h:Vary, timestamp=1421027385487, value=Cookie                                                    
 com.micmiu:http/  column=h:X-Pingback, timestamp=1421027385487, value=http://www.micmiu.com/xmlrpc.php                    
 com.micmiu:http/  column=h:X-Powered-By, timestamp=1421027385487, value=PHP/5.3.29                                        
 com.micmiu:http/  column=mk:_ftcmrk_, timestamp=1421027385487, value=1421027229-1374349927                                
 com.micmiu:http/  column=mk:_gnmrk_, timestamp=1421027232815, value=1421027229-1374349927                                 
 com.micmiu:http/  column=mk:_injmrk_, timestamp=1421026970740, value=y                                                    
 com.micmiu:http/  column=mk:dist, timestamp=1421026970740, value=0                                                        
 com.micmiu:http/  column=mtdt:___rdrdsc__, timestamp=1421027385487, value=y                                               
 com.micmiu:http/  column=mtdt:_csh_, timestamp=1421026970740, value=?\x80\x00\x00                                         
 com.micmiu:http/  column=ol:http://www.micmiu.com/, timestamp=1421027385487, value=                                       
 com.micmiu:http/  column=s:s, timestamp=1421026970740, value=?\x80\x00\x00                                                
1 row(s) in 0.0690 seconds

hbase(main):020:0> scan 'micmiublog_webpage'