Error message shown when Nutch runs a fetch:
FetcherJob: starting
FetcherJob: batchId: 1420598193-2940
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:240)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:152)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:219)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:301)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:307)
Cause: the http.agent.name property has no value configured.
Fix: open $NUTCH_HOME/conf/nutch-site.xml and add the following inside the <configuration> element:
<property>
  <name>http.agent.name</name>
  <value>micmiu</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>micmiu,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
PS: if the http.robots.agents property is not configured as well, Nutch reports this error:
Your 'http.agent.name' value should be listed first in 'http.robots.agents' property
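Both conditions can be checked before starting a fetch. The following is a minimal Python sketch, not Nutch's own validation code: the helper names `read_properties` and `check_agent_settings` are hypothetical, the comparison is a simplified exact match, and the inline `SAMPLE` string stands in for the contents of nutch-site.xml.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_agent_settings(props):
    """Return a list of problems; an empty list means both checks pass.

    Simplified approximation of the two errors described in this post,
    not a copy of the logic in FetcherJob.checkConfiguration.
    """
    problems = []
    agent = props.get("http.agent.name", "")
    if not agent:
        problems.append("http.agent.name is missing or empty")
        return problems
    # Split the comma-separated list and drop empty entries.
    robots = [a.strip() for a in props.get("http.robots.agents", "").split(",")]
    robots = [a for a in robots if a]
    if not robots or robots[0] != agent:
        problems.append("http.agent.name is not listed first in http.robots.agents")
    return problems


# Stand-in for the two properties added to nutch-site.xml above.
SAMPLE = """<configuration>
  <property><name>http.agent.name</name><value>micmiu</value></property>
  <property><name>http.robots.agents</name><value>micmiu,*</value></property>
</configuration>"""

if __name__ == "__main__":
    print(check_agent_settings(read_properties(SAMPLE)))
```

To check a real installation, read the text of $NUTCH_HOME/conf/nutch-site.xml and pass it to `read_properties` in place of `SAMPLE`. Note that this sketch looks only at that one file and does not merge in defaults from nutch-default.xml.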
—————– EOF @Michael Sun —————–