Building an automated scraping workflow

2017-03-08 15:23


1. Create a table with a code:text column to hold the scraping scripts:

$ bundle exec rails g scaffold fetch_rules code:text

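2. Add a category reference to it, so that different content is written to different categories. (For example, you have two tables: a categories table and a news table.) We want to scrape different content into different categories.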

$ bundle exec rails g migration add_category_id_to_fetch_rules category_id:integer

3. Analyze the target site and write the scraping script, as follows:

site = 'http://www.phnompenhpost.com'
url = 'http://www.phnompenhpost.com/national'

response = HTTParty.get(url)
doc = Nokogiri::HTML(response.body)

# Collect the article links from the category listing page
target_urls = doc.css('.category a').map do |link|
  link['href']
end

target_urls.each do |article_url|
  Rails.logger.info "== article_url: #{site}#{article_url}"
  # The hrefs do not include the domain, so prepend the site.
  # This is easy to get wrong on the first attempt, so it is added here directly.
  article_response = HTTParty.get(site + article_url)
  article_doc = Nokogiri::HTML(article_response.body)   # parse the page once and reuse it
  article_html = article_doc.css('#ArticleBody').to_html
  article_title = article_doc.css('.node-title').text.strip
  # Only the body and the title matter here; handle the other attributes to suit your own model
  Article.create! :content => article_html,
    :title => article_title,
    :link => site + article_url,
    :language => 'en',
    :category_id => 1,
    :source => site
end
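The script above builds each article URL by string concatenation, which only works while every href is a path without the domain. A minimal sketch of a more forgiving approach, assuming Ruby's standard URI library is acceptable here; the helper name absolute_url and the sample paths are hypothetical and not part of the original script:

```ruby
require 'uri'

# Hypothetical helper: resolve an href against the site root, so that
# both relative paths and already-absolute links yield a full URL.
def absolute_url(site, href)
  # Links without an href attribute come back as nil from Nokogiri
  return nil if href.nil? || href.strip.empty?
  URI.join(site, href.strip).to_s
end

puts absolute_url('http://www.phnompenhpost.com', '/national/some-article')
puts absolute_url('http://www.phnompenhpost.com', 'http://www.phnompenhpost.com/national/other')
```

Inside the loop, the result could replace site + article_url, and a nil result could simply be skipped with next.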

4. Go to the fetch_rule edit page and save the code there.

5. Modify fetch_rules_controller and add an action:

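  def start
    # Assumption: the rule is looked up from the id passed by the fetch link
    @fetch_rule = FetchRule.find(params[:id])
    begin
      Fetcher.new.start @fetch_rule
    rescue StandardError, ScriptError => e  # ScriptError covers syntax errors in the eval'd script
      Rails.logger.error e
      Rails.logger.error e.backtrace.join("\n")  # This is important: it prints the error to the log right away
    end
    redirect_to :back, alert: "Operation finished, check the log for errors"
  end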

6. Add the Fetcher class in lib/fetcher.rb:

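# -*- encoding : utf-8 -*-
require 'sourcify'
require 'httparty'
require 'nokogiri'
class Fetcher
  def start fetch_rule
    Rails.logger.info "== fetch starts "
    eval(fetch_rule.code)  # runs the saved script as-is, so only trusted users should be able to edit rules
    Rails.logger.info "== fetch ends"
  end
end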
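When several rules exist, a backtrace that only says (eval) does not tell you which rule failed. A minimal sketch of one way around that, assuming you are free to change how eval is called: Kernel#eval accepts an optional binding, file name and starting line number, and the file name then appears in the backtrace. The method name run_rule_code and the label fetch_rule_42 are hypothetical.

```ruby
# Hypothetical variant of the eval call in Fetcher#start.
# The label is used in place of "(eval)" in any backtrace.
def run_rule_code(code, label)
  eval(code, binding, label, 1)
end

begin
  # The second line of this sample script raises on purpose
  run_rule_code("x = 1\nraise 'boom'\n", 'fetch_rule_42')
rescue StandardError => e
  puts e.backtrace.first   # starts with the label and the failing line number
end
```

In Fetcher#start the label could be built from the record, for example "fetch_rule_#{fetch_rule.id}".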

7. Also add this line to config/application.rb (so that the lib/fetcher.rb above gets loaded):

config.autoload_paths += %W(#{config.root}/lib)

8. Add a button to the page:

<%= link_to "fetch", start_fetch_rules_path(:id => @fetch_rule), :method => 'post',
  :data => { :confirm => 'are you sure to fetch?' } %>

9. Add the route to config/routes.rb:

  resources :fetch_rules do
    collection do
      post :start    # This corresponds to the fetch button above.
    end
  end

10. Run it! Looking at the log, there is a whole pile of errors:

09:55:36 INFO: == fetch starts  
09:55:36 ERROR: undefined method `+' for nil:NilClass 
09:55:36 ERROR: /workspace/topgroup_web/lib/fetcher.rb:8:in `eval'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1465:in `begin_transport'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1410:in `transport_request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1384:in `request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1377:in `block in request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:853:in `start'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1375:in `request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty/request.rb:117:in `perform'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:545:in `perform_request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:476:in `get'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:583:in `get'
(eval):14:in `block in start'
(eval):12:in `each'
(eval):12:in `start'
/workspace/topgroup_web/lib/fetcher.rb:8:in `eval'
/workspace/topgroup_web/lib/fetcher.rb:8:in `start'
/workspace/topgroup_web/app/controllers/fetch_rules_controller.rb:51:in `start'

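The backtrace points at line 14 of (eval), so add logger calls there to narrow the problem down. (It later turned out the URL was wrong: it did not include the domain.)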
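Because the script lives in the database rather than in a file, it is not obvious what line 14 of (eval) actually is. A minimal sketch of a helper that prints the lines around a given line number of the saved code; the name lines_around and the sample script are hypothetical:

```ruby
# Hypothetical helper: return the lines surrounding `lineno` in a script
# string, each prefixed with its line number, and mark the target line.
def lines_around(code, lineno, context = 2)
  lines = code.lines
  first = [lineno - context, 1].max
  last  = [lineno + context, lines.size].min
  (first..last).map do |n|
    marker = n == lineno ? '>>' : '  '
    format('%s %3d: %s', marker, n, lines[n - 1].chomp)
  end
end

# Stand-in for a saved script with 20 lines
sample = (1..20).map { |i| "line_#{i}" }.join("\n")
puts lines_around(sample, 14)
```

In the rescue block of the controller, something like Rails.logger.error lines_around(@fetch_rule.code, 14).join("\n") would put the offending lines next to the backtrace.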
