Building an automated scraping workflow

1. Create a table with a code:text column (it will hold the scraping scripts)

$ bundle exec rails g scaffold fetch_rules code:text

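2. Add a category column to it, so that different content is written to different categories. (For example, you have two tables: a categories table and a news table.) We want to scrape different content into different categories.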

$ bundle exec rails g migration add_category_id_to_fetch_rules category_id:integer

3. Analyze the target site and write the scraping script, like this:

site = 'http://www.phnompenhpost.com'
url = 'http://www.phnompenhpost.com/national'

response = HTTParty.get url 
doc = Nokogiri::HTML(response.body)

target_urls = doc.css('.category a').map do  |link|
    link['href']
end

target_urls.each do | article_url | 

   Rails.logger.info "== article_url: #{site}#{article_url}"
   article_response = HTTParty.get(site + article_url)   # easy to get wrong on the first attempt, so the logger line above is included from the start
   article_html = Nokogiri::HTML(article_response.body).css('#ArticleBody')
   article_title = Nokogiri::HTML(article_response.body).css('.node-title').text().strip
   Article.create! :content => article_html,     # only the body and the title matter here; handle the other fields to suit your own model
       :title => article_title,
       :link => site + article_url, 
       :language => 'en', 
       :category_id => 1,
       :source => site
 end
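
The script above builds each article URL with plain string concatenation (site + article_url), which only works when every href is a relative path. A minimal sketch of a more defensive variant follows, assuming the hrefs may be relative, absolute, or missing. The helper name absolute_article_urls and the sample paths are hypothetical and do not come from the target site.

```ruby
require 'uri'

# Hypothetical helper: turn raw href values into absolute, de-duplicated URLs.
def absolute_article_urls(site, hrefs)
  hrefs
    .compact                                   # drop links that had no href attribute
    .map { |href| URI.join(site, href).to_s }  # resolve relative paths against the site
    .uniq                                      # the same article is often linked twice
end

# Sample hrefs are made up for illustration only.
hrefs = [
  '/national/some-article',
  nil,
  'http://www.phnompenhpost.com/national/some-article',
  '/business/other-article'
]

urls = absolute_article_urls('http://www.phnompenhpost.com', hrefs)
puts urls
```

URI.join raises URI::InvalidURIError on malformed hrefs (for example ones containing spaces), so a real script would probably want to rescue that per link and log it.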


4. Open the fetch_rule edit page and save the script there

5. Edit fetch_rules_controller and add an action:

  def start
    begin
      Fetcher.new.start @fetch_rule
    rescue Exception => e
      Rails.logger.error e
      Rails.logger.error e.backtrace.join("\n")  # important: this writes the full error trace to the log right away
    end 
    redirect_to :back, alert: "Operation succeeded"
  end 

6. Add the Fetcher class in lib/fetcher.rb:

# -*- encoding : utf-8 -*-
require 'sourcify'
require 'httparty'
class Fetcher
  def start fetch_rule
    Rails.logger.info "== fetch starts "
    eval(fetch_rule.code)
    Rails.logger.info "== fetch ends"
  end 
end
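
Because the script runs through eval, errors show up in the backtrace under the generic name (eval). Ruby's eval also accepts a file name and a starting line number, so each rule can be given its own label. The following is only a sketch of that idea in plain Ruby: the method run_rule and the label fetch_rule_1 are hypothetical and not part of the Fetcher above.

```ruby
# Sketch: evaluate a rule's code under a custom label so backtraces name the rule.
# `run_rule` and the label format are assumptions made for this example.
def run_rule(code, label)
  eval(code, binding, label, 1)  # 3rd and 4th arguments: file name and first line number
  nil                            # nil means the script finished without raising
rescue Exception => e            # SyntaxError is not a StandardError, so rescue broadly
  e
end

# A deliberately broken two-line script: the second line calls + on nil.
broken_code = "site = nil\nsite + '/national'\n"

error = run_rule(broken_code, 'fetch_rule_1')
puts error.class
puts error.backtrace.first
```

The first backtrace line then starts with fetch_rule_1:2 instead of (eval):2, which makes it easier to tell which saved rule failed when several exist.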

7. Also add this line to config/application.rb, so that lib/fetcher.rb above gets loaded:

config.autoload_paths += %W(#{config.root}/lib)

8. Add a button to the page:

<%= link_to "fetch", start_fetch_rules_path(:id => @fetch_rule), :method => 'post',  
  :confirm => 'are you sure to fetch?' %>

9. Add the route to config/routes.rb:

  resources :fetch_rules do
    collection do
      post :start    # this is the route the fetch button above posts to
    end
  end

10. Run it! The log shows a pile of errors.

09:55:36 INFO: == fetch starts  
09:55:36 ERROR: undefined method `+' for nil:NilClass 
09:55:36 ERROR: /workspace/topgroup_web/lib/fetcher.rb:8:in `eval'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1465:in `begin_transport'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1410:in `transport_request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1384:in `request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1377:in `block in request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:853:in `start'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1375:in `request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty/request.rb:117:in `perform'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:545:in `perform_request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:476:in `get'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:583:in `get'
(eval):14:in `block in start'
(eval):12:in `each'
(eval):12:in `start'
/workspace/topgroup_web/lib/fetcher.rb:8:in `eval'
/workspace/topgroup_web/lib/fetcher.rb:8:in `start'
/workspace/topgroup_web/app/controllers/fetch_rules_controller.rb:51:in `start'

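The trace points at line 14 of (eval), i.e. line 14 of the saved script. That is why the logger line is needed to check the value there (the cause turned out to be a wrong URL: it did not include the domain).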
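
Most of the trace is library code; the frames that map back to the saved script are the ones labeled (eval). A small sketch of picking them out is shown here, using a few lines copied from the log in step 10. The helper name eval_frames is hypothetical.

```ruby
# Hypothetical helper: keep only the frames that come from the eval'd script.
# Matching on "(eval" also covers newer Rubies, which label frames "(eval at ...)".
def eval_frames(backtrace)
  backtrace.select { |line| line.start_with?('(eval') }
end

# A shortened copy of the backtrace from the log in step 10.
backtrace = [
  "/workspace/topgroup_web/lib/fetcher.rb:8:in `eval'",
  "/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1465:in `begin_transport'",
  "(eval):14:in `block in start'",
  "(eval):12:in `each'",
  "(eval):12:in `start'",
  "/workspace/topgroup_web/lib/fetcher.rb:8:in `start'"
]

frames = eval_frames(backtrace)
puts frames.first
```

The first (eval) frame is the innermost one, so it gives the script line where the failing call was made.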
