Building an automated scraping process
1. Create a table with a code:text column. It will hold the scraping script.
$ bundle exec rails g scaffold fetch_rules code:text
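2. Add a category to it, so that different content is written to different categories (for example, you have two tables: a categories table and a news table). We want to fetch different content into different categories.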
$ bundle exec rails g migration add_category_id_to_fetch_rules category_id:integer
3. Analyze the target site and write the scraping script, as follows:
site = 'http://www.phnompenhpost.com'
url = 'http://www.phnompenhpost.com/national'
response = HTTParty.get url
doc = Nokogiri::HTML(response.body)

# Collect the article links from the category listing page.
target_urls = doc.css('.category a').map do |link|
  link['href']
end

target_urls.each do |article_url|
  Rails.logger.info "== article_url: #{site}#{article_url}"
  # This is easy to get wrong on a first attempt, so the site prefix is added right here.
  article_response = HTTParty.get(site + article_url)
  # Parse the article page once and reuse the document.
  article_doc = Nokogiri::HTML(article_response.body)
  article_html = article_doc.css('#ArticleBody').to_html
  article_title = article_doc.css('.node-title').text.strip
  # Only the body and the title matter here; handle the other fields to suit your own model.
  Article.create! :content => article_html,
    :title => article_title,
    :link => site + article_url,
    :language => 'en',
    :category_id => 1,
    :source => site
end
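The script above builds each article URL by concatenating site and the href. As a hedged alternative, the sketch below resolves the href with URI.join from Ruby's standard library, which also copes with hrefs that are already absolute. The helper name absolute_url and the sample path /national/example-article are made up for illustration and are not part of the original script.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): resolves an href found on
# a page against the site root. Returns nil when the href is missing or blank.
def absolute_url(site, href)
  return nil if href.nil? || href.strip.empty?
  URI.join(site, href.strip).to_s
end

# The path below is a placeholder, not a real article.
puts absolute_url('http://www.phnompenhpost.com', '/national/example-article')
# prints http://www.phnompenhpost.com/national/example-article
```

Inside the loop this would replace site + article_url, and a nil result can be skipped with next before any request is made.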
4. Open the fetch_rule edit page and save the code there.
5. Modify fetch_rules_controller and add an action:
def start
  begin
    Fetcher.new.start @fetch_rule
  rescue Exception => e
    # This part is important: it prints the error log right away.
    Rails.logger.error e
    Rails.logger.error e.backtrace.join("\n")
  end
  redirect_to :back, alert: "Operation succeeded"
end
6. Add the Fetcher class in lib/fetcher.rb:
# -*- encoding : utf-8 -*-
require 'sourcify'
require 'httparty'

class Fetcher
  def start(fetch_rule)
    Rails.logger.info "== fetch starts "
    eval(fetch_rule.code)
    Rails.logger.info "== fetch ends"
  end
end
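Errors raised inside a string passed to eval are reported under a generic (eval) file name, which makes it hard to tell which stored rule failed. A minimal sketch of one way to improve this, assuming you are free to change how the code is evaluated: Kernel#eval accepts an optional binding, file name and starting line number, so a label can be passed in and will show up in the backtrace. RuleRunner, its logger argument and the label fetch_rule_1 are hypothetical names, and the standard-library Logger stands in for Rails.logger so the sketch runs outside Rails.

```ruby
require 'logger'
require 'stringio'

# Hypothetical variant of the Fetcher idea: evaluates stored code under a
# label, so backtrace frames read "fetch_rule_1:2" rather than "(eval):2".
class RuleRunner
  def initialize(logger)
    @logger = logger
  end

  # Returns true when the code ran without raising, false otherwise.
  def run(label, code)
    @logger.info "== fetch starts (#{label})"
    eval(code, binding, label, 1) # file name and first line number for backtraces
    @logger.info "== fetch ends (#{label})"
    true
  rescue ScriptError, StandardError => e
    # ScriptError covers SyntaxError raised by eval for malformed code.
    @logger.error "#{e.class}: #{e.message}"
    @logger.error e.backtrace.join("\n")
    false
  end
end

# Usage: the second line of this sample code calls a method on nil.
log = StringIO.new
runner = RuleRunner.new(Logger.new(log))
runner.run('fetch_rule_1', "urls = nil\nurls.each { |u| puts u }\n")
puts log.string
```

In the Rails app the label could be built from the record, for example from the rule's id, so that the log shows at a glance which rule and which line of it failed.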
7. Also add this line to config/application.rb (its purpose is to load the lib/fetcher.rb file above):
config.autoload_paths += %W(#{config.root}/lib)
8. Add a button to the page:
<%= link_to "fetch", start_fetch_rules_path(:id => @fetch_rule), :method => 'post', :confirm => 'are you sure to fetch?' %>
9. Add the route to config/routes.rb:
resources :fetch_rules do
  collection do
    post :start  # This corresponds to the fetch button above.
  end
end
10. Run it! Check the log, and you will find a pile of errors:
09:55:36 INFO: == fetch starts
09:55:36 ERROR: undefined method `+' for nil:NilClass
09:55:36 ERROR: /workspace/topgroup_web/lib/fetcher.rb:8:in `eval'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1465:in `begin_transport'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1410:in `transport_request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1384:in `request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1377:in `block in request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:853:in `start'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/2.2.0/net/http.rb:1375:in `request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty/request.rb:117:in `perform'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:545:in `perform_request'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:476:in `get'
/home/siwei/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/httparty-0.13.7/lib/httparty.rb:583:in `get'
(eval):14:in `block in start'
(eval):12:in `each'
(eval):12:in `start'
/workspace/topgroup_web/lib/fetcher.rb:8:in `eval'
/workspace/topgroup_web/lib/fetcher.rb:8:in `start'
/workspace/topgroup_web/app/controllers/fetch_rules_controller.rb:51:in `start'
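The trace shows that line 14 of (eval), that is, line 14 of the stored script, is where things go wrong, so add logger calls there to narrow it down. (It later turned out that the URL was wrong: it did not include the domain.)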
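Picking the (eval) frames out of a long trace by eye is tedious. The sketch below shows one way to extract them programmatically, assuming frames have the "(eval):LINE:in ..." shape seen in the log above. The helper name eval_frame_lines is hypothetical, and the sample array is a shortened copy of that log.

```ruby
# Hypothetical helper: returns the line numbers of the backtrace frames that
# lie inside eval'd code, innermost frame first.
def eval_frame_lines(backtrace)
  backtrace.map { |frame| frame[/\A\(eval\):(\d+)/, 1] } # nil for other frames
           .compact
           .map(&:to_i)
end

# A shortened copy of the trace from the log above.
sample_backtrace = [
  "/workspace/topgroup_web/lib/fetcher.rb:8:in `eval'",
  "(eval):14:in `block in start'",
  "(eval):12:in `each'",
  "(eval):12:in `start'",
  "/workspace/topgroup_web/lib/fetcher.rb:8:in `start'"
]

lines = eval_frame_lines(sample_backtrace)
puts "failing script line: #{lines.first}" # prints "failing script line: 14"
```

With the line number in hand, the matching line of the stored script can be looked up with fetch_rule.code.lines[n - 1] and logged next to the error.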