Trying Morphological Analysis and Sentiment Analysis on Aozora Bunko Data

I ran morphological analysis on text from Aozora Bunko, judged whether each resulting word is negative or positive, and computed a score.

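I used MeCab for the morphological analysis and the word sentiment polarity table (単語感情極性対応表) linked below for the negative/positive judgment.

Word sentiment polarity table: http://www.lr.pi.titech.ac.jp/~takamura/pndic_ja.html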
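As a minimal sketch of how a line of that table can be turned into a word and a score: the script further down takes the first colon-separated field as the word and the last as the score. The helper name `parse_polarity_line` and the two sample lines here are assumptions for illustration, not part of the original program.

```ruby
# Parse one line of the polarity dictionary into [word, score].
# Assumed line format: word:reading:part_of_speech:score
def parse_polarity_line(line)
  fields = line.chomp.split(":")
  return nil if fields.size < 2 # skip blank or malformed lines
  [fields.first, fields.last.to_f]
end

# Hypothetical sample lines, for illustration only.
sample = <<~DIC
  優れる:すぐれる:動詞:1
  悪い:わるい:形容詞:-1
DIC

polarity = sample.each_line.map { |l| parse_polarity_line(l) }.compact.to_h
p polarity
```

Returning nil for malformed lines keeps a stray blank line from adding a nil key to the hash.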

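I write a Ruby program that scrapes the Aozora Bunko page of the text to analyze, parses the HTML structure to extract only the body text, and then runs morphological analysis and negative/positive scoring.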

For the scraping, Ruby has a handy web crawling and scraping module called Anemone, so I use that.

Anemone: https://github.com/chriskite/anemone

# -*- coding: utf-8 -*-
require 'kconv'
require 'bundler'
Bundler.require
STDOUT.sync = true
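# URL of the Aozora Bunko page to analyze
url = ARGV[0]
# Load the polarity dictionary into a word => score hash
# (the score is the last colon-separated field of each line)
negapozi_master = Hash.new()
File.open("pn_ja.dic", "r:UTF-8") do |f|
    while line = f.gets
        arr = line.gsub(/(\r|\n)/, "").split(":")
        negapozi_master[arr[0]] = arr[arr.size - 1].to_f
    end
end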
mecab = Natto::MeCab.new
options = {
    :depth_limit => 0,
    :skip_query_strings => false,
    :read_timeout => 60,
}
title = ""
author = ""
surface_cnt = Hash.new()
feature_cnt = Hash.new()
negapozi_score = 0
Anemone.crawl(url, options) do |anemone|
    anemone.skip_links_like /.*\.doc|.*\.jpg|.*\.png|.*\.gif|.*\.pdf|.*\.zip/
    anemone.on_every_page do |page|
        doc = Nokogiri::HTML.parse(page.body)
        doc.xpath("//h1[@class='title']").each do |element|
            text = element.content
            text = text.gsub(/(\t|\s|\n|\r|\f|\v)/, "")
            title = text
        end
        doc.xpath("//h2[@class='author']").each do |element|
            text = element.content
            text = text.gsub(/(\t|\s|\n|\r|\f|\v)/, "")
            author = text
        end
        doc.xpath("//div[@class='main_text']").each do |element|
            text = element.content
            text.force_encoding("UTF-8")
            text = text.scrub("?")
            text = text.gsub(/<rt>.*?<\/rt>/, "")
            text = text.gsub(/<\/?[^>]*>/, "")
            text = text.gsub(/(\t|\s|\n|\r|\f|\v)/, "")
            text = text.gsub(/(\(|\))/, "")
            mecab.parse(text) do |node|
                next if node.surface.nil? || node.feature.nil?
                surface = node.surface
                feature = node.feature.split(",")[0]
                next if feature == "記号" || surface == "、" || surface == "。"
                surface_cnt[surface] = surface_cnt.has_key?(surface) ? surface_cnt[surface] + 1 : 1
                feature_cnt[feature] = feature_cnt.has_key?(feature) ? feature_cnt[feature] + 1 : 1
                negapozi_score = negapozi_score + negapozi_master[surface] if negapozi_master.has_key?(surface)
            end
        end
    end
end
# Sort by descending count and print the top five of each
surface_cnt = surface_cnt.sort_by{|key, value| -value}
feature_cnt = feature_cnt.sort_by{|key, value| -value}
puts  "title : " + title + ", author: " + author
puts "++++ surface count ++++"
surface_cnt.first(5).each do |key, value|
    puts key + ": " + value.to_s
end
puts "++++ feature count ++++"
feature_cnt.first(5).each do |key, value|
    puts key + " : " + value.to_s
end
puts "++++ negapozi score ++++"
puts negapozi_score.to_s

I ran it against a few pages.

# ruby sc.rb http://www.aozora.gr.jp/cards/000035/files/301_14912.html
title : 人間失格, author: 太宰治
++++ surface count ++++
の: 2672
に: 1696
て: 1606
は: 1366
た: 1333
++++ feature count ++++
助詞 : 12749
名詞 : 12542
動詞 : 5866
助動詞 : 4499
副詞 : 1316
++++ negapozi score ++++
-5414.021331189988
# ruby sc.rb http://www.aozora.gr.jp/cards/000081/files/456_15050.html
title : 銀河鉄道の夜, author: 宮沢賢治
++++ surface count ++++
の: 1275
た: 948
て: 866
に: 780
は: 624
++++ feature count ++++
名詞 : 6536
助詞 : 6484
動詞 : 3305
助動詞 : 2651
副詞 : 925
++++ negapozi score ++++
-2613.359258273997
# ruby sc.rb http://www.aozora.gr.jp/cards/000140/files/50131_42408.html
title : アーサー王物語, author: テニソン　Tennyson
++++ surface count ++++
e: 825
t: 637
h: 605
a: 597
の: 562
++++ feature count ++++
名詞 : 13347
助詞 : 3164
動詞 : 1365
助動詞 : 806
副詞 : 200
++++ negapozi score ++++
-1171.1144158299985

Pretty negative across the board.

I printed the top five words and the top five parts of speech, but nothing especially interesting came out of them.
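One likely reason is visible in the output: the top surfaces are all particles and auxiliaries such as の, に, and て. A possible variation, not something the original script does, is to count only content words. The sketch below assumes the tokens are already available as `[surface, part_of_speech]` pairs; the method name `top_surfaces`, the default list of kept parts of speech, and the sample tokens are all hypothetical.

```ruby
# Count surfaces, keeping only tokens whose part of speech is in `keep`.
def top_surfaces(tokens, keep: ["名詞", "動詞", "形容詞"], limit: 5)
  counts = Hash.new(0)
  tokens.each do |surface, pos|
    counts[surface] += 1 if keep.include?(pos)
  end
  # Highest count first, truncated to `limit` entries
  counts.sort_by { |_, n| -n }.first(limit)
end

# Hypothetical tokens, for illustration only.
tokens = [
  ["自分", "名詞"], ["は", "助詞"], ["自分", "名詞"],
  ["の", "助詞"], ["笑う", "動詞"], ["の", "助詞"]
]
top = top_surfaces(tokens)
p top
```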

For the polarity lookup, it probably would have been more reliable to build a MeCab user dictionary and match against that.

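The polarity table itself leans fairly negative, so perhaps the Arthur text scored less negatively simply because much of it is English, which the Japanese dictionary does not match.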
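The lean of the table can be checked by counting negative and positive entries and taking the mean score. This is a rough sketch: `polarity_balance` is a hypothetical helper, and the sample hash is made up for illustration. In practice the word => score hash loaded from pn_ja.dic would be passed in.

```ruby
# Summarize how the scores in a word => score hash are distributed.
def polarity_balance(polarity)
  negative = polarity.count { |_, score| score < 0 }
  positive = polarity.count { |_, score| score > 0 }
  mean = polarity.empty? ? 0.0 : polarity.values.sum / polarity.size.to_f
  { negative: negative, positive: positive, mean: mean }
end

# Hypothetical entries, for illustration only.
sample = { "良い" => 0.9, "悪い" => -1.0, "暗い" => -0.8, "机" => -0.2 }
balance = polarity_balance(sample)
p balance
```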

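The more text there is, the more negative the score is likely to become, so some normalization, such as dividing by the amount of text, seems necessary.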
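One way to do that normalization, as a sketch under assumptions, is to divide the summed score by the number of tokens that matched the dictionary, so that texts of different lengths become comparable. The name `mean_polarity` and the sample data are hypothetical. Dividing by the total token count instead would be an equally reasonable choice.

```ruby
# Return the mean polarity per matched token, or nil if nothing matched.
def mean_polarity(tokens, polarity)
  scores = tokens.map { |t| polarity[t] }.compact
  return nil if scores.empty?
  scores.sum / scores.size
end

# Hypothetical dictionary and texts, for illustration only.
polarity = { "良い" => 0.5, "悪い" => -1.0 }
short_text = ["良い", "悪い"]
long_text = short_text * 100 # same wording, 100 times longer

p mean_polarity(short_text, polarity)
p mean_polarity(long_text, polarity)
```

The raw sums for these two texts differ by a factor of 100, while the per-token means are identical.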
