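For the past few months I have been pulled into team building and software engineering work, and I have not been able to do any data analysis on the job. Kaggle has just launched the Quora competition (a natural language processing competition), so I pulled myself together and am starting by reading Kernels to study.

I have forgotten a great deal, including what I learned on Coursera, so I am writing these notes as a memo.

The Kernel I read as an introduction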

A Data Science Framework for Quora | Kaggle

Data science workflow

[Figure: data science workflow]

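Data preprocessing items

Removing the target column
Sampling
Balanced vs. imbalanced data (undersampling, SMOTE)
Handling missing values (e.g. replacing them with the mean)
Noise filtering
Standardization and normalization
PCA
Feature selection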
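
As a rough illustration of three of the items above (mean imputation, standardization, and normalization), here is a minimal sketch using only the standard library. The function names and the sample column `raw` are hypothetical and were made up for this memo; in practice a library such as scikit-learn would normally be used instead.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing value.
raw = [1.0, None, 3.0, 5.0]
filled = impute_mean(raw)

print(filled)                     # [1.0, 3.0, 3.0, 5.0]
print(standardize(filled))        # symmetric around 0
print(min_max_normalize(filled))  # [0.0, 0.5, 0.5, 1.0]
```

Note that both rescaling functions divide by zero on a constant column, so a real pipeline would need to guard against that case.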

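Balanced vs. imbalanced

When one class makes up 90% of the data and the other only 10%, standard optimization criteria and performance metrics stop being effective. Accuracy should not be used to measure performance on an imbalanced dataset.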
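
To see why accuracy is misleading here, consider a toy example with the same 90/10 split. The labels are made up for illustration, and the "model" simply predicts the majority class every time.

```python
# Toy labels: 90 negatives (0) and 10 positives (1).
y_true = [0] * 90 + [1] * 10
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the minority class: the share of real positives that were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(accuracy)  # 0.9
print(recall)    # 0.0
```

The classifier scores 90% accuracy without finding a single positive example, which is why metrics such as recall, precision, or F1 are more informative in this setting.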

Handling symbols

The standard library's string module provides a list of punctuation symbols. This is useful for removing noise during EDA.

>>> import string
>>> string.punctuation
-> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
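
A minimal sketch of how this constant could be used for noise removal and for counting punctuation. The helper names and the sample sentence are hypothetical. Keep in mind that string.punctuation only covers ASCII symbols, so characters such as curly quotes are left untouched.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


def count_punctuation(text):
    """Count the ASCII punctuation characters in text."""
    return sum(ch in string.punctuation for ch in text)


sample = "Why is Python so popular? Is it easy, fast, or both?!"
print(strip_punctuation(sample))  # Why is Python so popular Is it easy fast or both
print(count_punctuation(sample))  # 5
```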

Stopwords

The NLTK library provides a list of English stopwords. NLTK stands for Natural Language Toolkit. The list retrieved here contains 179 stopwords.

>>> # If the corpus is missing, run nltk.download("stopwords") once beforehand.
>>> from nltk.corpus import stopwords
>>> eng_stopwords = set(stopwords.words("english"))
>>> print(eng_stopwords)
->{"you've", 'an', 'at', 'through', 'over', 'am', "she's", 'any', 'theirs', 'now', 'we', 'it', 'being', 'for', 'doesn', 'if', 'there', "shouldn't", 'don', 'ma', 'my', 'her', 'this', 'that', 'won', 'then', 'more', 'such', 'had', 'most', "haven't", 'were', 'mightn', "mightn't", 'how', 're', 'yours', 'shan', 'you', 'isn', 'hers', "don't", 'a', 'just', 'below', 'down', "that'll", 'herself', 'than', 'only', 'few', 'does', 'out', "you'll", 'before', 'himself', 's', 'hasn', 'these', "hasn't", 'its', 'myself', 'against', 'but', 'off', 'whom', 'is', "you'd", 'some', 'ourselves', 'shouldn', "aren't", 'she', "should've", 'on', 'both', 'very', 'in', 'the', 'have', 'did', 'because', "couldn't", 'no', 'not', 'while', 'm', 'didn', 'to', 'under', "it's", 'once', 'needn', "didn't", 'other', "doesn't", 'are', "mustn't", 'all', 'been', 'who', 'as', 'during', 'above', 'and', 'why', 'do', 'with', 'until', 'from', 'again', 'between', 'after', 'which', 'your', 'will', 'wouldn', 'up', 'into', 'by', "wasn't", 'too', 'same', 'nor', "won't", 'about', "you're", 'here', 'themselves', 'has', 'i', 'each', 'd', 'further', 'was', 'those', 'o', 'couldn', "hadn't", 'them', 'aren', 'ain', 'yourselves', 'or', 've', 'what', 'mustn', 'should', 'our', "needn't", 'be', 'wasn', 'they', 'own', 'itself', 'so', 'doing', 'him', 'their', 'of', 't', 'hadn', 'where', "shan't", 'having', 'll', 'yourself', "wouldn't", 'when', 'weren', 'y', "isn't", 'ours', 'haven', 'his', 'me', "weren't", 'he', 'can'}
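
A minimal sketch of filtering and counting stopwords in a question. So that it runs without downloading the corpus, it hard-codes a small subset of the words listed above as STOPWORDS_SUBSET. That subset, the helper names, and the sample question are assumptions for illustration; with NLTK installed, eng_stopwords would be passed in instead.

```python
# Small hand-picked subset of the NLTK English stopwords shown above,
# hard-coded so the example runs without the corpus.
STOPWORDS_SUBSET = {"the", "is", "a", "of", "in", "to", "what", "and"}


def remove_stopwords(text, stopwords=STOPWORDS_SUBSET):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


def count_stopwords(text, stopwords=STOPWORDS_SUBSET):
    """Count how many whitespace-separated tokens are stopwords."""
    return sum(w in stopwords for w in text.lower().split())


question = "What is the best way to learn a language"
print(remove_stopwords(question))  # ['best', 'way', 'learn', 'language']
print(count_stopwords(question))   # 5
```

Splitting on whitespace leaves punctuation attached to tokens (for example "language?"), so punctuation would usually be stripped first.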

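EDA perspectives

A few things worth looking at during EDA for natural language processing:

Number of words in the text
Number of unique words in the text
Number of characters in the text
Number of stopwords
Number of punctuation marks
Number of uppercase words
Number of title-case words (only the first letter capitalized)
Average word length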
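
The items above can be sketched as one function that returns all eight values for a single text. The function name, the dictionary keys, and the small STOPWORDS_SUBSET are hypothetical choices made for this memo. Tokenization is a plain whitespace split, which is an assumption rather than the Kernel's exact method.

```python
import string

# Stand-in for the NLTK stopword set, so the sketch runs without the corpus.
STOPWORDS_SUBSET = {"the", "is", "a", "of", "in", "to", "what", "and"}


def text_meta_features(text, stopwords=STOPWORDS_SUBSET):
    """Compute simple count-based features for one text."""
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(w.lower() in stopwords for w in words),
        "num_punctuations": sum(ch in string.punctuation for ch in text),
        "num_words_upper": sum(w.isupper() for w in words),
        "num_words_title": sum(w.istitle() for w in words),
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


features = text_meta_features("Is NASA in the USA?")
for name, value in features.items():
    print(name, value)
```

For the sample sentence this gives 5 words, 19 characters, 3 stopwords, 1 punctuation mark, 2 uppercase words ("NASA" and "USA?"), 1 title-case word ("Is"), and a mean word length of 3.0.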

Checking upper and lower case

Use isupper(), islower(), and istitle().

>>> test = ['ISHIO', 'Ishio', 'ishio']
>>> for t in test:
...     print(t.isupper(), t.islower(), t.istitle())
->
True False False
False False True
False True False
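
These methods have a few edge cases that affect word counts on real text. The sample words below were picked for illustration. A one-letter word such as "I" counts as both uppercase and title case, and a word containing an apostrophe is never title case because the letter after the apostrophe is lowercase.

```python
cases = ["I", "USA?", "Don't", "iPhone", "2018"]

# Map each word to its (isupper, islower, istitle) results.
flags = {w: (w.isupper(), w.islower(), w.istitle()) for w in cases}

for word, result in flags.items():
    print(repr(word), *result)
# 'I' True False True
# 'USA?' True False False
# "Don't" False False False
# 'iPhone' False False False
# '2018' False False False
```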

For Your ISHIO Blog

A blog where I post thoughts on data analysis, machine learning, Scrum, organizations, and more.

[Memo] Notes on reading the Kernel: A Data Science Framework for Quora