Preprocessing paragraph text into one-sentence-per-line data with Amazon SageMaker Processing & spaCy

Daisuke Asada

2022-08-28

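When training a word-vector model with an algorithm such as Word2Vec, you sometimes need text data in a one-sentence-per-line format. This article uses spaCy to convert paragraph-level data into one-sentence-per-line data, with Amazon SageMaker Processing as the execution environment.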

Why use spaCy?

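Splitting a text made up of several sentences into individual sentences may seem simple. In Japanese, sentences are in most cases delimited by "。" (the full stop). But is that all? As we have just seen, not only the full stop but also "？" (the question mark) can end a sentence.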

If we take the original Japanese text of the paragraph above and split it on full stops and question marks, we get the following.

複数の文で構成されているテキストを文単位にわけるというのは、単純なように思えるかもしれません。
日本語であれば多くの場合、文は「。
」（句点）で区切られます。
しかし、それだけでしょうか？
今まさに見たように、句点だけなく「？
」（クエスチョンマーク）も対象になります。

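In other words, splitting on full stops and question marks alone is not enough. We have to consider not only the symbols but also the context before and after each sentence, and turning that into rules is not easy.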
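The naive split shown above can be reproduced with a short regular expression. This is only a minimal sketch to illustrate the failure mode; the pattern and the variable names are assumptions made for this illustration and are not part of the pipeline described in this article.

```python
import re

# The same Japanese paragraph that is used as the example above.
text = (
    "複数の文で構成されているテキストを文単位にわけるというのは、単純なように思えるかもしれません。"
    "日本語であれば多くの場合、文は「。」（句点）で区切られます。"
    "しかし、それだけでしょうか？"
    "今まさに見たように、句点だけなく「？」（クエスチョンマーク）も対象になります。"
)

# Naive rule: a sentence is any run of characters that ends with "。" or "？".
naive_sentences = re.findall(r"[^。？]*[。？]", text)

for sentence in naive_sentences:
    print(sentence)
```

The rule yields six fragments instead of four sentences, because the quoted "。" and "？" are also treated as sentence boundaries.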

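spaCy is a framework for natural language processing. It can perform tasks such as text classification and named entity recognition. These tasks can use machine-learning-based language models, which are available for Japanese as well as English.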

One of its features is splitting text into sentences, so we can achieve our goal without defining complex rules.

For example, with ja_core_news_trf, a Transformer-based Japanese language model, the earlier example is split as follows.

import spacy


nlp = spacy.load('ja_core_news_trf')
doc = nlp("複数の文で構成されているテキストを文単位にわけるというのは、単純なように思えるかもしれません。日本語であれば多くの場合、文は「。」（句点）で区切られます。しかし、それだけでしょうか？今まさに見たように、句点だけなく「？」（クエスチョンマーク）も対象になります。")
for s in doc.sents:
    print(s)

# 複数の文で構成されているテキストを文単位にわけるというのは、単純なように思えるかもしれません。
# 日本語であれば多くの場合、文は「。」（句点）で区切られます。
# しかし、それだけでしょうか？
# 今まさに見たように、句点だけなく「？」（クエスチョンマーク）も対象になります。

The text is split correctly at the sentence-final "？" and "。".

Preparing the Docker image

Because this job uses spaCy, we prepare a Docker image like the following.

FROM python:3.10-slim-buster

RUN pip3 install spacy==3.4.1
RUN python -m spacy download en_core_web_trf
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

The data is the same English text used in the previous article, so we use the en_core_web_trf language model. For Japanese text, download ja_core_news_trf instead.

Creating the processing script

Next, we create the script that processes the data in Processing.

from io import TextIOWrapper
import logging
import pathlib
import spacy

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


class Processor:
    def __init__(self, output_file: TextIOWrapper):
        self.nlp = spacy.load("en_core_web_trf")
        self.output_file = output_file

    def process(self, line: str):
        doc = self.nlp(line.rstrip())
        for sent in doc.sents:
            self.write_to_file(sent.text)

    def write_to_file(self, sentence: str):
        self.output_file.write(sentence + "\n")


if __name__ == "__main__":
    logger.info("Starting processing.")
    base_dir = "/opt/ml/processing"
    input_dir = f"{base_dir}/input"
    output_dir = f"{base_dir}/output"
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Read one paragraph per line and write one sentence per line.
    with open(f"{input_dir}/data.txt", "r", encoding="utf-8") as input_f:
        with open(f"{output_dir}/data.txt", "w", encoding="utf-8") as output_f:
            conv = Processor(output_f)
            for line in input_f:
                conv.process(line)

    logger.info("Finished processing.")

Each line of the input file is one paragraph, so all the script has to do is split it into sentences with spaCy.
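To try the line-by-line logic locally without the container or the Transformer model, one option is to pass the sentence splitter in as a function. The following is a minimal sketch under that assumption; `split_paragraphs` and `fake_splitter` are hypothetical names introduced only for this illustration, and the period-based splitter is a stand-in, not a replacement for spaCy.

```python
import io
from typing import Callable, Iterable, TextIO


def split_paragraphs(
    input_f: TextIO,
    output_f: TextIO,
    split_sentences: Callable[[str], Iterable[str]],
) -> None:
    """Read one paragraph per line and write one sentence per line."""
    for line in input_f:
        paragraph = line.rstrip()
        if not paragraph:
            continue  # skip blank lines
        for sentence in split_sentences(paragraph):
            output_f.write(sentence + "\n")


def fake_splitter(paragraph: str) -> list:
    # Stand-in for spaCy: splits on periods only, for local testing.
    return [part.strip() + "." for part in paragraph.split(".") if part.strip()]


source = io.StringIO("First sentence. Second sentence.\n\nThird sentence.\n")
sink = io.StringIO()
split_paragraphs(source, sink, fake_splitter)
print(sink.getvalue(), end="")
```

With spaCy installed, the splitter could instead be something like `lambda p: (s.text for s in nlp(p).sents)`, where `nlp` is the loaded model.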

Launching Processing

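The launch code is almost the same as in the previous article, except for the instance type. Because the model is Transformer-based, ml.t3.medium does not have enough memory, so we specify ml.t3.large instead. Being able to scale memory up or down this easily is one of the benefits of using Processing.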

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

image_url = (
    "ACCOUNT_ID.dkr.ecr.ap-northeast-1.amazonaws.com/script-processor:xxxx"
)

script_processor = ScriptProcessor(
    command=["python3"],
    image_uri=image_url,
    role="arn:aws:iam::ACCOUNT_ID:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.t3.large",  # ml.t3.medium runs out of memory with the Transformer model
)

script_processor.run(
    code="processing.py",
    inputs=[
        ProcessingInput(
            source="s3://DATA_BUCKET/input",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://DATA_BUCKET/output",
        )
    ],
)

After running the script above and waiting a while, the Processing job completes.

Looking in the specified S3 bucket, you will find the output as one-sentence-per-line data, as shown in the example below.

(The text below is quoted from "Scipio Africanus: Greater Than Napoleon" by Liddell Hart.)

THUS in the spring of 204 B.C. Scipio embarked his army at Lilybæum (modern Marsala), and sailed for Africa. His fleet is said to have comprised forty warships and four hundred transports, and on board was carried water and rations for fifty-five days, of which fifteen days' supply was cooked. Complete dispositions were made for the protection of the convoy by the warships, and each class of vessel was distinguished by lights at night—the transports one, the warships two, and his own flagship three. It is worth notice that he personally supervised the embarkation of the troops.

The text above has been processed into the following one-sentence-per-line form.

THUS in the spring of 204 B.C. Scipio embarked his army at Lilybæum (modern Marsala), and sailed for Africa.
His fleet is said to have comprised forty warships and four hundred transports, and on board was carried water and rations for fifty-five days, of which fifteen days' supply was cooked.
Complete dispositions were made for the protection of the convoy by the warships, and each class of vessel was distinguished by lights at night—the transports one, the warships two, and his own flagship three.
It is worth notice that he personally supervised the embarkation of the troops.

Even an abbreviation such as B.C. is handled correctly and kept within a single sentence. With spaCy, special cases like this are easy to deal with.