Introduction to RcppMeCab 0.0.1.0
There are several part-of-speech and morphological analyzers for Asian languages. Unlike English, East Asian languages need a morphological analyzer for natural language processing, since the same character can have several meanings depending on its position in the sentence, and some of these languages are not segmented into words.
For Japanese, MeCab is the current state-of-the-art part-of-speech and morphological analyzer. For Korean, MeCab is one of the best analyzers for this purpose. I don’t know much about Chinese, but MeCab could be used there too, since Japanese also makes extensive use of Chinese characters.
The Story Behind Developing RcppMeCab
As a researcher interested in analyzing text, I find R an attractive tool. An active research infrastructure for English text mining exists, so many of its resources can be applied to research on Korean text. Better research results require analyzing very large corpora. But KoNLP, the most widely used R package for analyzing Korean, runs on rJava. Since rJava operates on the JVM, there are limitations on memory size and processing speed.
Hence, in 2017 I developed Rmecab, which runs on C++ through Rcpp. But it has several shortcomings. First, MeCab supports CJK (and there is a reference for the Thai language), but Rmecab analyzes Korean only. Second, I failed to build a stable Windows version of that package.
The Rmecab package received warm attention from the Korean R community. Heewon Jeon, the developer of the KoNLP package and a member of the R text analysis software developer community, suggested expanding the package's language support. So I started to tackle the problems I had failed to solve before, and this package is the result of that effort.
RcppMeCab: Installation
RcppMeCab can be installed from the GitHub repository. For now, I am not planning to offer compiled packages, since this package tries to support various languages based on the MeCab binary the user has installed. If there is a good way to provide compiled versions for each language, it might be supported later.
Mac OSX & Linux
First, install MeCab.
$ tar zxfv mecab-X.X.tar.gz (or mecab-ko-XX.tar.gz)
$ cd mecab-X.X (or cd mecab-ko-XX)
$ ./configure
$ make
($ make check)
$ su
# make install
Second, install the MeCab dictionary you want to use for the analysis.
- For Korean: mecab-ko-dic
- For Japanese: please refer to MeCab
- For Chinese: please refer to blog
$ tar zxfv mecab-ipadic-2.7.0-XXXX.tar.gz (or tar zxfv mecab-ko-dic-XX.tar.gz)
$ cd mecab-ipadic-2.7.0-XXXX (or cd mecab-ko-dic-XX)
$ ./configure
$ make
$ su
# make install
Third, install RcppMeCab in the R console.
install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")
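If the installation fails, it can help to confirm from R that the MeCab built in the first two steps is actually visible. The following is only a minimal sketch, assuming that `mecab` and its companion script `mecab-config` were installed onto the `PATH` by `make install`; the variable names are illustrative and not part of RcppMeCab.

```r
# Look for the mecab executable on the PATH; Sys.which() returns "" if it is missing.
mecab_bin <- Sys.which("mecab")
dicdir <- NA_character_

if (nzchar(mecab_bin)) {
  message("mecab found at: ", mecab_bin)

  # mecab-config is installed alongside MeCab; --dicdir prints the
  # directory where dictionaries are installed.
  mecab_cfg <- Sys.which("mecab-config")
  if (nzchar(mecab_cfg)) {
    dicdir <- system2(mecab_cfg, "--dicdir", stdout = TRUE)
    message("dictionary directory: ", dicdir)
  }
} else {
  message("mecab is not on the PATH; check the MeCab installation first.")
}
```

The directory reported here is where an installed dictionary such as mecab-ko-dic or ipadic would normally be found.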
Windows OS
First, install the MeCab binary.
- For Korean: download and uncompress mecab-ko-msvc and mecab-ko-dic-msvc.
- For Japanese: install mecab-0.996.exe.
I don’t know whether there is a Chinese version of MeCab for Windows. If there is, please let me know so that I can test and support it.
Second, install RcppMeCab in the R console. You need to install Rtools, since the package is compiled during installation. Also download the MeCab DLL and uncompress it before the installation.
# download and install `Rtools`
# download and uncompress `mecab` DLLs
install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")
RcppMeCab: Basic
RcppMeCab tries to utilize the simplicity and power that Rcpp provides. Rcpp’s vector data type and its subtypes support UTF-8 handling at the C++ level, so RcppMeCab offers a CJK development and analysis environment without a painful struggle with encoding. It is also much faster than rJava or native R code.
RcppMeCab also supports Windows. Since its shared DLL is compiled from mecab-ko, I’m not sure whether it can analyze Japanese and Chinese sentences, too. But since MeCab can analyze different languages if the user supplies an appropriate dictionary, I think this package can also analyze languages other than Korean. If you can test this package in another language environment, such as Japanese, please give me feedback.
To date, the package provides only the pos function.
pos(sentence, dict)
- You can input a text in sentence. It only accepts a character vector encoded in UTF-8, so you should check that the vector is encoded in UTF-8. You can change the encoding with the iconv function. The stringi package offers various functions to manage the encoding of a character vector.
- You can supply a user-specific location of mecab-dic in dict. If you want to use a dictionary other than the installed one, please input its full location. The package doesn’t offer a compiled dictionary. The default value is “”, so it’s okay to leave dict empty.
For example,
pos("안녕하세요.") # "안녕하세요." is Korean for "Hello."
pos("こんにちは。", "/usr/local/libexec/mecab/ipadic") # "こんにちは。" is Japanese for "Hello."
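Since pos only accepts UTF-8 input, a quick encoding check before calling it can save some confusion. The sketch below uses only base R (validUTF8 and iconv); the variable names are illustrative, and the commented-out CP949 conversion is just an assumed example of a legacy source encoding, not something the package requires.

```r
# Example input; these literals are UTF-8 when the script itself is saved as UTF-8.
sentence <- c("안녕하세요.", "こんにちは。")

# validUTF8() reports, element by element, whether the bytes form valid UTF-8.
is_utf8 <- validUTF8(sentence)

# A deliberately truncated multi-byte sequence, to show what a failed check looks like.
broken <- rawToChar(as.raw(c(0xec, 0x95)))
broken_ok <- validUTF8(broken)

if (!all(is_utf8)) {
  stop("Some elements are not valid UTF-8; convert them with iconv() first.")
}

# When the source encoding is known, convert explicitly before calling pos(), e.g.:
# sentence <- iconv(legacy_text, from = "CP949", to = "UTF-8")
```

Only once the check passes is the vector ready to be passed to pos as in the examples above.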
Further Steps
I couldn’t test the package in a Chinese environment. I believe it would work, but please give me feedback on your experience with installation and word segmentation.
Other functions will be added. First, this package focuses on processing speed, so it will combine RcppParallel and MeCab’s own lattice mode to parse sentences simultaneously. Second, other results, such as N-best output or the deinflected forms of words, will be provided soon.
This package will stay compact to keep memory usage as low as possible. Other convenience functions, such as calculating sentiment scores or making N-grams, will be supplied by other packages linked with this one.