RcppMeCab 0.0.1.1 Release
RcppMeCab 0.0.1.1 has been released on GitHub. This version adds:
- Parallelization with Intel TBB library
- join parameter is added
- user_dic parameter is added
- Code tidying
Parallelization
The package supports parallel computation with the Intel TBB library, which is included in the RcppParallel package. RcppMeCab does not make full use of RcppParallel, however, because RcppParallel doesn't support CharacterVector conversion.
In the benchmark below, the parallelized version analyzes morphemes roughly 50 to 65 times faster (by mean and median, respectively) than KoNLP, an NLP package based on rJava. For example,
# temp is a 100-length CharacterVector
# SimplePos22 is KoNLP's POS function
# sapply(temp, function(x) pos(x, "")) is RcppMeCab's pos function with R loop
# posLoop(temp, "") is RcppMeCab's loop version which runs in C++
# posParallel(temp, "") is RcppMeCab's parallelized version with Intel TBB library
> microbenchmark(SimplePos22(temp), sapply(temp, function(x) pos(x, "")), posLoop(temp, ""), posParallel(temp, ""))
Unit: milliseconds
                                 expr        min         lq      mean     median        uq       max neval
                    SimplePos22(temp) 579.159705 613.425725 650.97500 635.938074 657.36901 1014.7603   100
 sapply(temp, function(x) pos(x, "")) 156.958030 163.132787 183.11562 174.995390 188.84118  356.2329   100
                    posLoop(temp, "")  74.441377  81.289487  90.00520  84.931590  90.50949  218.8578   100
                posParallel(temp, "")   7.406578   8.639878  12.43173   9.694593  10.69478  213.4961   100
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-4 KoNLP_0.80.1 RcppParallel_4.4.0
[4] Rcpp_0.12.16
loaded via a namespace (and not attached):
[1] digest_0.6.15 withr_2.1.2 DBI_1.0.0 magrittr_1.5
[5] RSQLite_2.1.1 stringi_1.2.2 blob_1.1.1 hash_2.2.6
[9] tau_0.0-20 devtools_1.13.5 tools_3.5.0 stringr_1.3.1
[13] bit64_0.9-7 bit_1.1-14 compiler_3.5.0 rJava_0.9-10
[17] memoise_1.1.0 Sejong_0.01
# RcppMeCab is not loaded for the comparison of basic tagger and loop version
This package will keep using RcppParallel in further development. Although RcppParallel doesn’t currently support parallelization over a CharacterVector, I expect that support will be added at some point.
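The relative speedups implied by the benchmark can be worked out from the median column of the microbenchmark output above. This is only a small arithmetic sketch: the numbers are copied by hand from that output, and the variable names (median_ms, speedup) are made up for illustration.

```r
# Median timings in milliseconds, copied from the microbenchmark output above.
median_ms <- c(
  SimplePos22 = 635.938074,  # KoNLP (rJava-based)
  sapply_pos  = 174.995390,  # pos() called from an R-level sapply loop
  posLoop     = 84.931590,   # loop inside C++
  posParallel = 9.694593     # parallelized with Intel TBB
)

# How many times slower each approach is than posParallel.
speedup <- median_ms / median_ms[["posParallel"]]
print(round(speedup, 1))
```

By the medians, posParallel comes out roughly 65 times faster than SimplePos22 and a little under 9 times faster than posLoop on this machine. Results on other hardware or input will differ.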
join Parameter
You can set join = FALSE in the pos and posParallel functions. With join = FALSE, the return value contains only the morphemes, and the tags are returned as the names of the vector. For example,
> pos("Hi")
[1] "Hi/SL"
> pos("Hi", join = FALSE)
SL
"Hi"
# SL is mecab-ko's POS tag for foreign-language words
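Because the tags travel as vector names when join = FALSE, the two pieces can be separated or recombined with base R. The following is a minimal sketch that does not call RcppMeCab: the vector res is built by hand to mimic the output shown above, and the helpers to_table and rejoin are hypothetical names, not part of the package.

```r
# Hand-built stand-in for the result of pos("Hi", join = FALSE);
# with the package installed you would use the real return value instead.
res <- c(SL = "Hi")

# Hypothetical helper: put morphemes and their tags side by side.
to_table <- function(x) {
  data.frame(morpheme = unname(x), tag = names(x), stringsAsFactors = FALSE)
}

# Hypothetical helper: rebuild the "morpheme/tag" form shown for pos("Hi").
rejoin <- function(x) {
  paste(unname(x), names(x), sep = "/")
}

tab <- to_table(res)
joined <- rejoin(res)
print(tab)
print(joined)
```

Keeping the tags in names makes it easy to filter by tag, for example res[names(res) == "SL"], without parsing the "morpheme/tag" strings.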
User Dictionary
You can apply your compiled user dictionary to the pos and posParallel functions through the user_dic parameter. To compile your CSV file, please refer to GitHub. I’ll provide a full explanation of how to use mecab-dict-index in a later post.
# person.csv contains the following line:
# 폼페이오,,,,NNP,인명,F,폼페이오,*,*,*,*,*
# (폼페이오 = Pompeo, an American politician currently serving as Secretary of State)
# The CSV file is compiled with `mecab-dict-index`, using the `mecab-ko-dic` model.
> pos("폼페이오")
[1] "폼페이/NNG" "오/VCP+EC"
> pos("폼페이오", user_dic = "~/user_dic.dic")
[1] "폼페이오/NNP"