RcppMeCab 0.0.1.1 Release
RcppMeCab 0.0.1.1 has been released on GitHub. This version includes:
- Parallelization with the Intel TBB library
- A new join parameter
- A new user_dic parameter
- Code tidying
Parallelization
The package supports parallel computation with the Intel TBB library, which is included in the RcppParallel package. RcppMeCab doesn’t make full use of RcppParallel, because RcppParallel doesn’t support CharacterVector conversion.
The package now analyzes morphemes roughly 50x faster than NLP packages based on rJava. For example,
# temp is a 100-length CharacterVector
# SimplePos22 is KoNLP's POS function
# sapply(temp, function(x) pos(x, "")) is RcppMeCab's pos function with R loop
# posLoop(temp, "") is RcppMeCab's loop version which runs in C++
# posParallel(temp, "") is RcppMeCab's parallelized version with Intel TBB library
> microbenchmark(SimplePos22(temp), sapply(temp, function(x) pos(x, "")), posLoop(temp, ""), posParallel(temp, ""))
Unit: milliseconds
expr min lq mean median
SimplePos22(temp) 579.159705 613.425725 650.97500 635.938074
sapply(temp, function(x) pos(x, "")) 156.958030 163.132787 183.11562 174.995390
posLoop(temp, "") 74.441377 81.289487 90.00520 84.931590
posParallel(temp, "") 7.406578 8.639878 12.43173 9.694593
uq max neval
657.36901 1014.7603 100
188.84118 356.2329 100
90.50949 218.8578 100
10.69478 213.4961 100
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-4 KoNLP_0.80.1 RcppParallel_4.4.0
[4] Rcpp_0.12.16
loaded via a namespace (and not attached):
[1] digest_0.6.15 withr_2.1.2 DBI_1.0.0 magrittr_1.5
[5] RSQLite_2.1.1 stringi_1.2.2 blob_1.1.1 hash_2.2.6
[9] tau_0.0-20 devtools_1.13.5 tools_3.5.0 stringr_1.3.1
[13] bit64_0.9-7 bit_1.1-14 compiler_3.5.0 rJava_0.9-10
[17] memoise_1.1.0 Sejong_0.01
# RcppMeCab is not loaded for the comparison of basic tagger and loop version
This package uses RcppParallel for further development. Although RcppParallel doesn’t support parallelization over CharacterVector, I expect that support will be added at some point.
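The speedup figure can be checked against the benchmark with a little arithmetic. The following is a minimal sketch in base R that divides the median and mean timings reported above by those of posParallel. The vector names are illustrative labels only, not objects from the package.

```r
# Median and mean timings (milliseconds) copied from the microbenchmark output above.
median_ms <- c(SimplePos22 = 635.938074,
               sapply_pos  = 174.995390,
               posLoop     = 84.931590,
               posParallel = 9.694593)
mean_ms <- c(SimplePos22 = 650.97500,
             sapply_pos  = 183.11562,
             posLoop     = 90.00520,
             posParallel = 12.43173)

# Speedup of posParallel relative to each alternative.
speedup_median <- median_ms / median_ms[["posParallel"]]
speedup_mean   <- mean_ms / mean_ms[["posParallel"]]

print(round(speedup_median, 1))
print(round(speedup_mean, 1))
```

By the medians this works out to roughly 65x over SimplePos22, and by the means to roughly 52x, which is consistent with the 50x figure quoted above.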
join Parameter
You can set join = FALSE in the pos and posParallel functions. If join = FALSE is given, the return value contains the morphemes only, and the tags are returned as the vector’s names. For example,
> pos("Hi")
[1] "Hi/SL"
> pos("Hi", join = FALSE)
SL
"Hi"
# SL is mecab-ko's POS tag for foreign-language text
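The two return shapes carry the same information, so one can be converted into the other with base R. The sketch below does not call RcppMeCab at all: morphemes is a hand-written stand-in for the result of pos("Hi", join = FALSE), and the conversion logic is an illustration, not the package's own implementation.

```r
# Hypothetical stand-in for pos("Hi", join = FALSE):
# a character vector of morphemes whose names are the tags.
morphemes <- c(SL = "Hi")

# Rebuild the joined "morpheme/tag" form from the named vector.
joined <- paste(morphemes, names(morphemes), sep = "/")

# Go the other way: split each joined string at its last "/".
tags  <- sub("^.*/", "", joined)      # everything after the last slash
forms <- sub("/[^/]*$", "", joined)   # everything before the last slash
unjoined <- setNames(forms, tags)

print(joined)
print(unjoined)
```

Keeping the tags in the names is convenient when you want to filter by tag, for example morphemes[names(morphemes) == "SL"].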
User Dictionary
You can apply your compiled user dictionary in the pos and posParallel functions. To compile your CSV file, please refer to the GitHub repository. I’ll provide a full explanation of how to use mecab-dict-index in a later post.
# person.csv contains the following line:
# 폼페이오,,,,NNP,인명,F,폼페이오,*,*,*,*,*
# (폼페이오 = Pompeo, an American politician who currently serves as Secretary of State)
# The file is compiled by `mecab-dict-index` with `mecab-ko-dic` model and CSV files.
> pos("폼페이오")
[1] "폼페이/NNG" "오/VCP+EC"
> pos("폼페이오", user_dic = "~/user_dic.dic")
[1] "폼페이오/NNP"
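As a starting point before compilation, the CSV entry itself can be produced from R. This is a minimal sketch that writes the example line above to a UTF-8 file. The temporary path is illustrative, and the compilation step with mecab-dict-index is deliberately left out here.

```r
# Fields copied from the example entry above; the empty strings are the
# three columns left blank in that entry.
entry <- c("폼페이오", "", "", "", "NNP", "인명", "F", "폼페이오",
           "*", "*", "*", "*", "*")
csv_line <- paste(entry, collapse = ",")

# Write the line as UTF-8 bytes to a temporary file (illustrative path).
csv_path <- file.path(tempdir(), "person.csv")
con <- file(csv_path, open = "w")
writeLines(enc2utf8(csv_line), con, useBytes = TRUE)
close(con)

# Read it back and confirm that the entry has 13 comma-separated fields.
fields <- strsplit(readLines(csv_path, encoding = "UTF-8"), ",", fixed = TRUE)[[1]]
print(length(fields))
```

Writing with useBytes = TRUE avoids re-encoding the Korean text on systems whose native locale is not UTF-8.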