Jcseg, the project featured in this issue, is a lightweight Chinese word segmenter based on the MMSEG algorithm. It also integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization.
Jcseg core functionality
- Chinese word segmentation: the MMSEG algorithm plus Jcseg's own optimizations, with seven segmentation modes.
- Keyword extraction: based on the TextRank algorithm.
- Key phrase extraction: based on the TextRank algorithm.
- Key sentence extraction: based on the TextRank algorithm.
- Automatic summarization: based on BM25 and TextRank.
- Automatic part-of-speech tagging: based on the lexicon (statistical disambiguation is planned). The current results are not ideal, so it is not recommended for applications that need accurate part-of-speech tagging.
- Named entity tagging: based on the lexicon (statistical disambiguation is planned). Recognized entities include email addresses, web addresses, mainland China mobile phone numbers, place names, personal names, currency, dates and times, and units of length, area, and distance.
- RESTful API: an embedded Jetty server module provides high-performance access to all functions over an HTTP interface, with a standardized JSON output format that clients in any language can call directly.
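To make the segmentation item concrete, here is a minimal sketch of forward maximum matching, the simple dictionary-matching idea that MMSEG builds on. It is not Jcseg's implementation: the class name MaxMatchSketch, the segment method, and the toy English dictionary (standing in for a Chinese lexicon) are all assumptions made for illustration. MMSEG's complex mode adds further disambiguation rules on top of this, which are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative forward maximum matching; hypothetical, not Jcseg code. */
final class MaxMatchSketch {

    /**
     * Scans the text left to right and always takes the longest dictionary
     * word that starts at the current position. Characters that start no
     * dictionary word become single-character tokens.
     */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary: unspaced English stands in for Chinese text.
        Set<String> dict = new HashSet<>(
                Arrays.asList("the", "tab", "table", "down", "there"));
        // "table" wins over the shorter "tab" because the longest match is preferred.
        System.out.println(segment("thetabledownthere", dict, 5));
    }
}
```

Greedy matching like this can pick the wrong split when a long word overlaps the start of the next one, which is the kind of ambiguity the more elaborate modes are meant to handle.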
Jcseg quick start
Terminal test:
- cd to the Jcseg root directory.
- Build: ant all (or compile with Maven).
- Run: java -jar jcseg-core-{version}.jar
- You will see the following terminal interface.
- Enter text at the prompt to start testing (enter one of the :seg_mode commands, such as :complex or :simple, to switch modes and try the different segmentation algorithms).
+--------Jcseg chinese word tokenizer demo-------------------+
|- @Author chenxin<chenxin619315@gmail.com>                  |
|- :seg_mode  : switch to specified tokenizer mode.          |
|- (:complex,:simple,:most,:detect,:delimiter,:NLP,:ngram)   |
|- :keywords  : switch to keywords extract mode.             |
|- :keyphrase : switch to keyphrase extract mode.            |
|- :sentence  : switch to sentence extract mode.             |
|- :summary   : switch to summary extract mode.              |
|- :help      : print this help menu.                        |
|- :quit      : to exit the program.                         |
+------------------------------------------------------------+
jcseg~tokenizer:complex>>
Test sample:
Ambiguity and synonyms: research on the origin of life. Mixed words: do a B-ultrasound to check the body, what is the essence of X-rays, go sing karaoke at the ktv in Chidu today, Doraemon is an anime protagonist. Units and full-width characters: the university trip began on August 6, 2009, the temperature in Yueyang today is 38.6℃, that is, 101.48℉. Chinese numerals and fractions: you get two thirtieths, Xiao Chen takes five thirtieths, the remaining twenty-three thirtieths are all mine, that is 1998 years ago, Sichuan malatang is very delicious, the spirit left behind by the May Fourth Movement. The notebook was on sale at a loss. Name recognition: I am Chen Xin, also the author of jcseg, Zhuge Liang of the Three Kingdoms period is a genius, we cheer for Liu Xiang together, Luo Zhigao was very excited because Lao Wu sent him a notebook. Foreign name recognition: on July 1, Iceland time, Tom Cruise, who is filming locally, admitted through a spokesman that his marriage with his third wife Katie Holmes (his first and second wives were Mimi Rogers and Nicole Kidman) is ending. Paired punctuation: the winner of this "Imagine Cup" hacker technology competition is Zhang San of Telecom 09-2BF, and the prizes are a C++ programming language book and a set of "PHP tutorial" from "Imagine Network". Special letters: 【Ⅰ】 (Ⅱ). English and numbers: bug report chenxin619315@gmail.com or visit http://code.google.com/p/jcseg, we all admire the hacker spirit! Special numbers: ① ⑩ ⑽ ㈩.
Word segmentation results:
Ambiguity /n and /o synonyms /n :/w research /vn thinking /vn research /vn research /vn life /n origin /n, /w Mixed words :/w doing /v B-ultrasound /n examination /vn body /n, /w X-ray /n X-ray /n essence /n is /a what /n, /w today /t go to /q Qidu ktv/nz sing /n karaoke /nz go to /q, /w Doraemon /nz is /a /q anime /n in /q /u protagonist /n, /w units /n and /o full Angle /nz :/w 2009 /m August /m 6 /m Start /n University /n Tour, /w Yuyang /ns Today /t /u temperature /n is /u 38.6℃/m,/w is /v 101.48℉/m,/w Chinese /n Mandarin /n digital /n /w score /n :/w you /r minutes /h two-thirtieth /m,/w Xiaochen /nr /nh Five-thirtieth /m,/w remaining /v /u twenty-three thirtieth /m all /a is /a my /nt,/w That is /c 1998 /m 1998 /m before /v's /u thing /i /i,/w Sichuan /ns Malatang /n is very /m delicious /v, /w May 4th Movement /nz left /v /u May 4th /m 54/m spirit /n. /w notebook /n 50% off /m 50% off /m free mail at a loss /v Great Sale Sale. /w name /n identification /v :/w I /r is /a Chen Xin /nr, /w also /e is /a jcseg/en /u author /n, /w Three Kingdoms /mq period /u Zhuge Liang /nr is a genius /n, /w we /r together /d to /v Liu Xiang /nr refueling /v, /w Luo Zhigao /nr excited /v extremely /u because /c Wu /nr sent him /r a notebook /n. /w Foreign language /n name /j identification /v: /w Iceland /ns time /n July /m 1 /m, /w is filming /u local /s /vi /u Tom Cruise /nr Cruise /nr acknowledged by /v speaker /n /v, /w He /r and /u third /m /q wife /n Katie Hermes /nr (/w first /a second /j /q wife /n Mimi Rogers /nr, /w Nicole Kidman /nr) /w's /u marriage /n is about to end /d /v. /w pairing /v Mark /n :/w This /r "/w Fancy Cup /nz" /w Hacker /n Technology /n Contest /vn /u winner /n is /u Telecom /nt 09/en -/w bf/en 2bf/en's /u Chang SAN /nr, /w Award /vn c++/en Programming /gi language /n book /ns and /o [/w Imagine Network /nz] /w /u "/w PHP tutorial /nz" /w set /m. 
/w Special /a letter /n :/w 【/w Ⅰ/nz 】/w (/w Ⅱ/m )/w ,/w English /n English /n numbers /n :/w bug/en report/en chenxin/en 619315/en gmail/en com/en chenxin619315@gmail.com/en or/en visit/en http/en :/w //w //w code/en google/en com/en code.google.com/en //w p/en //w jcseg/en ,/w we/en all/en admire/en appreciate/en like/en love/en enjoy/en the/en hacker/en spirit/en mind/en !/w special /a number /n :/w ①/m ⑩/m ⑽/m ㈩/m ./w
Jcseg Maven repository
- jcseg-core
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-core</artifactId>
    <version>2.6.2</version>
</dependency>
- jcseg-analyzer (Lucene or Solr)
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-analyzer</artifactId>
    <version>2.6.2</version>
</dependency>
- jcseg-elasticsearch
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-elasticsearch</artifactId>
    <version>2.6.2</version>
</dependency>
- jcseg-server (standalone application server)
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-server</artifactId>
    <version>2.6.2</version>
</dependency>
Jcseg part-of-speech tag reference
Noun n, time word t, place word s, locative word f, numeral m, measure word q, distinguishing word b, pronoun r, verb v, adjective a, state word z, adverb d, preposition p, conjunction c, particle u, modal particle y, interjection e, onomatopoeia o, idiom i, fixed expression l, abbreviation j, prefix h, suffix k, morpheme g, non-morpheme character x, punctuation mark w. In addition, for corpus applications, tags for proper nouns are added: personal name nr, place name ns, organization name nt, other proper noun nz.
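As a small illustration of how the tag list can be used, the sketch below splits a word/tag token of the kind shown in the segmentation results and looks up the tag's name. The class PosTagSketch and its methods are hypothetical helpers written for this article, not part of the Jcseg API, and the table contains only the tags listed above, so other tags that appear in the output (vn or en, for example) are reported as unknown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for reading word/tag tokens; not part of Jcseg. */
final class PosTagSketch {

    // Tag codes and names taken from the list in this article.
    private static final String[][] TAGS = {
        {"n", "noun"}, {"t", "time word"}, {"s", "place word"},
        {"f", "locative word"}, {"m", "numeral"}, {"q", "measure word"},
        {"b", "distinguishing word"}, {"r", "pronoun"}, {"v", "verb"},
        {"a", "adjective"}, {"z", "state word"}, {"d", "adverb"},
        {"p", "preposition"}, {"c", "conjunction"}, {"u", "particle"},
        {"y", "modal particle"}, {"e", "interjection"}, {"o", "onomatopoeia"},
        {"i", "idiom"}, {"l", "fixed expression"}, {"j", "abbreviation"},
        {"h", "prefix"}, {"k", "suffix"}, {"g", "morpheme"},
        {"x", "non-morpheme character"}, {"w", "punctuation mark"},
        {"nr", "personal name"}, {"ns", "place name"},
        {"nt", "organization name"}, {"nz", "other proper noun"}
    };

    private static final Map<String, String> NAMES = new HashMap<>();
    static {
        for (String[] tag : TAGS) {
            NAMES.put(tag[0], tag[1]);
        }
    }

    /** Returns {word, tag}; the tag is "" when the token has no "/tag" suffix. */
    static String[] split(String token) {
        // Use the last slash so that a token such as "//w" yields word "/" and tag "w".
        int slash = token.lastIndexOf('/');
        if (slash <= 0) {
            return new String[] {token, ""};
        }
        return new String[] {token.substring(0, slash), token.substring(slash + 1)};
    }

    /** Returns the tag's name, or "unknown" for tags outside the table. */
    static String describe(String tag) {
        return NAMES.getOrDefault(tag, "unknown");
    }

    public static void main(String[] args) {
        String[] parts = split("origin/n");
        System.out.println(parts[0] + " -> " + describe(parts[1]));
    }
}
```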
Jcseg synonym management
1. Unified lexicon category:
Since version 2.2.0, Jcseg has unified synonyms into a single category, CJK_SYN. You can append your synonym definitions directly to the existing synonym lexicon, vendors/lexicons/lex-synonyms.lex, or create a separate lexicon file: mark it as a synonym lexicon by putting the CJK_SYN definition on the first line, then add synonym definitions on one or more lines in the format described below.
2. Unified synonym format:
Format: root,synonym 1[/optional pinyin],synonym 2[/optional pinyin],...,synonym n[/optional pinyin]
For example:
Single-line definition: research, study, study, grind/yan mo, research and development
Multi-line definition (as long as the root is the same, all synonyms defined belong to the same set): Central one, Central one, Central one, Central one channel, Central one, Central One channel, Central one
3. Format and requirements:
1. The first word is the root of the synonym set. It must already exist in the CJK_WORD lexicon; if it does not, the synonym definition is ignored.
2. The root distinguishes one synonym set from another. If two lines define synonyms for the same root, they are automatically merged into one synonym set.
3. Jcseg uses org.lionsoul.jcseg.SynonymsEntry to manage synonym sets; every IWord entry object has a SynonymsEntry attribute that points to its synonym set.
4. SynonymsEntry.rootWord stores the root of the synonym set. When merging synonyms, it is recommended to replace them uniformly with the root.
5. For synonyms other than the root, Jcseg automatically checks for the corresponding IWord entry, creates it if needed, and adds it to the CJK_WORD lexicon, so these synonyms do not have to exist in the CJK_WORD lexicon beforehand.
6. The other synonyms automatically inherit the root's part of speech and entity definition. They keep the pinyin defined for the term in the CJK_WORD lexicon (if the term exists there), or the pinyin can be defined separately by appending "/pinyin" to the term.
7. All IWord entries in a set defined for the same root point to the same SynonymsEntry object, so synonyms automatically reference one another.
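The merging rules above can be sketched in a few lines of Java. This is only an illustration of the described behavior, not Jcseg's loader: SynonymLexiconSketch, parse, and the knownWords parameter (standing in for the CJK_WORD lexicon) are hypothetical names, and the sketch drops the optional pinyin instead of storing it. It also returns plain string sets rather than SynonymsEntry objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical parser for synonym definition lines; not Jcseg's own loader. */
final class SynonymLexiconSketch {

    /**
     * Parses lines of the form "root,syn1[/pinyin],syn2[/pinyin],...".
     * Lines that share a root are merged into one set, and lines whose root
     * is not in knownWords are ignored.
     */
    static Map<String, Set<String>> parse(List<String> lines, Set<String> knownWords) {
        Map<String, Set<String>> sets = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s*,\\s*");
            if (parts.length < 2) {
                continue; // blank line or a root with no synonyms
            }
            String root = stripPinyin(parts[0]);
            if (!knownWords.contains(root)) {
                continue; // the root must already exist in the word lexicon
            }
            // Same root seen before: reuse its set so the definitions merge.
            Set<String> set = sets.computeIfAbsent(root, k -> new LinkedHashSet<>());
            set.add(root);
            for (int i = 1; i < parts.length; i++) {
                String term = stripPinyin(parts[i]);
                if (!term.isEmpty()) {
                    set.add(term);
                }
            }
        }
        return sets;
    }

    /** Removes an optional "/pinyin" suffix from a term. */
    private static String stripPinyin(String term) {
        int slash = term.indexOf('/');
        return (slash < 0 ? term : term.substring(0, slash)).trim();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "research,study,grind/yan mo",
                "research,development",
                "missing,other");
        Set<String> knownWords = new HashSet<>(Arrays.asList("research"));
        // The two "research" lines merge; the "missing" line is skipped.
        System.out.println(parse(lines, knownWords));
    }
}
```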
You can read more on your own.