An open source lightweight Chinese word segmenter

2022-09-02 (last updated 2025-02-24)

Jcseg is a lightweight Chinese word segmenter based on the MMSEG algorithm. It also integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization.
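MMSEG builds on dictionary-based maximum matching. The following is a minimal sketch of plain forward maximum matching only, not Jcseg's actual code. The class name, the toy dictionary, and the use of unspaced English text in place of Chinese are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Jcseg.
class MaxMatchSketch {
    // Greedy forward maximum matching: at each position take the longest
    // dictionary word (up to maxLen characters); if nothing matches,
    // fall back to emitting a single character.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            out.add(text.substring(pos, end));
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy dictionary (assumed); unspaced English stands in for Chinese text.
        Set<String> dict = new HashSet<>(Arrays.asList(
            "the", "theta", "table", "down", "there", "bled", "own"));
        // prints [theta, bled, own, there]
        System.out.println(segment("thetabledownthere", dict, 5));
    }
}
```

The greedy result differs from the intended reading "the table down there". This is the kind of ambiguity that MMSEG's complex mode tries to reduce by comparing multi-word candidate chunks instead of committing to the first longest match.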

Jcseg core functionality

  • Chinese word segmentation: MMSEG algorithm plus Jcseg's own optimizations, with seven segmentation modes.
  • Keyword extraction: based on the TextRank algorithm.
  • Key phrase extraction: based on the TextRank algorithm.
  • Key sentence extraction: based on the TextRank algorithm.
  • Automatic summarization: based on the BM25 and TextRank algorithms.
  • Automatic part-of-speech tagging: based on the lexicon plus (statistical disambiguation). The results are currently not ideal, so it is not recommended for applications that need accurate part-of-speech tags.
  • Named entity annotation: based on the lexicon plus (statistical disambiguation). Recognizes email addresses, web addresses, mainland China mobile phone numbers, place names, person names, currency, dates and times, and units of length, area, and distance, among others.
  • RESTful API: an embedded Jetty server module provides high-performance HTTP interfaces for all functions, with a standardized JSON output format that clients in any language can call directly.
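To make the TextRank-based extraction above more concrete, here is a minimal sketch of TextRank keyword ranking over an already segmented token list. It is not Jcseg's implementation: the class name, the window size, the damping factor of 0.85, and the fixed 30 iterations are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Jcseg.
class TextRankSketch {
    static List<String> topKeywords(List<String> tokens, int window, int topN) {
        // 1. Build an undirected co-occurrence graph: two distinct words are
        //    linked when they appear within `window` positions of each other.
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String a = tokens.get(i);
            graph.computeIfAbsent(a, k -> new LinkedHashSet<>());
            int limit = Math.min(tokens.size(), i + window);
            for (int j = i + 1; j < limit; j++) {
                String b = tokens.get(j);
                if (a.equals(b)) continue;
                graph.computeIfAbsent(b, k -> new LinkedHashSet<>());
                graph.get(a).add(b);
                graph.get(b).add(a);
            }
        }
        // 2. Iterate the PageRank-style update with damping factor d.
        double d = 0.85;
        Map<String, Double> score = new HashMap<>();
        for (String w : graph.keySet()) score.put(w, 1.0);
        for (int iter = 0; iter < 30; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (String w : graph.keySet()) {
                double sum = 0.0;
                for (String nb : graph.get(w)) {
                    sum += score.get(nb) / graph.get(nb).size();
                }
                next.put(w, (1 - d) + d * sum);
            }
            score = next;
        }
        // 3. Sort words by descending score and keep the top N.
        final Map<String, Double> finalScore = score;
        List<String> words = new ArrayList<>(graph.keySet());
        words.sort((x, y) -> Double.compare(finalScore.get(y), finalScore.get(x)));
        return new ArrayList<>(words.subList(0, Math.min(topN, words.size())));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "segmentation", "lexicon", "segmentation", "mode", "segmentation", "algorithm");
        // prints [segmentation]
        System.out.println(topKeywords(tokens, 2, 1));
    }
}
```

In this toy input "segmentation" co-occurs with every other word, so it collects the highest score. A real pipeline would normally filter stop words and restrict candidates by part of speech before ranking.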

Jcseg quick start

Terminal test:

  1. cd to the Jcseg root directory.
  2. Run ant all (or compile with Maven).
  3. Run: java -jar jcseg-core-{version}.jar
  4. You will see the terminal interface shown below.
  5. Enter text at the prompt to start testing (enter one of the :seg_mode parameters to switch between the segmentation modes).
+--------Jcseg chinese word tokenizer demo-------------------+
|- @Author chenxin<chenxin619315@gmail.com>                  |
|- :seg_mode  : switch to specified tokenizer mode.          |
|- (:complex,:simple,:most,:detect,:delimiter,:NLP,:ngram)   |
|- :keywords  : switch to keywords extract mode.             |
|- :keyphrase : switch to keyphrase extract mode.            |
|- :sentence  : switch to sentence extract mode.             |
|- :summary   : switch to summary extract mode.              |
|- :help      : print this help menu.                        |
|- :quit      : to exit the program.                         |
+------------------------------------------------------------+
jcseg~tokenizer:complex>> 

Test sample (shown in English translation; the original input is Chinese):

Ambiguity and synonyms: research the origin of life.
Mixed words: get a B-ultrasound to check the body; what is the nature of X-rays; go sing karaoke at Qidu KTV today; Doraemon is the protagonist of an anime.
Units and full-width characters: the university journey began on August 6, 2009; the temperature in Yueyang today is 38.6℃, that is, 101.48℉.
Chinese numerals and fractions: you get two-thirtieths, Xiao Chen takes five-thirtieths, and the remaining twenty-three thirtieths are all mine; that was before 1998; Sichuan malatang is delicious; the May Fourth spirit left behind by the May Fourth Movement. Notebooks at 50% off with free shipping, a clearance sale at a loss.
Name recognition: I am Chen Xin, also the author of jcseg; Zhuge Liang of the Three Kingdoms period was a genius; let's cheer for Liu Xiang together; Luo Zhigao was extremely excited because Lao Wu gave him a notebook.
Foreign name recognition: On July 1, Iceland time, Tom Cruise, who was filming locally, admitted through a spokesman that his marriage to his third wife, Katie Holmes (his first and second wives were Mimi Rogers and Nicole Kidman), was coming to an end.
Paired punctuation: the winner of this "Imagine Cup" hacker technology contest is Zhang San of Telecom 09-2BF; the prize is a copy of the book C++ Programming Language and a set of the "PHP Tutorial" from "Imagine Network".
Special letters: 【Ⅰ】 (Ⅱ).
English and numbers: bug report chenxin619315@gmail.com or visit http://code.google.com/p/jcseg, we all admire the hacker spirit!
Special numbers: ① ⑩ ⑽ ㈩.

Word segmentation results:

Ambiguity/n and/o synonyms/n :/w research/vn thinking/vn research/vn research/vn life/n origin/n ,/w
Mixed words :/w doing/v B-ultrasound/n examination/vn body/n ,/w X-ray/n X-ray/n essence/n is/a what/n ,/w today/t go to/q Qidu ktv/nz sing/n karaoke/nz go to/q ,/w Doraemon/nz is/a /q anime/n in/q /u protagonist/n ,/w
units/n and/o full-width/nz :/w 2009/m August/m 6/m Start/n University/n Tour ,/w Yueyang/ns Today/t /u temperature/n is/u 38.6℃/m ,/w is/v 101.48℉/m ,/w
Chinese/n Mandarin/n digital/n /w score/n :/w you/r minutes/h two-thirtieth/m ,/w Xiaochen/nr /nh Five-thirtieth/m ,/w remaining/v /u twenty-three thirtieth/m all/a is/a my/nt ,/w That is/c 1998/m 1998/m before/v 's/u thing/i /i ,/w Sichuan/ns Malatang/n is very/m delicious/v ,/w May 4th Movement/nz left/v /u May 4th/m 54/m spirit/n ./w notebook/n 50% off/m 50% off/m free mail at a loss/v Great Sale Sale ./w
name/n identification/v :/w I/r is/a Chen Xin/nr ,/w also/e is/a jcseg/en /u author/n ,/w Three Kingdoms/mq period/u Zhuge Liang/nr is a genius/n ,/w we/r together/d to/v Liu Xiang/nr refueling/v ,/w Luo Zhigao/nr excited/v extremely/u because/c Wu/nr sent him/r a notebook/n ./w
Foreign language/n name/j identification/v :/w Iceland/ns time/n July/m 1/m ,/w is filming/u local/s /vi /u Tom Cruise/nr Cruise/nr acknowledged by/v speaker/n /v ,/w He/r and/u third/m /q wife/n Katie Holmes/nr (/w first/a second/j /q wife/n Mimi Rogers/nr ,/w Nicole Kidman/nr )/w 's/u marriage/n is about to end/d /v ./w
pairing/v Mark/n :/w This/r "/w Imagine Cup/nz "/w Hacker/n Technology/n Contest/vn /u winner/n is/u Telecom/nt 09/en -/w bf/en 2bf/en 's/u Zhang San/nr ,/w Award/vn c++/en Programming/gi language/n book/ns and/o [/w Imagine Network/nz ]/w /u "/w PHP tutorial/nz "/w set/m ./w
Special/a letter/n :/w 【/w ⅰ/nz 】/w (/w ⅱ/m )/w ,/w
English/n English/n numbers/n :/w bug/en report/en chenxin/en 619315/en gmail/en com/en chenxin619315@gmail.com/en or/en visit/en http/en :/w //w //w code/en google/en com/en code.google.com/en //w p/en //w jcseg/en ,/w we/en all/en admire/en appreciate/en like/en love/en enjoy/en the/en hacker/en spirit/en mind/en !/w
special/a number/n :/w ①/m ⑩/m ⑽/m ㈩/m ./w
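The results above pair each token with a part-of-speech tag after a slash. As a sketch of how such output could be consumed, the hypothetical helper below splits whitespace-separated word/tag tokens into pairs. It assumes tokens contain no internal spaces (true for Chinese words, not for multi-word English glosses) and is not part of Jcseg.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jcseg.
class TaggedTokenParser {
    // Splits "word/tag word/tag ..." into {word, tag} pairs.
    // The LAST slash separates word from tag, so a slash token such as
    // "//w" is read as word "/" with tag "w". Tokens with no slash, or with
    // a slash only at position 0, are returned with an empty tag.
    static List<String[]> parse(String line) {
        List<String[]> out = new ArrayList<>();
        for (String tok : line.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            int cut = tok.lastIndexOf('/');
            if (cut <= 0) {
                out.add(new String[] {tok, ""});
            } else {
                out.add(new String[] {tok.substring(0, cut), tok.substring(cut + 1)});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("code.google.com/en //w p/en //w jcseg/en")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```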

Jcseg Maven repository

  • jcseg-core
    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>jcseg-core</artifactId>
        <version>2.6.2</version>
    </dependency>
  • jcseg-analyzer (lucene or solr)
    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>jcseg-analyzer</artifactId>
        <version>2.6.2</version>
    </dependency>
  • jcseg-elasticsearch
    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>jcseg-elasticsearch</artifactId>
        <version>2.6.2</version>
    </dependency>
  • jcseg-server (independent application server)
    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>jcseg-server</artifactId>
        <version>2.6.2</version>
    </dependency>

Jcseg part-of-speech tag reference

The basic tags are: noun n, time word t, place word s, locality word f, numeral m, quantifier q, distinguishing word b, pronoun r, verb v, adjective a, state word z, adverb d, preposition p, conjunction c, particle u, modal particle y, interjection e, onomatopoeia o, idiom i, idiomatic expression l, abbreviation j, prefix h, suffix k, morpheme g, non-morpheme character x, punctuation mark w. In addition, from the perspective of corpus applications, proper noun tags are added: person name nr, place name ns, organization name nt, other proper noun nz.
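For programmatic use, the tag list can be kept as a small lookup table. The sketch below is a hypothetical helper, not a Jcseg class; the English descriptions are translations of the list above. Tags outside that list (the sample output also shows tags such as vn and en) fall back to "unknown".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table for illustration; not part of Jcseg.
class PosTags {
    static final Map<String, String> NAMES = new LinkedHashMap<>();

    static {
        String[][] pairs = {
            {"n", "noun"}, {"t", "time word"}, {"s", "place word"},
            {"f", "locality word"}, {"m", "numeral"}, {"q", "quantifier"},
            {"b", "distinguishing word"}, {"r", "pronoun"}, {"v", "verb"},
            {"a", "adjective"}, {"z", "state word"}, {"d", "adverb"},
            {"p", "preposition"}, {"c", "conjunction"}, {"u", "particle"},
            {"y", "modal particle"}, {"e", "interjection"}, {"o", "onomatopoeia"},
            {"i", "idiom"}, {"l", "idiomatic expression"}, {"j", "abbreviation"},
            {"h", "prefix"}, {"k", "suffix"}, {"g", "morpheme"},
            {"x", "non-morpheme character"}, {"w", "punctuation mark"},
            // Proper noun tags added for corpus applications.
            {"nr", "person name"}, {"ns", "place name"},
            {"nt", "organization name"}, {"nz", "other proper noun"}
        };
        for (String[] p : pairs) {
            NAMES.put(p[0], p[1]);
        }
    }

    // Returns the description for a tag, or "unknown" if the tag is not listed.
    static String describe(String tag) {
        return NAMES.getOrDefault(tag, "unknown");
    }

    public static void main(String[] args) {
        System.out.println(describe("nr")); // prints person name
        System.out.println(describe("vn")); // prints unknown
    }
}
```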

Jcseg synonym management

1. Unified synonym lexicon category:

Since version 2.2.0, jcseg has unified synonyms under a single category, CJK_SYN. You can append your synonym definitions directly to the existing synonym lexicon, vendors/lexicons/lex-synonyms.lex, or create a separate lexicon file: mark it as a synonym lexicon by putting the CJK_SYN definition on its first line, then add synonym definitions one per line in the format described below.

2. Unified synonym format:

Format: root,synonym 1[/optional pinyin],synonym 2[/optional pinyin],...,synonym n[/optional pinyin]

For example:

Single-line definition: research, study, study, grind/yan mo, research and development

Multi-line definition (as long as the root is the same, all the synonyms defined belong to the same set): Central one, Central one, Central one, Central one channel, Central one, Central One channel, Central one

3. Format and requirements:

  1. The first word is the root of the synonym set. It must already exist in the CJK_WORD lexicon; if it does not, the synonym definition is ignored.
  2. The root distinguishes one synonym set from another. If two definition lines share the same root, they are automatically merged into one synonym set.
  3. jcseg uses org.lionsoul.jcseg.SynonymsEntry to manage synonym sets. Every IWord entry object has a SynonymsEntry attribute that points to its synonym set.
  4. SynonymsEntry.rootWord stores the root of the synonym set. When merging synonyms, replacing them uniformly with the root is recommended.
  5. For synonyms other than the root, jcseg automatically detects and creates the corresponding IWord entry objects and adds them to the CJK_WORD lexicon, so these synonyms do not have to exist in the CJK_WORD lexicon beforehand.
  6. The other synonyms automatically inherit the root's part of speech and entity definition. They also keep their pinyin from the CJK_WORD lexicon (if the word exists there), or the pinyin can be defined separately by appending "/pinyin" to the word.
  7. All IWord entries in a set defined by the same synonym definition point to the same SynonymsEntry object, so synonyms automatically reference each other.
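The merging rules can be sketched as a small loader. This is an illustration under stated assumptions, not Jcseg's loader: the class and method names are hypothetical, the header is assumed to be a line containing only CJK_SYN, and the known-word set stands in for the CJK_WORD lexicon.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Jcseg.
class SynonymLexiconSketch {
    // Maps each root to its merged synonym set (the root itself included).
    static Map<String, Set<String>> load(List<String> lines, Set<String> knownWords) {
        Map<String, Set<String>> sets = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Skip blank lines and the category header (assumed form).
            if (line.isEmpty() || line.equals("CJK_SYN")) continue;
            String[] parts = line.split(",");
            if (parts.length == 0) continue;
            String root = stripPinyin(parts[0]);
            // A definition whose root is not a known word is ignored.
            if (!knownWords.contains(root)) continue;
            // Lines sharing a root are merged into one set.
            Set<String> set = sets.computeIfAbsent(root, k -> new LinkedHashSet<>());
            set.add(root);
            for (int i = 1; i < parts.length; i++) {
                String syn = stripPinyin(parts[i]);
                if (!syn.isEmpty()) set.add(syn);
            }
        }
        return sets;
    }

    // Drops an optional "/pinyin" suffix from a lexicon item.
    static String stripPinyin(String item) {
        String s = item.trim();
        int cut = s.indexOf('/');
        return cut < 0 ? s : s.substring(0, cut).trim();
    }

    public static void main(String[] args) {
        // Placeholder ASCII words stand in for Chinese lexicon entries.
        List<String> lines = Arrays.asList(
            "CJK_SYN",
            "tv1,channel one,first channel",
            "tv1,cctv1/yang shi yi tai",
            "ghost,phantom");
        Set<String> known = new HashSet<>(Arrays.asList("tv1"));
        // prints {tv1=[tv1, channel one, first channel, cctv1]}
        System.out.println(load(lines, known));
    }
}
```

The "ghost" line is dropped because its root is not in the known-word set, while the two "tv1" lines collapse into a single set. The sketch discards the pinyin; a fuller version would store it with each entry.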

You can read more on your own.


Ictcoder Free Source Code An open source lightweight Chinese word divider https://ictcoder.com/an-open-source-lightweight-chinese-word-divider/
