RegEx to Parse CC-CEDICT Entries
I haven’t had a chance to update the blog as much as I’d like over the past couple of weeks. The upside is that I’ve had extra time to work on the 读者 (DuZhe) Text Analyzer, which is progressing very well. The reader works in a very rough form, but it still needs a reasonable UI. I also have an article that I’m polishing up now and that is almost ready to publish.
While evaluating the different dictionaries to use with the reader, I needed to parse CC-CEDICT entries from their text form. A quick Google search didn’t turn up any articles with a simple RegEx solution for doing just that, so I jotted down my own RegEx for parsing CC-CEDICT. Hopefully this will help other people searching for the same thing.
The RegEx:
^\s*(.+)\s+(.+)\s+\[(.+)\]\s+\/(.+)\/\s*$
This RegEx ignores any leading or trailing whitespace around the actual dictionary entry. It assumes that each line follows the standard CC-CEDICT format (Traditional Simplified [pinyin] /English 1/English 2/) and produces four capture groups (1: Traditional, 2: Simplified, 3: Pinyin, 4: English). The individual English definitions in group 4 are still separated by ‘/’, but the leading and trailing ‘/’ are removed.
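As a rough illustration, here is a minimal sketch of how the RegEx might be used from Python. The choice of Python, the helper name parse_cedict_line, and the dictionary keys are assumptions made for this example, not part of the reader itself. The sketch also skips lines starting with ‘#’, which CC-CEDICT uses for comments.

```python
import re

# The pattern from the post, unchanged.
CEDICT_LINE = re.compile(r"^\s*(.+)\s+(.+)\s+\[(.+)\]\s+\/(.+)\/\s*$")


def parse_cedict_line(line):
    """Parse one CC-CEDICT line (hypothetical helper).

    Returns a dict for an entry line, or None for comment lines
    and for lines that do not match the pattern.
    """
    if line.lstrip().startswith("#"):
        return None  # comment or header line
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, english = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        # Group 4 still contains '/' between definitions, so split it here.
        "definitions": english.split("/"),
    }


entry = parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/\n")
print(entry)
```

Because the first two groups use the greedy (.+), the pattern relies on each line being well formed. Swapping them for (\S+) would probably be a stricter alternative, but I have not tested that against the full dictionary file.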














