Course Portal: Chapter Details

Computer-Aided Translation

王琴、赵萱、张烨、陕晋芬、吴佳卉、张井、郑珮雯

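Frontier Insights

1 Recommended Resources
2 Journal Literature
3 Knowledge in Practice
4 Reading List

Common corpus resources:
Common online dictionary resources: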

[1]谢智敏,郭倩玲.基于深度学习的学术搜索引擎——Semantic Scholar[J].情报杂志,2017,36(08):175-182.

[2]王华树,张成智.大数据时代译者的搜索能力探究[J].中国科技翻译,2018,31(04):26-29.DOI:10.16024/j.cnki.issn1002-0489.2018.04.025.

[3]陆峰.中国搜索引擎十五年:从信息到服务的连接[J].互联网经济,2015(11):82-89.

[4]秦洪武,王克非.基于语料库的翻译语言分析——以“so…that”的汉语对应结构为例[J].现代外语,2004(01):40-48+105-106.

[5]许伟.平行语料库在翻译批评中的应用——以培根Of Studies的不同译本为例[J].外语研究,2006(02):54-59.

[6]汪兴富,Mark Davies,刘国辉.美国当代英语语料库(COCA)——英语教学与研究的良好平台[J].外语电化教学,2008(05):27-33.

[7]刘泽权,闫继苗.基于语料库的译者风格与翻译策略研究——以《红楼梦》中报道动词及英译为例[J].解放军外国语学院学报,2010,33(04):87-92+128.

[8]王子颖.法律语篇中shall和may的翻译对比研究[J].上海翻译,2013(04):52-57.

[9]冷雪莲.基于COCA语料库辨析英语同义词Capable和Competent[J].成都师范学院学报,2015,31(02):54-58.

[10]兰丽珍.基于COCA语料库的英语近义词研究——以careful和cautious为例[J].内蒙古财经大学学报,2017,15(06):107-110.DOI:10.13895/j.cnki.jimufe.2017.06.028.

[11]刘辉,龚芳霞.基于COCA语料库的英语近义词对比分析——以“vice”和“associate”为例[J].中国石油大学胜利学院学报,2020,34(04):63-66.

About the English corpora:

1. Who created these corpora?

The underlying corpus architecture and web interface were created by Mark Davies, (retired) Professor of Linguistics. In most cases, I also designed, collected, edited, and annotated the corpora. In the case of the BNC, Strathy, EEBO, and Hansard corpora, I received the texts from others, and "just" created the architecture and interface. So although I use the terms "we" and "us" on this and other pages, most activities related to the development of most of these corpora were actually carried out by just one person.

2. Who else contributed?

Multiple corpora	The Corpus del Español, the Corpus do Português, and the new Corpus of Historical American English were funded by large grants from the National Endowment for the Humanities.
Multiple corpora	Paul Rayson provided the CLAWS tagger, which was used for all of the English corpora.
COCA	Some BYU students helped to scan a few of the novels.
COHA	Several BYU students helped to scan novels, magazines, and non-fiction books, and to help process and correct the files and lexicon.
Google Books	Based on the datasets from Google Books.
BNC	The original texts were licensed for re-use from Oxford University Press.
Strathy	The textual corpus was designed and created at the Strathy Language Unit at Queen's University in Canada.
Hansard, EEBO	The vast majority of the work on the corpus (including semantic tagging) was done by other participants in the SAMUELS project. I simply created the corpus architecture and interface.

3. What is the advantage of these corpora over other ones that are available?

For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus and the Bank of English, our Corpus of Contemporary American English is the only large, balanced corpus of contemporary American English. In spite of the Brown family of corpora and the ARCHER corpus, the Corpus of Historical American English is the only large and balanced corpus of historical American English. And while the ICE corpora are useful for looking at dialectal variation in English, the GloWbE corpus is about 100 times as large (and somewhat more diverse). Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allow for speed, size, annotation, and a range of queries that we believe are unmatched by other architectures, and which make them useful for corpora such as the British National Corpus, which does have other interfaces. Also, they're free -- a nice feature.

4. What software is used to index, search, and retrieve data from these corpora?

We have created our own corpus architecture, using Microsoft SQL Server as the backbone of the relational database approach. Our proprietary architecture allows for size, speed, and very good scalability that we don't believe are available with any other architecture. Even complex queries of the more than 600 million word COCA corpus or the 400 million word COHA corpus typically only take two or three seconds (and not much more for the 14 billion word iWeb corpus). In addition, because of the relational database design, we can keep adding on more annotation "modules" with little or no performance hit. Finally, the relational database design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.
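
To illustrate the general idea of a relational corpus design, here is a minimal sketch. It is not the actual English-Corpora.org architecture: it uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server, and the `tokens` table, its columns, and the `adjective_noun_pairs` helper are all hypothetical. Each token is stored as one row with its annotation in separate columns, so a query such as "adjective followed by noun" becomes a self-join on adjacent positions.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens ("
    "text_id INTEGER, pos_in_text INTEGER, word TEXT, lemma TEXT, tag TEXT)"
)

# One row per token; lemma and part-of-speech tag are annotation columns.
rows = [
    (1, 0, "strong", "strong", "ADJ"),
    (1, 1, "tea", "tea", "NOUN"),
    (1, 2, "helps", "help", "VERB"),
    (2, 0, "powerful", "powerful", "ADJ"),
    (2, 1, "computers", "computer", "NOUN"),
    (2, 2, "help", "help", "VERB"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)


def adjective_noun_pairs(connection):
    """Find every adjective immediately followed by a noun in the same text."""
    query = """
        SELECT a.word, b.word
        FROM tokens AS a
        JOIN tokens AS b
          ON b.text_id = a.text_id
         AND b.pos_in_text = a.pos_in_text + 1
        WHERE a.tag = 'ADJ' AND b.tag = 'NOUN'
        ORDER BY a.text_id, a.pos_in_text
    """
    return connection.execute(query).fetchall()


pairs = adjective_noun_pairs(conn)
for adjective, noun in pairs:
    print(adjective, noun)
```

In a design like this, a new annotation layer could be added as another column or a joined table without changing existing queries, which is one plausible reading of the "modules" mentioned above.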

5. How many people use the corpora?

As measured by Google Analytics, as of March 2022 the corpora are used by more than 75,000 registered users each month. The most widely-used corpus is the Corpus of Contemporary American English -- with more than 65,000 unique users each month. And people don't just come in, look for one word, and move on -- average time at the site each visit is between 10 and 15 minutes.

6. What do they use the corpora for?

For lots of things. Linguists use the corpora to analyze variation and change in the different languages. Some are materials developers, who use the data to create teaching materials. A high number of users are language teachers and learners, who use the corpus data to model native speaker performance and intuition. Translators use the corpora to get precise data on the target languages. Other people in the humanities and social sciences look at changes in culture and society (especially with COHA and Hansard). Some businesses purchase data from the corpora to use in natural language processing projects. And lots of people are just curious about language, and (believe it or not) just use the corpora for fun, to see what's going on with the languages currently. To get a better idea of what people are doing with the corpora, check out (or search through) the entries from the Researchers page.

7. What about copyright?

While our corpora contain some copyrighted material, there is no problem in terms of US copyright law (or US Fair Use Law), because users are limited to accessing very limited "Keyword in Context" (KWIC) displays of the text. It's kind of like the "snippet defense" used by Google. They retrieve and index billions of words of copyrighted material, but they only allow end users to access "snippets" of this data from their servers. The site also provides an extended discussion of US Fair Use Law and how it applies to our texts.
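
For readers unfamiliar with the format, a "Keyword in Context" display can be sketched in a few lines of Python. This is only an illustration of the concept, not the code behind the corpus interface; the `kwic` function, its `width` parameter, and the sample sentence are assumptions made for the example.

```python
import re


def kwic(text, keyword, width=4):
    """Return (left, hit, right) tuples with up to `width` words of context."""
    # Simple tokenizer: runs of word characters, allowing one inner apostrophe.
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


sample = "The translator checked the corpus before the editor checked the final draft."
for left, hit, right in kwic(sample, "checked", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} [{hit}] {right}")
```

Only a few words on either side of each hit are shown, so the reader never sees the full source text.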

8. Can I get access to the full text of these corpora?

Downloadable, full-text data is now available for the following corpora: iWeb, COCA, COHA, GloWbE, NOW, Wikipedia, SOAP, the TV corpus, the Movie corpus, and the Corpus del Español.

9. Is there API access to the corpora?

No, there isn't. There are two main reasons for this. First, we don't have copyright access to the texts in the corpora, and so we can only provide limited access to the corpora, via the corpus interface. Second, we're already pretty "maxed out" in terms of the one corpus server, and API access would probably lead to many more queries, which we can't handle right now. Although we don't allow API access, some people have programmed browsers (via Python or C++ or whatever) to allow for semi-automated queries (note, though, that we don't provide tech support for this).

10. My access limits (for "non-researcher") are too low. Can I increase them?

"Non-researchers" (Level 1) have 50 queries a day, or about 1,500 queries per month. For most people, this is way more than enough. But if you really do need more than 1,500 queries per month, then you might want to upgrade to a premium account, which also helps to support the corpora, in which case you will have 200 queries a day.

11. My organization doesn't list my name on a web-page. Can I still register to use the corpora?

You do not need to register as a "researcher" to use the corpora. Even the lowest level, default "non-researcher" status gives you 50 queries a day, or about 1,500 queries per month. For most people, this is way more than enough. The only downside is that you won't be included on the list of researchers, but that's not a huge deal. On the other hand, if you really do need more than 1,500 queries per month, then you might want to upgrade to a premium account, which also helps to support the corpora, in which case you will have 200 queries a day (6,000 per month).

12. I want more data than what's available via the standard interface. What can I do?

Users can purchase offline data -- such as full-text copies of the texts, frequency lists, collocates lists, and n-gram lists (e.g. all two- or three-word strings). The site provides much more detailed information on this data, as well as downloadable samples.
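
As a small illustration of what an n-gram list contains, the sketch below counts two-word strings in a toy sentence. It is not how the purchasable data is produced; the whitespace tokenization and the `ngrams` and `ngram_counts` helpers are assumptions chosen to keep the example short.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


sample_text = "the cat sat on the mat and the cat slept"
bigram_counts = ngram_counts(sample_text, 2)

# List the bigrams from most to least frequent.
for bigram, count in bigram_counts.most_common():
    print(" ".join(bigram), count)
```

A real frequency list would be built from a tokenized and tagged corpus rather than from whitespace-split text.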

13. Can my class have additional access to a corpus on a given day?

There is a limit of 250 queries per 24 hours for a "group", where a group is typically a class of students or a department at a university. If you need more queries than this, you'd want an academic / site license.

14. I don't want to see the messages that appear every 10-15 searches as I use the corpora.

If you have a premium account, you won't see these messages for as long as the account is valid (a full-year premium account is $30).

If you just want a basic account and are really bothered by the messages, you might want to consider other web-based corpora -- like those from Lancaster University (including BNCweb), CorpusEye, or the many excellent corpora from Sketch Engine. (Please be aware, though, that the subscription fee for the Sketch Engine corpora is somewhat more expensive than the cost of a premium account for the corpora.)

15. How do I cite the corpora in my published articles? Can I use screenshots from the corpora in my publication / presentation?

Please use the following information when you cite the corpus in academic publications or conference papers.

In the first reference to the corpus in your paper, please use the full name. For example, for COCA: "the Corpus of Contemporary American English" with the appropriate citation to the references section of the paper, e.g. (Davies 2008-). After that reference, feel free to use something shorter, like "COCA" (for example: "...and as seen in COCA, there are..."). Also, please do not refer to the corpus in the body of your paper as "Davies' COCA corpus", "a corpus created by Mark Davies", etc. The bibliographic entry itself is enough to indicate who created the corpus. Finally, please do not refer to any of these corpora as being part of the "BYU Corpora". Thanks.

You are also welcome to use screenshots from the corpora in your publication or presentation. There is no need to contact us for permission. Just provide this URL to your publisher, if they request it. Although there is no need to contact us for permission, we do appreciate hearing about innovative and interesting ways in which the corpora are used for research. Please feel free to send us an email (admin@english-corpora.org) to let us know about that, if you'd like.

COCA	Davies, Mark. (2008-) The Corpus of Contemporary American English (COCA). Available online at https://www.english-corpora.org/coca/.
(Frequency data)	Davies, Mark. (2008-) Word frequency data from The Corpus of Contemporary American English (COCA). Data available online at https://www.wordfrequency.info.
(N-grams data)	Davies, Mark. (2008-) N-grams data from The Corpus of Contemporary American English (COCA). Data available online at https://www.ngrams.info.
(Collocates data)	Davies, Mark. (2008-) Collocates data from The Corpus of Contemporary American English (COCA). Data available online at https://www.collocates.info.
iWeb	Davies, Mark. (2018) The iWeb Corpus. Available online at https://www.english-corpora.org/iWeb/.
COHA	Davies, Mark. (2010) The Corpus of Historical American English (COHA). Available online at https://www.english-corpora.org/coha/.
TIME	Davies, Mark. (2007) TIME Magazine Corpus. Available online at https://www.english-corpora.org/time/.
TV	Davies, Mark. (2019) The TV Corpus. Available online at https://www.english-corpora.org/tv/.
Movies	Davies, Mark. (2019) The Movie Corpus. Available online at https://www.english-corpora.org/movies/.
BNC	Davies, Mark. (2004) British National Corpus (from Oxford University Press). Available online at https://www.english-corpora.org/bnc/.
NOW	Davies, Mark. (2016-) Corpus of News on the Web (NOW). Available online at https://www.english-corpora.org/now/.
Coronavirus	Davies, Mark. (2019-) The Coronavirus Corpus. Available online at https://www.english-corpora.org/corona/.
GloWbE	Davies, Mark. (2013) Corpus of Global Web-Based English. Available online at https://www.english-corpora.org/glowbe/.
EEBO	Davies, Mark. (2017) Early English Books Online Corpus. Available online at https://www.english-corpora.org/eebo/.
Hansard	Davies, Mark. (2015) Hansard Corpus. Available online at https://www.hansard-corpus.org/.
Wikipedia	Davies, Mark. (2015) The Wikipedia Corpus. Available online at https://www.english-corpora.org/wiki/.
SOAP	Davies, Mark. (2011-) Corpus of American Soap Operas. Available online at https://www.english-corpora.org/soap/.
CORE	Davies, Mark. (2016-) Corpus of Online Registers of English (CORE). Available online at https://www.english-corpora.org/core/.
Supreme Court	Davies, Mark. (2017) Corpus of US Supreme Court Opinions. Available online at https://www.english-corpora.org/scotus/.
CAN / Strathy	Davies, Mark. (2012-) The Strathy Corpus of Canadian English (from the Strathy Language Unit, Queen's University). Available online at https://www.english-corpora.org/can/.
Corpus del Español	Davies, Mark. (2016-) Corpus del Español: Web/Dialects. Available online at http://www.corpusdelespanol.org/web-dial/.
Corpus del Español	Davies, Mark. (2002-) Corpus del Español: Historical/Genres. Available online at http://www.corpusdelespanol.org/hist-gen/.
Corpus do Português	Davies, Mark. (2016-) Corpus do Português: Web/Dialects. Available online at http://www.corpusdoportugues.org/web-dial/.
Corpus do Português	Davies, Mark and Michael Ferreira. (2006-) Corpus do Português: Historical Genres. Available online at http://www.corpusdoportugues.org/hist-gen/.
Google Books	Davies, Mark. (2011-) Google Books Corpus. (Based on Google Books n-grams). Available online at http://www.english-corpora.org/googlebooks/. Based on: Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331 (2011) [Published online ahead of print 12/16/2010].

(Source: https://www.english-corpora.org//faq.asp#x1)

王华树，刘世界，张成智. 翻译搜索指南[M]. 北京：中译出版社，2022.
