26 February 2020
GATE, a research tool for text processing
New post
 27/12/2009 4:15 PM

Hello everyone,

GATE stands for "General Architecture for Text Engineering". Please take a look at the overview below. When I have some free time I will say more about what it means for the development of text processing tools. If anyone has already looked into it, please share a few thoughts first.


GATE: a full-lifecycle open source solution for text processing

1. Introduction

GATE is nearly 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from multi-million-euro research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents [1].

Core GATE is open source free software; users can obtain free support from the user and developer community via GATE.ac.uk or on a commercial basis from our industrial partners. We are the biggest open source language processing project with a development team more than double the size of the largest comparable projects (many of which are integrated with GATE [2]). More than €5 million has been invested in GATE development [3]; our objective is to make sure that this continues to be money well spent for all GATE's users.

This note summarises the GATE software and process and gives examples of some of their uses. We believe that GATE is the leading system of its type, but as scientists we have to advise you not to take our word for it; that's why we've measured our software in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE, DUC, ...). We invite you to give it a try, to get involved with the GATE community, and to contribute to human language science, engineering and development.

2. The GATE Family

GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE is:

  • an IDE, GATE Developer [4]: an integrated development environment for language processing components bundled with a very widely used Information Extraction system and a comprehensive set of other plugins
  • a web app, GATE Teamware: a collaborative annotation environment for factory-style semantic annotation projects built around a workflow engine and a heavily-optimised backend service infrastructure
  • a framework, GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more
  • an architecture: a high-level organisational picture of language processing software composition
  • a process for the creation of robust and maintainable services

We also develop:

  • a wiki/CMS (GATE Wiki.sf.net), mainly to host our own websites and as a testbed for some of our experiments
  • a cloud computing solution for hosted large-scale text processing (GATE Cloud.net)

For more information see the family pages.

One of our original motivations was to remove the necessity for solving common engineering problems before doing useful research, or re-engineering before deploying research results into applications. Core functions of GATE take care of the lion's share of the engineering:

  • modelling and persistence of specialised data structures
  • measurement, evaluation, benchmarking (never believe a computing researcher who hasn't measured their results in a repeatable and open setting!)
  • visualisation and editing of annotations, ontologies, parse trees, etc.
  • a finite state transduction language for rapid prototyping and efficient implementation of shallow analysis methods (JAPE)
  • extraction of training instances for machine learning
  • pluggable machine learning implementations (Weka, YALE, SVM Light, ...)
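
To make the first two items in this list more concrete, here is a minimal Java sketch. It is not GATE's own API: it only assumes that annotations are modelled as typed character spans stored separately from the text, and that a system's output is scored against a gold standard using strict matching (same type, same offsets). The names `AnnotationEvalSketch`, `Span` and `score` are hypothetical and exist only for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; these classes are not part of GATE.
class AnnotationEvalSketch {

    // A stand-off annotation: a typed span of character offsets into the text.
    static final class Span {
        final String type;
        final int start; // inclusive offset
        final int end;   // exclusive offset

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // Two spans match strictly when type and both offsets are identical.
        String key() {
            return type + ":" + start + "-" + end;
        }
    }

    // Returns {precision, recall, F1} under strict matching.
    static double[] score(List<Span> gold, List<Span> predicted) {
        Set<String> goldKeys = new HashSet<>();
        for (Span s : gold) {
            goldKeys.add(s.key());
        }
        Set<String> predictedKeys = new HashSet<>();
        for (Span s : predicted) {
            predictedKeys.add(s.key());
        }
        int correct = 0;
        for (String k : predictedKeys) {
            if (goldKeys.contains(k)) {
                correct++;
            }
        }
        double precision = predictedKeys.isEmpty() ? 0.0 : (double) correct / predictedKeys.size();
        double recall = goldKeys.isEmpty() ? 0.0 : (double) correct / goldKeys.size();
        double f1 = (precision + recall == 0.0) ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // Offsets refer to the text "Alice works for Acme in Sheffield."
        List<Span> gold = List.of(
                new Span("Person", 0, 5),
                new Span("Organization", 16, 20),
                new Span("Location", 24, 33));
        List<Span> predicted = List.of(
                new Span("Person", 0, 5),
                new Span("Organization", 16, 20),
                new Span("Location", 21, 33)); // wrong left boundary, so not a strict match
        double[] r = score(gold, predicted);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", r[0], r[1], r[2]);
    }
}
```

Strict matching is only one possible choice; a lenient scheme that gives credit for overlapping spans would need a different comparison than the string key used here.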

On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).
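
As a rough idea of what the simplest kind of Information Extraction component does, the Java sketch below marks up mentions by plain dictionary lookup. It is an assumption-laden illustration and not how ANNIE is implemented: the class `GazetteerSketch`, the `Mention` type and the `lookup` method are hypothetical, matching is case-sensitive, and no word-boundary checks are made.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; this is not ANNIE or any GATE class.
class GazetteerSketch {

    // A typed mention found in the text, given as character offsets.
    static final class Mention {
        final String type;
        final int start; // inclusive offset
        final int end;   // exclusive offset

        Mention(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    // Finds every occurrence of each dictionary phrase in the text.
    // The map goes from phrase to annotation type.
    static List<Mention> lookup(String text, Map<String, String> entries) {
        List<Mention> found = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            String phrase = e.getKey();
            if (phrase.isEmpty()) {
                continue; // an empty phrase would match at every offset
            }
            int from = text.indexOf(phrase);
            while (from >= 0) {
                found.add(new Mention(e.getValue(), from, from + phrase.length()));
                from = text.indexOf(phrase, from + 1);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("Sheffield", "Location");
        entries.put("GATE", "Software");
        String text = "Sheffield is in England. Sheffield hosts GATE.";
        for (Mention m : lookup(text, entries)) {
            System.out.println(m + " " + text.substring(m.start, m.end));
        }
    }
}
```

The resulting typed spans are the kind of raw material that could later be exported as metadata for semantic annotation.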

GATE version 1 was written in the mid-1990s; at the turn of the new millennium we completely rewrote the system in Java; version 5 was released in June 2009.

2.1. Component model

One of the reasons GATE has lasted well and been successful is that the entire core is broken down into reusable chunks (using the original Java component model). Some of the APIs available in Embedded are summarised here:
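
The idea of breaking a system into reusable components can be sketched independently of GATE's actual API. The Java below is a hypothetical illustration only: `PipelineSketch`, `Component`, `Doc` and the two example components are invented names, and the design assumes that every component reads and adds annotations on a shared document so that components can be swapped or reordered freely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; these types are not GATE's API.
class PipelineSketch {

    // A typed span of character offsets into the document text.
    static final class Ann {
        final String type;
        final int start; // inclusive offset
        final int end;   // exclusive offset

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // The shared state that every component reads from and writes to.
    static final class Doc {
        final String text;
        final List<Ann> annotations = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        int count(String type) {
            int n = 0;
            for (Ann a : annotations) {
                if (a.type.equals(type)) {
                    n++;
                }
            }
            return n;
        }
    }

    // Every component implements the same small interface.
    interface Component {
        void process(Doc doc);
    }

    // Adds one "Token" annotation per run of non-whitespace characters.
    static final class WhitespaceTokeniser implements Component {
        @Override
        public void process(Doc doc) {
            int i = 0;
            int n = doc.text.length();
            while (i < n) {
                while (i < n && Character.isWhitespace(doc.text.charAt(i))) {
                    i++;
                }
                int start = i;
                while (i < n && !Character.isWhitespace(doc.text.charAt(i))) {
                    i++;
                }
                if (i > start) {
                    doc.annotations.add(new Ann("Token", start, i));
                }
            }
        }
    }

    // Builds on the tokeniser's output: tags tokens that start with an upper-case letter.
    static final class CapitalisedTagger implements Component {
        @Override
        public void process(Doc doc) {
            // Iterate over a copy because we add to the list while scanning it.
            for (Ann a : new ArrayList<>(doc.annotations)) {
                if (a.type.equals("Token") && Character.isUpperCase(doc.text.charAt(a.start))) {
                    doc.annotations.add(new Ann("Capitalised", a.start, a.end));
                }
            }
        }
    }

    // Runs the components in order over the same document.
    static void run(List<Component> pipeline, Doc doc) {
        for (Component c : pipeline) {
            c.process(doc);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("GATE runs in Sheffield");
        run(List.of(new WhitespaceTokeniser(), new CapitalisedTagger()), doc);
        System.out.println(doc.count("Token") + " tokens, " + doc.count("Capitalised") + " capitalised");
    }
}
```

Because the components only communicate through the annotations on the document, a different tokeniser could be dropped in without touching the tagger.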

3. First Cousins - the Ontotext Family

Complementing GATE's development and collaborative distributed annotation tools, KIM and Mímir provide a straightforward deployment option (front-end, back-end).

  • Ontotext KIM: multiparadigm search UIs for information management, navigation and search including KIM conceptual query, KIM CORE, and the ANNIC Annotations in Context tool
  • Ontotext Mímir: (Multi-paradigm Information Management Index and Repository) a massively scalable multiparadigm index built on Ontotext's semantic repository family, GATE's annotation structures database plus full-text indexing from MG4J

Many systems developed with GATE are embedded into existing applications of one sort or another; the Ontotext family provide a good alternative to this approach, and GATE-based annotation with a KIM/Mímir index and search engine represents a robust and mature solution for text analysis for enterprise search and similar.

4. Where next?

Hungry for more? A summary of the main sources of documentation and where to get help:

Good luck!


  1. Rumours that we're planning to send several of the development team to Antarctica on one-way tickets are, of course, false, libellous and wishful thinking.
  2. Our philosophy is reuse not reinvention, so we integrate and interoperate with other systems, e.g.: LingPipe, OpenNLP, UIMA, and many more specific tools.
  3. This is the figure for direct Sheffield-based investment only and therefore an underestimate.
  4. GATE Developer and GATE Embedded are bundled, and in older distributions were referred to just as "GATE".

New post
 18/06/2011 3:35 PM

I'm looking into GATE for my final-year thesis. I searched for a long time before finding this post, and it's still not very clear to me.

Could you continue with some guidance on how to use and make the most of this tool? :) I've only just started looking into it, and I'm searching for a mentor in natural language processing.

I'm really, really looking forward to the next post!

New post
 19/06/2011 11:56 PM

Hello nthquyen,

I have wanted to look into GATE for a long time too, but have not had anyone to work on it with. I'm looking for a student to work with :-)). Which university are you studying at? As I understand it, this is a platform for natural language processing, but aimed mainly at linguists.

I have an idea of building on GATE to make a simpler platform, using what I would call an SPI, a Service Programming Interface (by analogy with API), to let linguists script applications that exploit corpora and compile statistics on linguistic regularities.

GATE is somewhat complex, firstly because it is open and secondly because it is general-purpose. When I have some free time I will read a bit more and comment further.
