dc.contributor.author |
Botha, G
|
|
dc.contributor.author |
Barnard, E
|
|
dc.date.accessioned |
2012-02-23T07:34:46Z |
|
dc.date.available |
2012-02-23T07:34:46Z |
|
dc.date.issued |
2005-11 |
|
dc.identifier.citation |
Botha, G and Barnard, E. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005 |
en_US |
dc.identifier.isbn |
0-7992-2264-X |
|
dc.identifier.uri |
http://hdl.handle.net/10204/5587
|
|
dc.description |
Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005 |
en_US |
dc.description.abstract |
Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results. |
en_US |
dc.language.iso |
en |
en_US |
dc.publisher |
PRASA |
en_US |
dc.subject |
Text corpora |
en_US |
dc.subject |
Text collection |
en_US |
dc.subject |
Web-crawling |
en_US |
dc.title |
Two approaches to gathering text corpora from the WorldWideWeb |
en_US |
dc.type |
Conference Presentation |
en_US |
dc.identifier.apacitation |
Botha, G., & Barnard, E. (2005). Two approaches to gathering text corpora from the WorldWideWeb. PRASA. http://hdl.handle.net/10204/5587 |
en_ZA |
dc.identifier.chicagocitation |
Botha, G, and E Barnard. "Two approaches to gathering text corpora from the WorldWideWeb." 2005. http://hdl.handle.net/10204/5587 |
en_ZA |
dc.identifier.vancouvercitation |
Botha G, Barnard E. Two approaches to gathering text corpora from the WorldWideWeb. PRASA; 2005. http://hdl.handle.net/10204/5587 |
en_ZA |
dc.identifier.ris |
TY - Conference Presentation
AU - Botha, G
AU - Barnard, E
AB - Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.
DA - 2005-11
DB - ResearchSpace
DP - CSIR
KW - Text corpora
KW - Text collection
KW - Web-crawling
LK - https://researchspace.csir.co.za
PY - 2005
SM - 0-7992-2264-X
T1 - Two approaches to gathering text corpora from the WorldWideWeb
TI - Two approaches to gathering text corpora from the WorldWideWeb
UR - http://hdl.handle.net/10204/5587
ER -
|
en_ZA |