COSH stands for Corpus Of Spoken Hindi. It is a large-scale web corpus built from Internet pages written in Devanagari script (UTF-8), and it contains nearly two hundred million words.
To analyze the web corpus, Osaka University and the Lago Language Institute developed a dedicated concordancer, COSH Conc. This tool enables users, mostly language researchers, to examine how any Hindi word or phrase is distributed across the corpus. It supports word-cluster searches, collocation analysis, the study of grammatical behavior, and word counts.
COSH and COSH Conc were developed so that non-native speakers of Hindi can carry out linguistic research on the language. In particular, COSH Conc is customized to handle texts encoded in Unicode Devanagari.
To cite COSH, use the following reference:
Miki Nishioka (Osaka University) and Lago Language Institute (2016-2017). Corpus Of Spoken Hindi (COSH) and COSH Conc [Software]. Available from http://www.cosh.site
Development of COSH and the concordancer
Since COSH is a web corpus, it contains primarily web data such as newspapers and blogs. The raw data contained frequent duplicates, as well as some pages in other Indian languages written in Devanagari script. We have removed these as far as possible.
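The duplicate removal mentioned above can be illustrated with a minimal sketch. It is not the procedure actually used for COSH, which is not documented here. The function names and the whitespace-collapsing step are assumptions made for illustration, and the sketch only catches copies that are identical apart from whitespace.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a page after collapsing whitespace (hypothetical helper)."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first occurrence of each distinct page text, preserving order."""
    seen = set()
    unique = []
    for page in pages:
        key = fingerprint(page)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

Near-duplicates, such as the same article with a different page footer, would need a fuzzier comparison than this exact-match hash.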
COSH does not include Hindi data in Roman script, because there is no standard transliteration. Initial problems with searching the corpus and the concordancer for words containing Nuqtā ़ have been resolved.
We encountered some problems in annotation, because the Internet contains countless new words, mostly English loanwords rendered in Devanagari. Moreover, since most Internet texts are written in natural, mistake-prone language, the corpus may contain spelling errors, typos, and some incorrect annotations.
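One common reason that Nuqtā searches fail is that Unicode allows the same letter to be encoded in two ways: as a single precomposed code point (for example U+095B) or as a base consonant followed by the combining Nuqtā U+093C. The sketch below shows one way to make both spellings match by normalizing text and query to NFD before comparing. It is an illustration under that assumption, not a description of how COSH Conc handles the issue, and the function names are hypothetical.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA (combining)


def normalize_nuqta(text: str) -> str:
    """Decompose precomposed Nuqtā letters into base consonant + U+093C."""
    return unicodedata.normalize("NFD", text)


def strip_nuqta(text: str) -> str:
    """Remove Nuqtā entirely, for searches that should ignore it."""
    return normalize_nuqta(text).replace(NUQTA, "")


def contains(haystack: str, needle: str, ignore_nuqta: bool = False) -> bool:
    """Substring search that is insensitive to how Nuqtā is encoded."""
    prepare = strip_nuqta if ignore_nuqta else normalize_nuqta
    return prepare(needle) in prepare(haystack)
```

With `ignore_nuqta=True`, a query typed with Nuqtā also finds web texts where the writer omitted the dot, which is a frequent spelling variation online.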
SCoRE Development and Organization
The concordancer COSH Conc is a tool for searching for a specific word or word cluster in the COSH corpus.
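To illustrate what a concordancer returns, here is a minimal keyword-in-context (KWIC) sketch over a list of tokens. COSH Conc itself relies on BlackLab to search an indexed corpus, so this linear scan is only a toy model of the idea. The function name `kwic`, the `width` parameter, and whitespace tokenization are assumptions made for illustration.

```python
def kwic(tokens, query, width=3):
    """Return (left context, match, right context) for each hit.

    `query` is a list of one or more tokens, so a word cluster is
    matched as a contiguous sequence.
    """
    n = len(query)
    if n == 0:
        return []
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == query:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            hits.append((" ".join(left), " ".join(query), " ".join(right)))
    return hits


# Usage: find every occurrence of one word with one token of context.
sentence = "मैं हिंदी सीख रहा हूँ और हिंदी बोलता हूँ".split()
for left, match, right in kwic(sentence, ["हिंदी"], width=1):
    print(left, "[" + match + "]", right)
```

Counting the returned hits gives the word count for the query, and tallying the context words gives a crude view of its collocates.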
Publications
User Guide
COSH User Guide (PDF format)
Terms of use and disclaimer
The content on the COSH website at www.cosh.site (this site) is provided by Miki Nishioka (Osaka University®). By accessing and using this site, you agree to be bound by the following terms.
1) This system supports Chrome, Firefox, Safari and IE (versions 11 and above). We recommend that you use Chrome or Firefox for optimal processing speed.
2) The data should be used for research or educational purposes only. Redistribution of the data is prohibited.
3) When using COSH, enable cookies in your browser. If cookies are disabled, the data will not be displayed.
4) The data in this corpus is collected from publicly accessible websites and processed for teaching and research purposes. Every example is shown with its source and URL. If you are the copyright owner of a text and wish to have it removed from COSH, please contact us using the contact form. We will delete the text once we have confirmed that the applicant is the owner of the webpage.
5) While we will make reasonable efforts to provide the site's services, we will bear no responsibility for unavailability or any failure to provide service. External URLs to source texts included on our website have not been checked for viruses, malware, or any other electronic hazards, and you access them at your own risk. We provide no warranty against any other loss or defect that may result from accessing them. We will accept no responsibility for any loss or damage resulting from failure to provide service or as a result of accessing an external link.
Acknowledgement
This project was supported by a Grant-in-Aid for Scientific Research (C) [KAKENHI: no. JP15K02517], provided by the Japan Society for the Promotion of Science (JSPS).
We thank Thorsten Brants and Shiva Reddy, who gave us free use of Trigrams'n'Tags (TnT) and the Hindi POS tagger, respectively. We are also grateful to BlackLab, which enabled us to search the large body of text in the web corpus.
Contact us with any questions about COSH and COSH Conc. We also appreciate any comments or suggestions on them.