Talk:SMC/SoC/2008
From FOSS Community India
The Swathantra Malayalam corpus project as a whole aims to build a free and open source annotated corpus, related APIs, programs to build different types of corpus, etc.
Details:
- We need an annotated image and speech corpus to support speech- and image-related FOSS-driven research and development.
- It should be able to serve as standard training and test data for R&D activities.
- In the first phase, we need to write a specification document and a clearly written manual for building the corpus, and build the tools needed to build and use the corpus.
- Anybody who would like to contribute to the project must be able to do so, and the specification should be thorough, covering all aspects of data classification, data annotation, storage structure and all related details.
- As part of the project, by the end of the summer we must have a complete specification (document, explanations, related presentations, demo files, etc.) and programs to build and access the corpus. Building the corpus itself must be a collaborative effort and does not come under this phase.
- More importantly, the structure should be extensible to all Indic languages.
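To make the points above about classification, annotation and storage structure a little more concrete, here is a rough sketch of what a single corpus record might look like. Nothing here is decided: the class names (`CorpusEntry`, `Annotation`), the field names, the JSON storage format and the use of ISO 639-1 language codes (such as "ml" for Malayalam) are all assumptions made only for illustration, and the real specification may look quite different.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

# All names and fields below are hypothetical; the actual specification
# document would define the real structure.


@dataclass
class Annotation:
    label: str          # e.g. transcribed text, or the character shown in an image
    start: float = 0.0  # start offset in seconds (speech only; left at 0.0 for images)
    end: float = 0.0    # end offset in seconds (speech only)


@dataclass
class CorpusEntry:
    entry_id: str
    language: str       # ISO 639-1 code, e.g. "ml"; keeps the structure open to other Indic languages
    media_type: str     # "speech" or "image"
    path: str           # location of the raw audio or image file
    split: str          # "train" or "test"
    annotations: List[Annotation] = field(default_factory=list)

    def __post_init__(self):
        # Reject values outside the small vocabularies assumed in this sketch.
        if self.media_type not in ("speech", "image"):
            raise ValueError(f"unknown media_type: {self.media_type}")
        if self.split not in ("train", "test"):
            raise ValueError(f"unknown split: {self.split}")

    def to_json(self) -> str:
        # ensure_ascii=False keeps Malayalam text readable in the stored file.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "CorpusEntry":
        data = json.loads(text)
        data["annotations"] = [Annotation(**a) for a in data["annotations"]]
        return cls(**data)


entry = CorpusEntry(
    entry_id="ml-speech-0001",
    language="ml",
    media_type="speech",
    path="speech/ml/0001.wav",
    split="train",
    annotations=[Annotation(label="നമസ്കാരം", start=0.0, end=1.2)],
)
restored = CorpusEntry.from_json(entry.to_json())
```

Keeping the language as a plain field rather than baking it into the structure is one way the same record could be reused for other Indic languages; whether that is enough is something the specification would need to settle.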
Please add any further details that are relevant to a corpus project.
Jinesh, this is a good idea, but I think there is one problem. If it is just "documentation of what to do", it is not suitable for an SoC project. I know that you want it done in phases, but we don't have any guarantee that the student will continue with the remaining phases. And if it is the complete project within 6 months, then I don't think it can be done as a student project... What do you say? I am moving this idea to discussion temporarily.