Automatic Knowledge Acquisition for Lexicography

COST ENeL WG3 meeting
Herstmonceux castle, UK, 13 August 2015

Agenda

14:00-14:10 Welcome
14:10-14:25 Automatic Acquisition of Knowledge survey results (Carole Tiberius, Kris Heylen)
14:25-14:40 Example sentences

Extracting (good) discourse examples from an oral specialised corpus of wine tasting interactions (Patrick Leroyer, Laurent Gautier & Hedi Maazaoui)

Paper

14:40-14:55 Translation equivalents

Automatic Extraction of Bilingual Slovak-English equivalents (Radovan Garabik & Agáta Karcová)

14:55-15:10 Bilex (Lionel Nicolas, Verena Lyding & Egon Stemle)
15:10-15:15 Bilingual Translation Equivalents: short summary & discussion
15:15-15:30 Lemma & Frequency

Introducing SVALex: a corpus-based lexical resource for second language learning (Elena Volodina, Ildikó Pilán, Thomas François)

15:30-16:00 Coffee break
16:00-16:15 Semantics

SKE & visualisation of word senses (Miloš Jakubíček & Kris Heylen)

16:15-16:30 Extending a Large Semantic Network (Luis Espinosa-Anke)
16:30-16:45 Neologisms (Carole Tiberius, Egon Stemle & Miloš Jakubíček)
16:45-17:00 Multiword expressions

Towards a corpus-based online dictionary of Italian Word Combinations (Sara Castagnoli, Gianluca E. Lebani, Alessandro Lenci, Francesca Masini, Malvina Nissim & Valentina Piunno)

Paper

17:00-17:10 MWE Survey results (Simon Krek)
17:10-17:30 Discussion, MWE workshop Skopje, use cases for Friday’s WG-all meeting & next meeting (Kris Heylen, Simon Krek & Carole Tiberius)

Call for Papers

The next meeting of COST-ENeL Working Group 3 in Herstmonceux Castle (August 13th) will be dedicated to Automatic Knowledge Acquisition (AKA) for Lexicography (Organisers: Carole Tiberius, Simon Krek and Kris Heylen). The survey that you were kindly invited to complete in April has already shown that a wide variety of AKA types are used in our community. We will present an overview of the survey results at the meeting. As a next step in preparation for the meeting, we now call on you to submit short papers (1000-1500 words) by June 1st on the software and methods that your institution is using or developing to acquire specific types of information (semi-)automatically from corpora for lexicographical purposes. These knowledge types include:

  • (Candidate) Lemma list
  • Overall Lemma Frequency information
  • Form variation (e.g. irregular morphology, orthographic variants)
  • Multiword expressions (i.e. sequences of words with some unpredictable properties such as “to count somebody in” or “to take a haircut”, ranging from collocations, phrasal verbs and (pragmatic) frozen expressions (e.g. “of course”, “good morning”) to traditional idioms, proverbs etc.)
  • Neologisms
  • Definitions
  • Knowledge Rich Contexts (i.e. in terminography, a hybrid of a good example and a definition, which illustrates the meaning characteristics of a term without being a formal definition)
  • Lexical-semantic relations (e.g. synonyms, antonyms, hypernyms)
  • Word senses
  • Grammatical patterns (e.g. word profiles, valency)
  • Linguistic labels (domain/ region/ dialect/ register/ style/ time/ slang and jargon/ attitude/ offensive terms)
  • Translations and other cross-linguistic information

(Note that Good Example Extraction is not included since it was the focus of the Vienna workshop).

We are especially interested in papers that also discuss the relevance of AKA for the lexicographic process and, more particularly, how the automatically acquired information is evaluated in the context of a lexicographic project. The meeting’s schedule allows for 7 to 10 papers, each presented in a 15-minute slot. The selection will be made by the organisers based on topic relevance. Accepted papers will also be made available through the website of Working Group 3 as the deliverable of the meeting.

GUIDELINES:

Length: 1000-1500 words (references not included)

Stylesheet: ELEX-style-guide but with length adjusted to 1000-1500 words and WITHOUT abstract. https://elex.link/elex2015/instructions-for-authors/

Format: MS Word, PDF, LaTeX

Deadline: June 1st 2015

Submission: by e-mail to both Kris.Heylen@kuleuven.be and carole.tiberius@inl.nl with subject line: “COST-WG3 paper submission 2015”

Notification of acceptance: June 15th 2015

Place/date of the meeting: Herstmonceux Castle (August 13th 2015)

Meeting website: http://www.elexicography.eu/working-groups/working-group-3/wg3-meetings/wg3-herstmonceux-2015/

Accepted papers qualify for reimbursement.

PURPOSE OF THE QUESTIONNAIRE

The aim of this questionnaire is to create an inventory of the different types of automatic knowledge acquisition that are currently used within the framework of lexicographical projects (general, specialised, bilingual, synchronic, diachronic etc.). We would also like to find out what works and what doesn’t work with respect to automatic knowledge acquisition for the purpose of dictionary creation.

WHAT DO WE MEAN BY AUTOMATIC KNOWLEDGE ACQUISITION?

By automatic knowledge acquisition we mean knowledge (data) which

  • is automatically obtained from corpora of authentic language use (both synchronic and diachronic);
  • either forms the input for lexicographers (who further inspect and edit the data) or is included as is in the published dictionary (possibly marked as knowledge that has been automatically derived from corpus data).

We distinguish different types of automatically acquired knowledge including (but not limited to):

  • (Candidate) Lemma list
  • Overall Lemma Frequency information
  • Form variation (e.g. irregular morphology, orthographic variants)
  • Example sentences (cf. Vienna COST workshop)
  • Multiword expressions (i.e. sequences of words with some unpredictable properties such as “to count somebody in” or “to take a haircut”, ranging from collocations, phrasal verbs and (pragmatic) frozen expressions (e.g. “of course”, “good morning”) to traditional idioms, proverbs etc.)
  • Neologisms
  • Definitions
  • Translation Equivalents
  • Knowledge Rich Contexts (i.e. in terminography, a hybrid of a good example and a definition, which illustrates the meaning characteristics of a term without being a formal definition)
  • Lexical-semantic relations (e.g. synonyms, antonyms, hypernyms)
  • Word senses
  • Grammatical patterns (e.g. word profiles, valency)
  • Linguistic labels (domain/ region/ dialect/ register/ style/ time/ slang and jargon/ attitude/ offensive terms)

As the questionnaire is rather complicated, we would like to ask you to read these instructions before you start:

  1. Before you start filling in the online form, it is advisable to download the PDF file with the entire questionnaire and familiarise yourself with its structure.
  2. All members of Working Group 3 are required to fill in the questionnaire. Completing the questionnaire is one of the requirements to qualify for reimbursement for the meeting at Herstmonceux Castle. If you don’t do any automatic knowledge acquisition within your lexicographical project(s), it will take you a minute or two. Otherwise it should take between half an hour and an hour.
  3. Completing the questionnaire will most likely require input from different people within your institution, i.e. computational linguists, corpus linguists, software engineers, lexicographers. We suggest that you collaborate and complete one survey per institution.
  4. If you have questions BEFORE, DURING or AFTER you fill in the questionnaire, do contact us.
  5. The deadline for completing the questionnaire is April 26.