Article
Peer-Review Record

Learning to Describe: A New Approach to Computer Vision Based Ancient Coin Analysis

by Jessica Cooper * and Ognjen Arandjelović *
Reviewer 1:
Reviewer 2:
Submission received: 22 February 2020 / Accepted: 24 February 2020 / Published: 17 April 2020
(This article belongs to the Special Issue Machine Learning and Vision for Cultural Heritage)
Version 1
DOI: 10.3390/sci2010008

Round 1

Reviewer 1 Report

A common approach in the literature on ancient coin classification has been to recognise the authorised issuers of coins using various image representations and classification algorithms. In contrast to this common approach, this paper aims to recognise the semantic elements of the motifs depicted on the reverse sides of ancient coins. More specifically, semantic class names are derived from the textual descriptions of coins, and the corresponding visual representations are learned from coin images. While the proposed approach is quite interesting, the presentation of the experimental design was not sufficiently clear to me. Moreover, although the same problem is tackled in a previous publication [1] by the same authors, I could not determine any methodological or experimental extension of the present work over the previous one.

My specific comments are as follows:

  • It is mentioned in the first paragraph of the Introduction that the present work extends the previous work [1] of the same authors. However, I could not identify any extension, either experimental or methodological; the two works seem extremely similar to each other. The only difference I could detect is the visualisation of the learned filters in Figures 12 and 13 of the current paper. If the work is extended in aspects that I could not recognise, could the authors list them in the corresponding paragraph of the Introduction?
  • The same neural network architecture seems to have already been proposed and used for coin classification in a previous paper by the same authors [Schlag I, Arandjelovic O. Ancient Roman coin recognition in the wild using deep learning based recognition of artistically depicted face profiles. In ICCVW 2017 (pp. 2898-2906).]. It would be good to see a citation to that work where the framework is explained (Section 3, Proposed Framework), together with a note on any differences in the current approach.
  • The semantic labels are chosen based on the most frequent terms in the textual descriptions of the coins. I could not tell from the manuscript why the number of classes is limited to five; a histogram depicting the frequency of all terms in the textual descriptions would help clarify this point. What was the initial size of the overall dataset, and after selecting the images related to the chosen semantic classes, were the remaining images discarded or were they used as a pool from which to draw negative examples?
  • The specification of the overall dataset used in the experiments is not clear to me. I see that the Horse, Cornucopia, Patera, Eagle and Shield classes have around 18K, 14K, 5K, 14K and 18K images respectively. However, these visual elements can appear together on the same image (e.g. a patera and a cornucopia appear on the same image in Fig. 6, row 2, col. 2). Then, (1) what is the overall size of the dataset: around 69K, or less than 69K because the sets of 18K, 14K, 5K, 14K and 18K images overlap due to co-occurring elements? (2) From which set are the negative examples chosen?
  • From Table 10 it is understood that training is done separately for each of the five image sets. I could not tell whether training uses 2-class or 5-class labels (in some sense it seems to be a 2-class classification problem, since it is mentioned several times that positive and negative examples are used in the experiments; then again, how are the negative examples decided?). Could this be stated more explicitly in the manuscript?
  • In the caption of Figure 11, it is written that the identified salient regions correspond to a cornucopia, a patera, and a shield, respectively. The last one should probably be an eagle, not a shield. Could the authors also give an example visualisation for the shield?
  • Figures 12 and 13 do not seem useful to me, because it is not possible to discern the difference between them.

Author Response

General comment: Thank you for your comments. We will update the manuscript to clarify these points. Please see below for detailed responses and pointers to the relevant sections of the manuscript.

Comment: It is mentioned in the first paragraph of the Introduction that the present work extends the previous work [1] of the same authors; however, no extension, experimental or methodological, could be identified, and the only detectable difference is the visualisation of the learned filters in Figures 12 and 13. Could the authors list the extensions in the corresponding paragraph of the Introduction?

Response: Our submission is an extension of our previous conference paper. It contains more of the theoretical content underpinning the algorithm, further experiments including a more detailed exploration of the dataset, and a more in-depth analysis and discussion of our findings and their relevance.

Comment: The same neural network architecture appears to have been proposed and used for coin classification in a previous paper by the same authors [Schlag I, Arandjelovic O. Ancient Roman coin recognition in the wild using deep learning based recognition of artistically depicted face profiles. In ICCVW 2017 (pp. 2898-2906).]; a citation to that work in Section 3 (Proposed Framework), with a note on any differences, would be welcome.

Response: Our network architecture is different from the one you mention. That model used five convolutional blocks, each consisting of “two sets of convolutional layers, batch normalization, and rectified linear unit activation… The final architecture is made up of five consecutive convolutional blocks and max-pooling pairs. The number of filters is doubled after every pooling layer with the exception of the last layer.” In contrast, we use an architecture closer to AlexNet, as noted in our submission: we do not use blocks, we do not apply batch normalization, we do not double the number of filters after each pooling layer, and we do not pair convolutional blocks with max pooling. The kernel sizes for each operation are not explicitly given in the paper you mention, but they are likely smaller than ours, since that work cites Simonyan and Zisserman, “who demonstrated that a carefully crafted network built using few small (3×3), stacked kernels is superior to one comprising bigger kernels in terms of describability and computational cost”. In contrast, we use larger kernels, which we found gave better performance.
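To make the contrast concrete, the following is a minimal sketch of such an AlexNet-style stack: plain convolutions with comparatively large kernels, no convolutional blocks, no batch normalisation, and no doubling of filter counts after pooling. The layer counts, filter numbers, kernel sizes and input resolution below are illustrative assumptions, not the exact configuration reported in the submission.

# Minimal illustrative sketch of an AlexNet-style binary classifier:
# a plain stack of convolutions with comparatively large kernels,
# no convolutional blocks, no batch normalisation, and no doubling of
# filter counts after pooling. All sizes are assumptions for illustration.
import torch
import torch.nn as nn

class CoinElementNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2),  # large first kernel (assumed)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),  # binary: element present / absent
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example forward pass on a single-channel 224x224 input (size assumed).
model = CoinElementNet()
logits = model(torch.zeros(1, 1, 224, 224))

One such binary network, trained on element-present versus element-absent examples, would be learned independently for each of the five semantic elements, in line with the per-element training described in the responses below.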
Comment: The semantic labels are chosen based on the most frequent terms in the textual descriptions, but it is not clear why the number of classes is limited to five; a histogram of term frequencies would help. It is also unclear what the initial size of the overall dataset was and whether the images unrelated to the chosen classes were discarded or used as a pool of negative examples.

Response: A histogram depicting the frequency of all terms is infeasible; there are many thousands of possible terms, as is evident from the sample attributions in Figure 3. We limit our work to the five most frequent terms for reasons of time cost and feasibility.

Comment: The specification of the overall dataset is not clear. The Horse, Cornucopia, Patera, Eagle and Shield classes have around 18K, 14K, 5K, 14K and 18K images respectively, but these elements can co-occur on the same image (e.g. a patera and a cornucopia appear together in Fig. 6, row 2, col. 2). (1) What is the overall size of the dataset: around 69K, or less because the per-element sets overlap? (2) From which set are the negative examples chosen? From Table 10 it appears that training is done separately for each of the five image sets; is training done with 2-class or 5-class labels, and how are the negative examples decided? Could this be stated more explicitly in the manuscript?

Response: The overall dataset size is stated in Section 4: “our data comprised 100,000 images and their associated textual descriptions.” Binary labelling for each element is correct; please see Section 2.2.3: “We shuffle the samples before building training, validation and test sets for each of the selected elements. To address under-representation of positive examples, we use stratified sampling to ensure equal class representation [18], matching the number of positive samples for each class with randomly selected negative samples (and thereby doubling the size of the dataset for each element). This provides us with datasets for each element of the following sizes: ‘horse’: 17,978; ‘cornucopia’: 13,956; ‘patera’: 5,330; ‘eagle’: 14,028; ‘shield’: 17,546 each of which we split with a ratio of 70% training set, 15% validation set and 15% test set.”

Comment: The caption of Figure 11 states that the identified salient regions correspond to a cornucopia, a patera, and a shield, respectively; the last should probably be an eagle, and an example visualisation for the shield would be welcome.

Response: Thank you, we will change this.

Comment: Figures 12 and 13 do not seem useful, because it is not possible to discern the difference between them.

Response: The difference is subtle, but do you not see a leaning towards recognising diagonal edges in the first, as one would expect given the shape of the cornucopia, and more small curved shapes in the second? We will investigate visualising the filters of deeper layers.
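For illustration, a minimal sketch of the per-element balanced dataset construction quoted above is given below. It assumes that each record is an (image identifier, textual description) pair and that a simple substring match identifies positive examples; both are simplifications for illustration, not the authors' implementation.

# Illustrative sketch of the balanced per-element dataset construction
# described in Section 2.2.3: positives are coins whose description mentions
# the element, an equal number of negatives are drawn at random, and the
# balanced set is split 70/15/15. The record structure and substring-based
# matching are assumptions for illustration only.
import random

def build_element_dataset(records, element, seed=0):
    """records: list of (image_id, description) pairs; element: e.g. 'eagle'."""
    positives = [(img, 1) for img, text in records if element in text.lower()]
    candidates = [(img, 0) for img, text in records if element not in text.lower()]

    rng = random.Random(seed)
    negatives = rng.sample(candidates, len(positives))  # match the positive count

    samples = positives + negatives
    rng.shuffle(samples)

    n = len(samples)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (samples[:n_train],                 # 70% training set
            samples[n_train:n_train + n_val],  # 15% validation set
            samples[n_train + n_val:])         # 15% test set

# Usage, one balanced binary dataset per selected element:
# for element in ['horse', 'cornucopia', 'patera', 'eagle', 'shield']:
#     train, val, test = build_element_dataset(records, element)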

Reviewer 2 Report

The paper presents a method to detect the presence of common elements on ancient coins (horse, shield, etc.) using a convolutional neural network. The problem is rendered extremely challenging by the fact that the annotations used in training, which were made by professional coin dealers, are unstructured.

 

The paper is very well written, and the results obtained are remarkable.

Author Response

The paper is very well written, and the results obtained are remarkable. Thank you!

Round 2

Reviewer 1 Report

Thanks to the authors for the revised version. 
