Vol. 4 No. 1 (2024): Journal of Millimeterwave Communication, Optimization and Modelling
Articles

Enhancing zero-shot learning based sign language recognition through hand landmarks and data augmentation

Giray Sercan Özcan
Baskent University

Published 29.02.2024

Keywords

  • sign language recognition,
  • zero-shot learning

Abstract

Sign language recognition remains a challenging area and typically requires a considerable amount of data to obtain satisfactory results. To overcome this, we use readily available textual motion descriptions in addition to videos in order to recognize classes that are not observed during training. This work focuses on Zero-Shot Sign Language Recognition (ZSSLR), in which a model is learned from seen sign classes and must recognize unseen sign classes, and proposes a novel technique for it. To this end, we use the ASL-Text dataset, which pairs videos of word-level signs with the corresponding textual definitions from sign language dictionaries. As in many Zero-Shot Learning (ZSL) applications, the dataset contains only a limited number of examples for a large number of classes, which makes sign language recognition particularly challenging. We address this with a new approach that combines data augmentation and hand landmarks. The experiment with augmented data achieves a top-5 accuracy of 50.91. Hand landmarks extracted from unaugmented data and aggregated with average and LSTM deep learning layers yield top-5 accuracies of 49.41 and 48.21, respectively.
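To make the hand-landmark branch described above concrete, the sketch below shows one plausible realization of it, assuming MediaPipe Hands for per-frame landmark extraction and PyTorch for the average and LSTM aggregation variants. The function names, embedding dimensions, and the cosine-similarity matching against class-description embeddings are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (not the paper's implementation): per-frame hand landmarks are
# extracted with MediaPipe Hands, aggregated over time either by averaging or
# with an LSTM, and matched against class-description embeddings for zero-shot
# classification. Dimensions and names are illustrative assumptions.
import cv2
import mediapipe as mp
import torch
import torch.nn as nn
import torch.nn.functional as F


def extract_landmark_sequence(video_path, max_hands=2):
    """Return a (num_frames, max_hands * 21 * 3) tensor of hand landmarks."""
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=max_hands)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        coords = torch.zeros(max_hands, 21, 3)  # zeros when a hand is not detected
        if result.multi_hand_landmarks:
            for h, hand in enumerate(result.multi_hand_landmarks[:max_hands]):
                coords[h] = torch.tensor([[p.x, p.y, p.z] for p in hand.landmark])
        frames.append(coords.flatten())
    cap.release()
    hands.close()
    return torch.stack(frames)  # (T, 126) for two hands


class LandmarkEncoder(nn.Module):
    """Aggregate a landmark sequence into a fixed-size video embedding."""

    def __init__(self, in_dim=126, embed_dim=256, mode="lstm"):
        super().__init__()
        self.mode = mode
        self.lstm = nn.LSTM(in_dim, embed_dim, batch_first=True)
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, seq):                      # seq: (T, in_dim)
        if self.mode == "average":
            return self.proj(seq.mean(dim=0))    # temporal average pooling
        _, (h, _) = self.lstm(seq.unsqueeze(0))  # last hidden state of the LSTM
        return h[-1, 0]


def zero_shot_predict(video_embedding, class_text_embeddings, k=5):
    """Rank unseen classes by cosine similarity between video and text embeddings."""
    sims = F.cosine_similarity(video_embedding.unsqueeze(0), class_text_embeddings)
    return sims.topk(k).indices  # indices of the top-k candidate sign classes
```

In this sketch, class_text_embeddings would come from a text encoder applied to the dictionary definitions of the unseen classes and projected to the same dimensionality as the video embedding; the "average" and "lstm" modes correspond to the two aggregation variants whose top-5 accuracies are reported in the abstract.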
