Joao Paulo Paiva Lima1,*, Denilson Alves Pereira1
1 Department of Computer Science, Federal University of Lavras, Campus Universitario UFLA, 3037, Minas Gerais, Brazil
Email: [email protected], [email protected]
Transformer-based large language models still have unexplored possibilities, particularly in the use of multi-dimensional positioning during autoregressive generation. Non-sequential positioning is already a common feature in visual document understanding and table understanding encoder models, where it is typically applied by associating every input token with its x and y coordinates in a visual document or table. However, the same technique remains largely unexplored in generative decoder models. In this work, we investigate whether decoder models for table generation can also be improved by incorporating a richer form of positioning. We adapted a pretrained image-to-sequence model to attach three positional dimensions to each generated token in a table, representing the token’s position inside its cell, the cell’s position inside its row, and the row’s position inside the table. The adapted model was then trained for table recognition on the PubTabNet dataset. To assess its effectiveness, we compared the trained model’s performance against an identical baseline using standard positional encoding. The resulting model achieved a significantly higher overall score than the baseline (+1.2%), with a more pronounced advantage on complex tables (+2.2%) and very large tables (+16.9%).
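To make the idea concrete, the following minimal PyTorch sketch shows one common way such three-dimensional positioning could be realized: a learned embedding table per axis (token inside cell, cell inside row, row inside table), with the three vectors summed into the decoder's token embeddings. The module name, the dimension limits, and the summation scheme are illustrative assumptions, not the exact mechanism of the adapted model described above.

import torch
import torch.nn as nn

class TablePositionEmbedding(nn.Module):
    # One learned embedding per positional axis; the three vectors are summed.
    def __init__(self, d_model, max_tok_in_cell=512, max_cell_in_row=64, max_row_in_table=128):
        super().__init__()
        self.tok_in_cell = nn.Embedding(max_tok_in_cell, d_model)
        self.cell_in_row = nn.Embedding(max_cell_in_row, d_model)
        self.row_in_table = nn.Embedding(max_row_in_table, d_model)

    def forward(self, tok_pos, cell_pos, row_pos):
        # Each argument is a (batch, seq_len) tensor of integer indices.
        return (self.tok_in_cell(tok_pos)
                + self.cell_in_row(cell_pos)
                + self.row_in_table(row_pos))

# Hypothetical usage: replace the decoder's usual 1-D positional term.
d_model = 256
pos = TablePositionEmbedding(d_model)
tok_pos  = torch.tensor([[0, 1, 0, 0, 1, 2]])  # token index inside its cell
cell_pos = torch.tensor([[0, 0, 1, 2, 2, 2]])  # cell index inside its row
row_pos  = torch.tensor([[0, 0, 0, 0, 0, 0]])  # row index inside the table
token_emb = torch.randn(1, 6, d_model)         # stand-in for the decoder's token embeddings
decoder_input = token_emb + pos(tok_pos, cell_pos, row_pos)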
Table recognition, generative model, transformer, natural language processing, image processing
Joao Paulo Paiva Lima, Denilson Alves Pereira (2025). Evaluating Three-Dimensional Topological Positioning in Generative Table Recognition. Journal of Artificial Intelligence and Systems, 7, 61–75. https://doi.org/10.33969/AIS.2025070104.
[1] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents, 2023.
[2] Michael Cafarella, Alon Halevy, Hongrae Lee, Jayant Madhavan, Cong Yu, Daisy Zhe Wang, and Eugene Wu. Ten years of webtables. Proceedings of the VLDB Endowment, 11(12):2140–2149, August 2018.
[3] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729, 2019.
[4] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. ICDAR 2013 table competition. In 12th International Conference on Document Analysis and Recognition, pages 1449–1453. IEEE, 2013.
[5] Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. TaPas: Weakly supervised table parsing via pre-training. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, Online, July 2020. Association for Computational Linguistics.
[6] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, pages 4083–4091, New York, NY, USA, 2022. Association for Computing Machinery.
[7] Antonio Jimeno Yepes, Peter Zhong, and Douglas Burdick. ICDAR 2021 competition on scientific literature parsing. In Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part IV, pages 605–617, Berlin, Heidelberg, 2021. Springer-Verlag.
[8] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understanding transformer. In 17th European Conference on Computer Vision – ECCV: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 498–517, Berlin, Heidelberg, 2022. Springer-Verlag.
[9] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. TableBank: Table benchmark for image-based table detection and recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1918–1925, Marseille, France, May 2020. European Language Resources Association.
[10] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742, 2020.
[11] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
[12] Ning Lu, Wenwen Yu, Xianbiao Qi, Yihao Chen, Ping Gong, Rong Xiao, and Xiang Bai. MASTER: Multi-aspect non-local network for scene text recognition. Pattern Recognition, 2021.
[13] Nam Tuan Ly and Atsuhiro Takasu. An end-to-end multi-task learning model for image-based table recognition. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, pages 626–634. SciTePress, 2023.
[14] Nam Tuan Ly and Atsuhiro Takasu. An end-to-end multi-task learning model for image-based table recognition. pages 626–634, 2023.
[15] Ahmed Samy Nassar, Nikolaos Livathinos, Maksym Lysak, and Peter W. J. Staar. TableFormer: Table structure understanding with transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4604–4613, 2022.
[16] Sachin Raja, Ajoy Mondal, and C. V. Jawahar. Table structure recognition using top-down and bottom-up cues. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 70–86, Cham, 2020. Springer International Publishing.
[17] Brandon Smock, Rohith Pesala, and Robin Abraham. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4624–4632, 2022.
[18] OpenAI Team. GPT-4 technical report, 2024.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
[20] Jiapeng Wang, Lianwen Jin, and Kai Ding. LiLT: A simple yet effective language-independent layout transformer for structured document understanding. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7747–7757, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[21] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
[22] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of text and layout for document image understanding. 2020.
[23] Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, and Furu Wei. LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding. 2021.
[24] Jiaquan Ye, Xianbiao Qi, Yelin He, Yihao Chen, Dengyi Gu, Peng Gao, and Rong Xiao. PingAn-VCGroup’s solution for ICDAR 2021 competition on scientific literature parsing task B: Table recognition to HTML. arXiv preprint arXiv:2105.01848, 2021.
[25] Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Baocai Yin, Bing Yin, and Cong Liu. SEMv2: Table separation line detection based on instance segmentation. Pattern Recognition, 149:110279, 2024.
[26] Xinyi Zheng, Douglas Burdick, Lucian Popa, and Nancy Xin Ru Wang. Global table extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 697–706, 2020.
[27] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model, and evaluation. In 16th European Conference on Computer Vision – ECCV 2020: Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI, pages 564–580, Berlin, Heidelberg, 2020. Springer-Verlag.
[28] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE, September 2019.