DeepSeek vs. ChatGPT: Which Performs Better in Python Coding?

Document Type: Research Paper

Author

Department of Information Technology, Palestine Technical University, Kadoorie, Palestine.

10.22059/jitm.2026.107165

Abstract

This paper presents a comparative evaluation of two advanced large language models (LLMs), ChatGPT-4 and DeepSeek v3, on Python code generation, using 80 algorithmic problems from Codeforces categorized into four difficulty levels: Easy (800–1100), Intermediate (1200–1600), Advanced (1700–2000), and Expert (2100–2400). Standardized prompts and controlled testing conditions enable the assessment of the models on accuracy, efficiency, and code readability. Both models excel at simpler tasks, but as problem complexity increases, DeepSeek frequently outperforms ChatGPT in both accuracy and efficiency. This advantage, however, comes at the cost of reduced code clarity and increased memory use. While less precise at higher difficulty levels, ChatGPT produces more concise and idiomatic responses. Both models showed limited competence at the Expert level; however, DeepSeek-R1 showed a slight edge. The study illustrates a trade-off between accuracy and code clarity, which can inform the selection of LLMs based on task requirements and provides a foundation for future efforts to optimize code generation models for practical applications.
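
The difficulty banding and per-level accuracy described in the abstract can be illustrated with a minimal Python sketch. The rating bounds come from the abstract. The function names (`tier_for`, `accuracy_by_tier`), the treatment of accuracy as the fraction of accepted solutions, and the sample verdicts are assumptions made for illustration only. They are not the study's evaluation harness or its results.

```python
from collections import defaultdict

# Rating bands as stated in the abstract (bounds treated as inclusive).
TIERS = [
    ("Easy", 800, 1100),
    ("Intermediate", 1200, 1600),
    ("Advanced", 1700, 2000),
    ("Expert", 2100, 2400),
]


def tier_for(rating):
    """Return the tier name for a problem rating, or None if it falls outside all bands."""
    for name, low, high in TIERS:
        if low <= rating <= high:
            return name
    return None


def accuracy_by_tier(results):
    """Map each tier to the fraction of accepted solutions.

    `results` is an iterable of (rating, accepted) pairs; this input shape
    is a hypothetical choice, not taken from the paper.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for rating, accepted in results:
        tier = tier_for(rating)
        if tier is None:
            continue  # skip problems outside the studied rating range
        total[tier] += 1
        solved[tier] += int(accepted)
    return {tier: solved[tier] / total[tier] for tier in total}


# Made-up placeholder verdicts, NOT results from the paper.
sample = [(800, True), (1000, True), (1500, False), (1900, True), (2300, False)]
print(accuracy_by_tier(sample))
```

Because Codeforces problem ratings are multiples of 100, the four bands cover every rating from 800 to 2400 without gaps.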

Keywords

