DeepSeek vs. ChatGPT: Which Performs Better in Python Coding?

Abdalla, Rania A. M.

doi:10.22059/jitm.2026.107165

Document Type : Research Paper

Author

Rania A. M. Abdalla

Department of Information Technology, Palestine Technical University, Kadoorie, Palestine.

Abstract

This paper presents a comparative evaluation of two advanced large language models (LLMs), ChatGPT-4 and DeepSeek v3, on Python code generation, using 80 algorithmic problems from Codeforces grouped into four difficulty levels: Easy (800–1100), Intermediate (1200–1600), Advanced (1700–2000), and Expert (2100–2400). Standardized prompts and controlled testing conditions allow the models to be assessed on accuracy, efficiency, and code readability. Both models excel at simpler tasks, but as problem complexity increases, DeepSeek frequently outperforms ChatGPT in both accuracy and efficiency. This advantage, however, comes at the cost of reduced code clarity and increased memory use. ChatGPT, while less precise at higher difficulty levels, produces more concise and idiomatic solutions. Both models showed limited competence at the Expert level, although DeepSeek-R1 showed a slight edge. The study illustrates a trade-off between accuracy and code clarity, which can inform the selection of LLMs based on task requirements and provides a foundation for future work on optimizing code generation models for practical applications.
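The difficulty bands described in the abstract can be illustrated with a minimal Python sketch. Only the four rating ranges come from the abstract; the function names `difficulty_level` and `accuracy_by_level`, and the `(rating, solved)` input format, are hypothetical and are not taken from the paper's actual evaluation harness.

```python
from typing import Dict, Iterable, Optional, Tuple

# Rating bands as stated in the abstract (bounds treated as inclusive).
DIFFICULTY_BANDS = [
    ("Easy", 800, 1100),
    ("Intermediate", 1200, 1600),
    ("Advanced", 1700, 2000),
    ("Expert", 2100, 2400),
]


def difficulty_level(rating: int) -> Optional[str]:
    """Return the band name for a problem rating, or None if it falls outside all bands."""
    for name, low, high in DIFFICULTY_BANDS:
        if low <= rating <= high:
            return name
    return None


def accuracy_by_level(results: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Aggregate (rating, solved) pairs into the fraction solved per band.

    The input format is an assumption made for illustration only.
    """
    totals: Dict[str, int] = {}
    solved: Dict[str, int] = {}
    for rating, ok in results:
        level = difficulty_level(rating)
        if level is None:
            continue  # skip problems outside the studied rating ranges
        totals[level] = totals.get(level, 0) + 1
        solved[level] = solved.get(level, 0) + int(ok)
    return {level: solved[level] / totals[level] for level in totals}
```

A ratio per band, computed this way for each model, is one plausible way to compare accuracy across difficulty levels; the paper's own metric definitions may differ.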

Journal of Information Technology Management

Volume 18, Issue 2
2026
Pages 1-27
