
Improve performance for slow queries #6267

Open
1 of 3 tasks
gabina opened this issue Mar 23, 2025 · 9 comments
Labels
performance, ruby

Comments

@gabina
Member

gabina commented Mar 23, 2025

We enabled the slow query log feature on the outreachdashboard server to identify and optimize queries that take excessively long to execute. Currently, the log captures queries that run for more than one minute.
The plan is to select a query, analyze it, optimize its performance, and submit a pull request with the improvements.

  • Alert monitor queries.
SELECT articles_courses.* FROM articles_courses INNER JOIN articles ON articles.id = articles_courses.article_id WHERE articles.title IN ('1946_Manchester_Borough_Council_election', ...) AND articles.wiki_id = 1;

Done in PR #6260

  • AverageViewsImporter#update_outdated_average_views query
# Time: 250319 19:09:17
# User@Host: outreachdashboard[outreachdashboard] @ peony-sidekiq-medium.globaleducation.eqiad1.wikimedia.cloud [172.16.6.194]
# Thread_id: 627  Schema: dashboard  QC_hit: No
# Query_time: 159.906835  Lock_time: 0.000077  Rows_sent: 85  Rows_examined: 15843645
# Rows_affected: 0  Bytes_sent: 10693
SET timestamp=1742411357;
SELECT DISTINCT `articles`.* FROM `articles` INNER JOIN `articles_courses` ON `articles`.`id` = `articles_courses`.`article_id` WHERE `articles_courses`.`course_id` = 22647 AND (`articles`.`average_views_updated_at` IS NULL OR average_views_updated_at < '2025-03-05 19:09:17.908465') ORDER BY `articles`.`id` ASC LIMIT 200;
  • This query appears to calculate all the data for the Miscellanea campaign. We should investigate further where it comes from (see the sketch after this list).
# Time: 250324 18:07:04
# User@Host: outreachdashboard[outreachdashboard] @ peony-web.globaleducation.eqiad1.wikimedia.cloud [172.16.6.223]
# Thread_id: 1191  Schema: dashboard  QC_hit: No
# Query_time: 164.954211  Lock_time: 0.000501  Rows_sent: 122535  Rows_examined: 688342
# Rows_affected: 0  Bytes_sent: 396367704
SET timestamp=1742839624;
SELECT `courses_users`.`id` AS t0_r0, `courses_users`.`created_at` AS t0_r1, `courses_users`.`updated_at` AS t0_r2, `courses_users`.`course_id` AS t0_r3, `courses_users`.`user_id` AS t0_r4, `courses_users`.`character_sum_ms` AS t0_r5, `courses_users`.`character_sum_us` AS t0_r6, `courses_users`.`revision_count` AS t0_r7, `courses_users`.`assigned_article_title` AS t0_r8, `courses_users`.`role` AS t0_r9, `courses_users`.`recent_revisions` AS t0_r10, `courses_users`.`character_sum_draft` AS t0_r11, `courses_users`.`real_name` AS t0_r12, `courses_users`.`role_description` AS t0_r13, `courses_users`.`total_uploads` AS t0_r14, `courses_users`.`references_count` AS t0_r15, `users`.`id` AS t1_r0, `users`.`username` AS t1_r1, `users`.`created_at` AS t1_r2, `users`.`updated_at` AS t1_r3, `users`.`trained` AS t1_r4, `users`.`global_id` AS t1_r5, `users`.`remember_created_at` AS t1_r6, `users`.`remember_token` AS t1_r7, `users`.`wiki_token` AS t1_r8, `users`.`wiki_secret` AS t1_r9, `users`.`permissions` AS t1_r10, `users`.`real_name` AS t1_r11, `users`.`email` AS t1_r12, `users`.`onboarded` AS t1_r13, `users`.`greeted` AS t1_r14, `users`.`greeter` AS t1_r15, `users`.`locale` AS t1_r16, `users`.`chat_password` AS t1_r17, `users`.`chat_id` AS t1_r18, `users`.`registered_at` AS t1_r19, `users`.`first_login` AS t1_r20, `courses`.`id` AS t2_r0, `courses`.`title` AS t2_r1, `courses`.`created_at` AS t2_r2, `courses`.`updated_at` AS t2_r3, `courses`.`start` AS t2_r4, `courses`.`end` AS t2_r5, `courses`.`school` AS t2_r6, `courses`.`term` AS t2_r7, `courses`.`character_sum` AS t2_r8, `courses`.`view_sum` AS t2_r9, `courses`.`user_count` AS t2_r10, `courses`.`article_count` AS t2_r11, `courses`.`revision_count` AS t2_r12, `courses`.`slug` AS t2_r13, `courses`.`subject` AS t2_r14, `courses`.`expected_students` AS t2_r15, `courses`.`description` AS t2_r16, `courses`.`submitted` AS t2_r17, `courses`.`passcode` AS t2_r18, `courses`.`timeline_start` AS t2_r19, `courses`.`timeline_end` AS t2_r20, `courses`.`day_exceptions` AS t2_r21, `courses`.`weekdays` AS t2_r22, `courses`.`new_article_count` AS t2_r23, `courses`.`no_day_exceptions` AS t2_r24, `courses`.`trained_count` AS t2_r25, `courses`.`cloned_status` AS t2_r26, `courses`.`type` AS t2_r27, `courses`.`upload_count` AS t2_r28, `courses`.`uploads_in_use_count` AS t2_r29, `courses`.`upload_usages_count` AS t2_r30, `courses`.`syllabus_file_name` AS t2_r31, `courses`.`syllabus_content_type` AS t2_r32, `courses`.`syllabus_file_size` AS t2_r33, `courses`.`syllabus_updated_at` AS t2_r34, `courses`.`home_wiki_id` AS t2_r35, `courses`.`recent_revision_count` AS t2_r36, `courses`.`needs_update` AS t2_r37, `courses`.`chatroom_id` AS t2_r38, `courses`.`flags` AS t2_r39, `courses`.`level` AS t2_r40, `courses`.`private` AS t2_r41, `courses`.`withdrawn` AS t2_r42, `courses`.`references_count` AS t2_r43 FROM `courses_users` LEFT OUTER JOIN `users` ON `users`.`id` = `courses_users`.`user_id` LEFT OUTER JOIN `courses` ON `courses`.`id` = `courses_users`.`course_id` WHERE `courses_users`.`course_id` IN (SELECT `courses`.`id` FROM `courses` INNER JOIN `campaigns_courses` ON `courses`.`id` = `campaigns_courses`.`course_id` WHERE `campaigns_courses`.`campaign_id` = 2 AND `courses`.`private` = FALSE) AND `courses_users`.`role` = 0 ORDER BY `courses_users`.`revision_count` DESC;
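As a starting point for that investigation: the t0_r0 / t1_r0 / t2_r0 column aliases are the signature of ActiveRecord eager loading. A relation of roughly the following shape would generate SQL like the query above; this is a sketch, and the model, association, and scope names are assumptions about the dashboard's code.

# Hypothetical relation matching the logged SQL: campaign 2's non-private
# courses, student rows (role = 0), users and courses eager-loaded,
# sorted by revision count.
CoursesUsers
  .eager_load(:user, :course)                     # LEFT OUTER JOINs + t0_r0-style aliases
  .where(course_id: Course.joins(:campaigns_courses)
                          .where(campaigns_courses: { campaign_id: 2 })
                          .where(private: false)) # IN (SELECT courses.id ...)
  .where(role: 0)
  .order(revision_count: :desc)

Searching the codebase for an eager_load (or includes) of users and courses on CoursesUsers, filtered by campaign and role, should locate the call site.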
@gabina added the performance and ruby labels Mar 23, 2025
@Abishekcs
Contributor

Working on this.

@Abishekcs
Contributor

Update: I’ve found a solution for the second issue. I’ll explain the approach later and then open a PR. Currently testing a few more things.

@Abishekcs
Contributor

My local DB size:

  • articles table size: 147,338 rows
  • articles_courses table size: 64,764 rows
  • Current SQL query time: 200 rows in set (0.601 sec)

Image

  • Optimized SQL query time: 200 rows in set (0.002 sec)

Image

A detailed explanation and the respective SQL queries are given below.


Query Optimization Report

Why the Current SQL Query is Slow

Current SQL Query:

SELECT DISTINCT articles.*  
FROM articles  
INNER JOIN articles_courses  
ON articles.id = articles_courses.article_id  
WHERE articles_courses.course_id = 22647  
AND (articles.average_views_updated_at IS NULL  
     OR articles.average_views_updated_at < '2025-03-05')  
ORDER BY articles.id ASC  
LIMIT 200;

Issues Identified:

1. Index Misuse

  • The WHERE clause (line 5 of Current SQL Query) searches the articles_courses table using course_id, which has an index.

  • However, the EXPLAIN EXTENDED statement (screenshot 1) shows that the query is actually using index_articles_courses_on_article_id instead (this is reproducible from the Rails console; see the sketch after this section).

- screenshot1

Image

  • This happens because of the cardinality differences between course_id and article_id, as shown in screenshot 2.

- screenshot2

Image
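The planner's index choice can be reproduced from a Rails console with ActiveRecord's built-in explain. A sketch, assuming Article has_many :articles_courses as in the dashboard's schema; course id 22647 and the two-week cutoff mirror the slow-query log above.

# Sketch: print the EXPLAIN plan for the importer's query.
puts Article.joins(:articles_courses)
            .where(articles_courses: { course_id: 22647 })
            .where('articles.average_views_updated_at IS NULL OR articles.average_views_updated_at < ?',
                   2.weeks.ago)
            .distinct
            .order(:id)
            .limit(200)
            .explain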

2. Temporary Table Usage

  • The EXPLAIN statement (screenshot 1) also shows Using temporary, which is caused by DISTINCT in line 1.

  • This extra processing step can be avoided using a composite index, which will also remove the need for ORDER BY since the index is already sorted.

3. Cardinality & Sorting Issue

  • screenshot 2 reveals that the cardinality of course_id is very low compared to article_id.

  • MariaDB therefore chooses the article_id index instead of course_id, which leads to the server reading many rows that are not required (see the console sketch below).
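The cardinality numbers from screenshot 2 can also be pulled from a Rails console; a minimal sketch, assuming a MySQL/MariaDB connection:

# Sketch: list index cardinalities for articles_courses (as in SHOW INDEX).
ActiveRecord::Base.connection
                  .select_all('SHOW INDEX FROM articles_courses')
                  .each do |row|
  puts [row['Key_name'], row['Column_name'], row['Cardinality']].join("\t")
end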

4. Inefficient Loop Execution

  • screenshot 3 (ANALYZE statement) shows:

- screenshot3

Image

  • Estimated rows (rows): 32,311

  • Actual rows read (r_rows): 64,764 (bad performance)

  • screenshot 4 (ANALYZE FORMAT=JSON) shows:

- screenshot4

Image

  • Outer loop (r_loops for articles_courses) = 1

  • Inner loop (r_loops for articles) = 62,897 (very inefficient)

  • The query loops 1 × 62,897 times, but in the optimized version it runs only 1 × 200 times thanks to the composite index (the ANALYZE runs can be reproduced from Rails, as sketched below).
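The ANALYZE runs from screenshots 3 and 4 can be reproduced through the ActiveRecord connection; a sketch, assuming a MariaDB server (the ANALYZE statement with FORMAT=JSON is MariaDB-specific):

# Sketch: ask MariaDB for actual-vs-estimated row counts (r_rows, r_loops).
sql = <<~SQL
  SELECT DISTINCT articles.* FROM articles
  INNER JOIN articles_courses ON articles.id = articles_courses.article_id
  WHERE articles_courses.course_id = 22647
    AND (articles.average_views_updated_at IS NULL
         OR articles.average_views_updated_at < '2025-03-05')
  ORDER BY articles.id ASC LIMIT 200
SQL
puts ActiveRecord::Base.connection.select_value("ANALYZE FORMAT=JSON #{sql}")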


Solution: Use a Composite Index on (course_id, article_id)

Optimized SQL Query:

SELECT articles.*  
FROM articles  
INNER JOIN articles_courses  
ON articles.id = articles_courses.article_id  
WHERE articles_courses.course_id = 10027  
AND (articles.average_views_updated_at IS NULL  
     OR articles.average_views_updated_at < '2025-03-05')  
LIMIT 200;
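For local benchmarking, one way to create the composite index (named idx_course_article below, to match the screenshots) is directly through the connection; in the app itself this belongs in a migration, sketched further down in this thread.

# Sketch: create the composite index for local testing.
# Equivalent DDL:
#   ALTER TABLE articles_courses
#     ADD INDEX idx_course_article (course_id, article_id);
ActiveRecord::Base.connection.add_index(
  :articles_courses, %i[course_id article_id], name: 'idx_course_article'
)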

Why This Works:

  1. Efficient Index Usage

    • MySQL now uses the composite index (idx_course_article).

    • screenshot 5 shows high cardinality for the combined (course_id, article_id) index (seq_in_index 1 and 2), making lookups much faster.

    - screenshot5

Image

  2. Ref Type Instead of Range

    • screenshot 6 shows that the access type changed from range to ref, meaning MySQL can quickly find matching rows instead of scanning.

- screenshot6

Image

Below is a rough sketch of how this happens in the B-tree.

Image

  • Ref type is better:

    • Before: MySQL scanned a range of index entries (access type range).

    • After: it finds matching rows directly using the composite index (access type ref).

3. Avoids Sorting & Temporary Tables

  • ORDER BY is no longer needed since id is already sorted in the composite index.

  • DISTINCT is also no longer needed, as duplicate values are avoided.

4. Huge Performance Improvement

  • screenshot 7 (ANALYZE statement) now shows:

- screenshot7

Image

  • Estimated rows (rows): 32,311

  • Actual rows read (r_rows): 200 (big improvement!)

  • screenshot 8 (ANALYZE FORMAT=JSON) now shows:

- screenshot8

Image

  • Outer loop (r_loops for articles_courses) = 1

  • Inner loop (r_loops for articles) = 200 (much better than 62,897!)

@Abishekcs
Contributor

Abishekcs commented Mar 26, 2025

Hi @gabina, whenever you have time, could you please take a look at the Query Optimization Report? I’d really appreciate your feedback and would love to know if this will be helpful. Thank you!

@ragesoss
Member

@Abishekcs thanks! This looks very promising. Is there a straightforward way to implement it?

@Abishekcs
Contributor

@Abishekcs thanks! This looks very promising. Is there a straightforward way to implement it?

  • A composite index on course_id and article_id of the articles_courses table should do it.
  • Plus updating how we query this through the Rails ORM (a sketch of both is below).
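A minimal sketch of both pieces, assuming a Rails 7-era migration class; the file name, index name, and the exact relation used by AverageViewsImporter are illustrative, not the project's actual code.

# db/migrate/XXXXXXXXXXXXXX_add_course_article_index_to_articles_courses.rb
class AddCourseArticleIndexToArticlesCourses < ActiveRecord::Migration[7.0]
  def change
    # Composite index from the report above.
    add_index :articles_courses, %i[course_id article_id],
              name: 'idx_course_article'
  end
end

On the query side, the planner should pick the new index up automatically; if the DISTINCT and ORDER BY are dropped as the report suggests, the relation would look roughly like:

# course and cutoff are placeholders for the importer's actual arguments.
Article.joins(:articles_courses)
       .where(articles_courses: { course_id: course.id })
       .where('articles.average_views_updated_at IS NULL OR articles.average_views_updated_at < ?',
              cutoff)
       .limit(200)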

@Abishekcs
Contributor

Abishekcs commented Mar 26, 2025

I tried other, simpler methods, but they became more complex and started to confuse me (also, sometimes their results were not deterministic). The composite index seems simpler, with the only trade-off being that it requires extra space and may slightly increase insertion and update time.

@ragesoss
Member

Cool. Feel free to open a PR with the migration to add the index. I will need to make a copy of the DB and test how long the migration takes with a large ArticlesCourses table, to make sure it's safe to migrate in production.

@Abishekcs
Contributor

This query appears to calculate all the data for the Miscellanea campaign. We should investigate further where it comes from.

I'll try to figure this one out too
