Consider using github-to-sqlite to grab our activity dataset

### Context

This tool basically works in two steps.

- In step 1, we grab as much data as we can about issues/PRs/comments/etc using the GitHub API. This is done with:
  - This function: https://github.com/executablebooks/github-activity/blob/c149ba057f813ae6b632ac47b4f82d2f72883e98/github_activity/github_activity.py#L64-L95
  - And this GraphQL query: https://github.com/executablebooks/github-activity/blob/master/github_activity/graphql.py
- In step 2, we parse the resulting data, munge it, and output markdown, statistics, etc.
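To make the split concrete, here is a minimal sketch of what step 2 could look like once step 1 has produced a list of records. The field names (`number`, `title`, `author`, `closed_at`) and both function names are hypothetical and chosen for illustration; they are not the actual structures used in this package.

```python
from datetime import datetime


def filter_closed(records, since, until):
    """Keep records whose (hypothetical) `closed_at` lies within [since, until]."""
    return [
        r
        for r in records
        if r["closed_at"] is not None and since <= r["closed_at"] <= until
    ]


def to_markdown(records):
    """Render one markdown bullet per record."""
    lines = [f"- {r['title']} #{r['number']} (@{r['author']})" for r in records]
    return "\n".join(lines)
```

The point is only that step 2 does not need to know how the records were fetched, so the fetching code could be swapped out without touching it.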

However, the functionality in step 1 is kind of hacky and messy, and hard to reason about.

I recently came across a tool recommended by @simonw, which essentially replicates all of this functionality but with a better-structured and better-maintained implementation:

- https://github.com/dogsheep/github-to-sqlite

This is a Python library that will grab **all** of the issues, pull requests, and comments (among other things) from a repository and store them in a local SQLite database so that you can do what you want with them. The tables are structured to work with [datasette](https://datasette.io/) as well (though we may not have use for that in this package, just FYI).

Two questions that I don't know the answer to:

- How to speed it up. I'm not sure whether github-to-sqlite does any _caching_ or allows you to _filter by date_. If not, then it might take quite a long time to run this interactively.
- How to run it via a Python API. All the examples use a CLI, and while this is probably fine, it would be nice if we could grab / update datasets by running this as part of other scripts.
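As a rough sketch of one way around both questions: the CLI could be called from Python with `subprocess`, and the date filtering could happen locally on the resulting database. This assumes the `github-to-sqlite` command is installed and on the `PATH`, and that the database ends up with an `issues` table containing `number`, `title`, and an ISO-8601 `updated_at` column; the schema and both function names are assumptions for illustration and would need to be checked against a real database.

```python
import sqlite3
import subprocess


def fetch_issues(db_path, repo):
    """Load a repository's issues into db_path by shelling out to the CLI."""
    # Equivalent to running: github-to-sqlite issues <db_path> <owner/repo>
    subprocess.run(["github-to-sqlite", "issues", db_path, repo], check=True)


def issues_updated_between(conn, since, until):
    """Return (number, title, updated_at) rows with since <= updated_at < until.

    Assumes `updated_at` holds ISO-8601 strings, which sort chronologically
    when compared as text.
    """
    query = (
        "SELECT number, title, updated_at FROM issues "
        "WHERE updated_at >= ? AND updated_at < ? "
        "ORDER BY updated_at"
    )
    return conn.execute(query, (since, until)).fetchall()
```

Usage would be something like `fetch_issues("github.db", "executablebooks/github-activity")` followed by `issues_updated_between(sqlite3.connect("github.db"), "2021-01-01", "2021-02-01")`. Note that this only filters after the fetch, so it does not by itself answer whether the download can be limited to a date range.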

### Proposal

What do folks think about re-using `github-to-sqlite` for our "grab all of the activity in a repository" step, and focusing *this* repository on the munging / filtering by date / calculating statistics / generating markdown aspects?

I think this might be a nice way to reduce some unnecessary complexity here and to re-use code from others in the ecosystem. I also like the idea of becoming familiar with datasette structures, as it opens the possibility that we could expose this kind of data in the future for others in the community to munge and use.

At this point I'm just exploring the idea and curious what others think!

### Tasks and updates

_No response_

	def get_activity(
	    target, since, until=None, repo=None, kind=None, auth=None, cache=None
	):
	    """Return issues/PRs within a date window.

	    Parameters
	    ----------
	    target : string
	        The GitHub organization/repo for which you want to grab recent issues/PRs.
	        Can either be just an organization (e.g., `jupyter`) or a combination
	        organization and repo (e.g., `jupyter/notebook`). If the former, all
	        repositories for that org will be used. If the latter, only the specified
	        repository will be used.
	    since : string | None
	        Return issues/PRs with activity since this date or git reference. Can be
	        any string that is parsed with dateutil.parser.parse.
	    until : string | None
	        Return issues/PRs with activity until this date or git reference. Can be
	        any string that is parsed with dateutil.parser.parse. If none, today's
	        date will be used.
	    kind : ["issue", "pr"] | None
	        Return only issues or PRs. If None, both will be returned.
	    auth : string | None
	        An authentication token for GitHub. If None, then the environment
	        variable `GITHUB_ACCESS_TOKEN` will be tried. If it does not exist,
	        then attempt to infer a token from `gh auth status -t`.
	    cache : bool | str | None
	        Whether to cache the returned results. If None, no caching is
	        performed. If True, the cache is located at
	        ~/github_activity_data. It is organized as orgname/reponame folders
	        with CSV files inside that contain the latest data. If a string it
	        is treated as the path to a cache folder.
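
For reference, the `target` and `cache` conventions described in the docstring above could be handled roughly as follows. This is a sketch written from the docstring alone, not the code in the repository; `parse_target` and `resolve_cache_dir` are hypothetical helper names.

```python
from pathlib import Path


def parse_target(target):
    """Split 'org' or 'org/repo' into (org, repo); repo is None for org-only targets."""
    org, _, repo = target.partition("/")
    return org, (repo or None)


def resolve_cache_dir(cache):
    """Map the `cache` argument to a folder, or None when caching is disabled."""
    if cache is None or cache is False:
        return None
    if cache is True:
        # Default location named in the docstring
        return Path.home() / "github_activity_data"
    # Any string is treated as the path to a cache folder
    return Path(cache)
```

Anything that replaces step 1 would need to keep accepting these inputs, or the public signature would have to change.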

Consider using github-to-sqlite to grab our activity dataset #76

Context

Proposal

Tasks and updates

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Consider using github-to-sqlite to grab our activity dataset #76

Description

Context

Proposal

Tasks and updates

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions