2023-01-06
https://mathr.co.uk/attributive-machine-learning
Machine learning algorithms often put the AI into PLAIGIARISM.
This repository contains machine learning algorithms that properly attribute the sources used for the output.
Browse at https://code.mathr.co.uk/attributive-machine-learning.
Download with git:
git clone https://code.mathr.co.uk/attributive-machine-learning.git
Code is implemented in the C, Haskell, JavaScript, Lua, and Python programming languages, with support code in Bash and Make.
A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
The chain is built by analysing a source corpus to construct the probability tables for the next token in each state (determined by the previous tokens). Generation starts from a prompt; each following token is determined by weighted random choice given the current context.
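As an illustration of the idea (a minimal sketch in Python, not the code in this repository), the tables can be stored as counts per context and sampled with weighted random choice. The names build_chain and generate are hypothetical, and the tiny corpus is made up for the example.

```python
import random
from collections import Counter, defaultdict

def build_chain(corpus, order):
    """Count, for each context of `order` characters, how often each next character follows."""
    chain = defaultdict(Counter)
    for text in corpus:
        for i in range(len(text) - order):
            context = text[i:i + order]
            chain[context][text[i + order]] += 1
    return chain

def generate(chain, prompt, length, rng):
    """Extend `prompt` by up to `length` characters using weighted random choice."""
    order = len(prompt)  # the prompt doubles as the initial context
    output = prompt
    for _ in range(length):
        counts = chain.get(output[-order:])
        if not counts:  # dead end: nothing ever followed this context
            break
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        output += rng.choices(tokens, weights=weights)[0]
    return output

# Made-up corpus of two "files", order 2, prompt of length 2.
chain = build_chain(["abracadabra", "abrasive"], order=2)
text = generate(chain, "ab", 20, random.Random(0))
```

Counts are enough here because weighted choice normalises them implicitly; explicit probabilities are never needed.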
The source corpus is made of many files. Attribution takes the form of listing how much each source file influenced the choice of each output token. Selecting text in the output HTML shows the corresponding attribution (requires JavaScript).
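One way to picture per-token attribution is to keep the counts separately for each source file, so that every (context, next token) pair remembers where it was seen. This is a hedged Python sketch under that assumption; the exact weighting used by the repository may differ, and the function and file names below are hypothetical.

```python
from collections import defaultdict

def build_attributed_chain(sources, order):
    """Map context -> next token -> source name -> occurrence count."""
    chain = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for name, text in sources.items():
        for i in range(len(text) - order):
            chain[text[i:i + order]][text[i + order]][name] += 1
    return chain

def attribution(chain, context, token):
    """Share of the observations of `token` after `context` that came from each source.

    Only meaningful for pairs that were actually observed in the corpus.
    """
    per_source = chain[context][token]
    total = sum(per_source.values())
    return {name: count / total for name, count in per_source.items()}

sources = {"a.txt": "abracadabra", "b.txt": "abrasive"}  # hypothetical file names
chain = build_attributed_chain(sources, order=2)
shares = attribution(chain, "ab", "r")  # two thirds from a.txt, one third from b.txt
```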
This implementation uses tokens of a single character (Unicode code points for the Haskell version, bytes for the Lua version). The Lua version is much faster than the Haskell version and uses much less memory.
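The two token definitions differ for any non-ASCII text, so the same order covers slightly different amounts of text in the two versions. A small Python illustration (the sample word is arbitrary):

```python
text = "na\u00efve"  # "naïve", with a precomposed i-diaeresis

code_points = list(text)                  # one token per Unicode code point
byte_tokens = list(text.encode("utf-8"))  # one token per UTF-8 byte

n_code_points = len(code_points)  # 5
n_bytes = len(byte_tokens)        # 6, because U+00EF takes two bytes in UTF-8
```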
source corpus github.com/agraef/pd-lua/examples/*.pd_lua at commit f07953b4f7586d936e57a437ed9f66af8240a839, prompt pd.Class:new: examples/pd-lua-examples.html.
source corpus The Adventures Of Sherlock Holmes (12 stories), order 8, censored to remove references to Sherlock and Holmes, prompt tective.: examples/the-case-of-the-missing-detective.html
Lua version:
First generate chain.lua from sources (order is an integer, e.g. 8):
lua attributive-markov-chain.lua build chain.lua order source ...
Then optionally censor it (modifies in place) to forbid certain words or phrases (censor order can be smaller than build order):
lua attributive-markov-chain.lua censor chain.lua order forbidden ...
And prune it (modifies in place) to remove dead ends:
lua attributive-markov-chain.lua prune chain.lua
Finally generate text (the prompt should be the same length as the order used to build the chain):
lua attributive-markov-chain.lua generate chain.lua prompt > output.html
Also works with luajit.
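The censor and prune steps can be pictured as operations on the transition table. The following is a rough Python sketch, not the Lua implementation: it assumes a table of the form context -> next token -> count, drops transitions that would spell a forbidden string, and then removes dead ends until none are left. It ignores the censor order argument, whose exact semantics are not described here, and the example table is made up.

```python
def censor(chain, forbidden):
    """Drop every transition whose context plus next token contains a forbidden string."""
    for context in list(chain):
        for token in list(chain[context]):
            if any(word in context + token for word in forbidden):
                del chain[context][token]

def prune(chain):
    """Repeatedly drop transitions into contexts with no way out, until none remain."""
    changed = True
    while changed:
        changed = False
        for context in list(chain):
            for token in list(chain[context]):
                successor = (context + token)[1:]  # context reached after emitting token
                if not chain.get(successor):       # missing or empty: a dead end
                    del chain[context][token]
                    changed = True
            if not chain[context]:
                del chain[context]
                changed = True

# Made-up order-2 table with two loops: ab -> bc -> ca -> ab and ab -> bd -> da -> ab.
chain = {
    "ab": {"c": 2, "d": 1},
    "bc": {"a": 2},
    "ca": {"b": 2},
    "bd": {"a": 1},
    "da": {"b": 1},
}
censor(chain, ["da"])  # empties the "bd" and "da" contexts
prune(chain)           # removes them, and the "ab" -> "d" transition that led there
```

Pruning has to loop because removing one dead end can turn its predecessors into dead ends too.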
Haskell version:
make attributive-markov-chain
./attributive-markov-chain "prompt" source ... > output.html
You may want to cd to the directory containing your sources first, otherwise long path names may be included in the output (causing both size and privacy issues). The generated output.html expects the JavaScript file attributive-markov-chain.js to be adjacent to it.
Note: sources must be UTF-8 text; you may use iconv to convert encodings.
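If iconv is not to hand, the same conversion can be sketched in Python. This assumes you know the original encoding of the file; the helper name to_utf8 and the sample bytes are hypothetical.

```python
def to_utf8(data, source_encoding):
    """Return `data` as UTF-8 bytes, leaving it unchanged if it is already valid UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode(source_encoding).encode("utf-8")

latin1_bytes = "caf\u00e9".encode("latin-1")   # b'caf\xe9', which is not valid UTF-8
utf8_bytes = to_utf8(latin1_bytes, "latin-1")  # b'caf\xc3\xa9'
```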
Artificial neural networks are computing systems inspired by biological brains. A neural network is based on a collection of connected neurons, each of which can transmit a signal to other neurons. The signal at a connection is a real number, and the output of each neuron is computed by some non-linear function of the weighted sum of its inputs. The weights, which are adjusted during learning, increase or decrease the strength of the signals arriving at the neurons. Typically, neurons are aggregated into layers. Signals travel from the input layer, to the output layer, possibly through multiple hidden layers in between.
– adapted from https://en.wikipedia.org/wiki/Artificial_neural_network
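The description above can be made concrete with a tiny forward pass in plain Python. This is only a sketch of the general idea, not the network in attributive-neural-network.py: the choice of tanh as the non-linear function, the layer sizes, and the weights are all assumptions made for the example.

```python
import math

def layer(inputs, weights):
    """One layer: each neuron applies tanh to the weighted sum of its inputs."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs, layers):
    """Pass a signal from the input layer through each following layer in turn."""
    signal = inputs
    for weights in layers:
        signal = layer(signal, weights)
    return signal

# Made-up weights: one row per neuron, one column per input.
network = [
    [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]],  # hidden layer: 2 inputs -> 3 neurons
    [[1.0, -1.0, 0.5]],                     # output layer: 3 inputs -> 1 neuron
]
output = forward([1.0, 1.0], network)
```

Learning would adjust the numbers in network; the forward pass itself stays the same.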
Download the data set I created from: data/genres.data.gz
Python version:
gunzip -k data/genres.data.gz
python3 attributive-neural-network.py --train data/genres.data genres/
python3 attributive-neural-network.py --classify genres/network.npz data/genres.data > genres.html
To generate your own data set, you need rhythm-analysis from Disco (browse at https://code.mathr.co.uk/disco), which is assumed to be findable via the system PATH environment variable:
git clone https://code.mathr.co.uk/disco.git
make -C disco rhythm-analysis
You also need a collection of music in MP3 format. Tracks should be stored with paths like label/release/track.mp3. The tracks of each release are decoded to perceptually normalized WAV with ffmpeg, and analysed with rhythm-analysis. The results are concatenated into output files per genre, which are then randomly sampled into a smaller corpus. This is all orchestrated by data/genres.sh.
If you don’t have a collection of music, you can download some from the Internet Archive netlabels collections using unarchive_collection from Unarchive (homepage at https://mathr.co.uk/unarchive).
Attributive Machine Learning
Copyright (C) 2022,2023 Claude Heiland-Allen
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.